structured-data / linter

Structured Data linter

problem with item-number and/or file-size #46

Closed jaygray0919 closed 6 years ago

jaygray0919 commented 6 years ago

Greg, would you take a look at this gist: https://gist.github.com/jaygray0919/4276ce845f53495ff73012faad4cda37

SDL seems to bail out on the 21st script (from the top). It properly handles supersededBy until it reaches references to later scripts (those below the 21st).

Have checked this on GSDTT and it's valid - but GSDTT does not generate the hierarchy that is generated by SDL (an important 'customer education' feature that we want to emphasize).

In the past, we've raised issues with SDL and file size. We plan to include SDL links in specific web pages and hope to link to pages with larger data sets than in the test gist.

/jay

gkellogg commented 6 years ago

I’m afraid it’s going to be a while before I can get to this

jaygray0919 commented 6 years ago

thanks Greg. When convenient ...

gkellogg commented 6 years ago

Looks like the problem is that the description field on T020 has a value with unescaped embedded quotes:

"description":"An abnormal structure, or one that is abnormal in size or location, found in or deriving from a previously normal structure. Acquired abnormalities are distinguished from diseases even though they may result in pathological functioning (e.g., "hernias incarcerate")."
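
To illustrate the failure mode described above, here is a minimal sketch in Python (not part of the linter, which is written in Ruby). The shortened description string and the helper name is_valid_json are assumptions made for illustration. It shows that a hand-assembled JSON string with unescaped inner quotes fails to parse, while a serializer escapes them automatically.

```python
import json

# Shortened, illustrative description containing embedded double quotes.
description = 'may result in pathological functioning (e.g., "hernias incarcerate").'

# JSON assembled by string concatenation: the inner quotes stay unescaped.
broken = '{"description":"' + description + '"}'


def is_valid_json(text):
    """Return True if text parses as JSON, False otherwise (hypothetical helper)."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# Serializing with json.dumps escapes each embedded quote as \" for us.
fixed = json.dumps({"description": description})

print(is_valid_json(broken))
print(is_valid_json(fixed))
```

Generating the script bodies with a JSON serializer rather than string concatenation would likely avoid this class of error at the source.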

jaygray0919 commented 6 years ago

Thanks for looking at the issue. Here's something I've never seen before: the gist lost the escapes when I pasted the source into it. Here is the source for that item:

<script type="application/ld+json" id="T020">{"@context":"http://schema.org/","@type":"Property","@id":"http://purl.bioontology.org/ontology/STY/T020","name":"Acquired Abnormality","description":"An abnormal structure, or one that is abnormal in size or location, found in or deriving from a previously normal structure. Acquired abnormalities are distinguished from diseases even though they may result in pathological functioning (e.g., \"hernias incarcerate\").","category":{"@id":"http://dbpedia.org/page/Acquired_disorder"},"supersededBy":{"@id":"http://purl.bioontology.org/ontology/STY/T190"},"domainIncludes":[{"@id":"https://id.nlm.nih.gov/mesh/C563164"},{"@id":"https://id.nlm.nih.gov/mesh/D001036"},{"@id":"https://id.nlm.nih.gov/mesh/D003286"},{"@id":"https://id.nlm.nih.gov/mesh/D003750"}, ... ,{"@id":"https://id.nlm.nih.gov/mesh/D058225"},{"@id":"https://id.nlm.nih.gov/mesh/D060905"}]}</script>

I'll put the source in another location that does not strip the \"

jaygray0919 commented 6 years ago

Here is the source file with proper escapes for inline double quotes: https://afdsi.org/test/gkellogg_files/STY.txt

The file presents ~130 @Property statements and ~50k @Class statements, where classes are assigned to property domains. The @Property structure has a hierarchy defined using supersededBy. I will add property range values as the next step, but range values are less complicated than domain values.

gkellogg commented 6 years ago

There are HTML errors and some of the JSON is not correct, for example:

<script type="application/ld+json" id="T072">{"@context":"http://schema.org/","@type":"Property","@id":"http://purl.bioontology.org/ontology/STY/T072","name":"Physical Object","description":"An object perceptible to the sense of vision or touch.","category":{"@id":"http://dbpedia.org/page/Physical_body"},"supersededBy":{"@id":"http://purl.bioontology.org/ontology/STY/T071"},"domainIncludes":[{"@id":"https://id.nlm.nih.gov/mesh/D008393"},}</script>{"@id":"https://id.nlm.nih.gov/mesh/D019149"},}</script>{"@id":"https://id.nlm.nih.gov/mesh/D054045"},}</script>{"@id":"https://id.nlm.nih.gov/mesh/D058433"}]}</script>

I’d recommend that you wrap the set of script elements in <html><body>, at least for testing, and independently validate all emitted JSON within script tags.
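
One way to validate each script element independently is sketched below in Python, using only the standard library. The class name JsonLdExtractor and the function find_invalid_jsonld are hypothetical names chosen for this example, and the sample markup is a shortened stand-in for the real file. It reports which script ids fail to parse rather than checking anything about the vocabulary.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect (id, text) for every <script type="application/ld+json">."""

    def __init__(self):
        super().__init__()
        self.blocks = []          # list of (script id, raw script text)
        self._current_id = None
        self._in_jsonld = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "script" and attrs.get("type") == "application/ld+json":
            self._in_jsonld = True
            self._current_id = attrs.get("id")
            self._chunks = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append((self._current_id, "".join(self._chunks)))
            self._in_jsonld = False


def find_invalid_jsonld(html):
    """Return [(script id, error message)] for blocks that are not valid JSON."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    errors = []
    for script_id, text in parser.blocks:
        try:
            json.loads(text)
        except json.JSONDecodeError as exc:
            errors.append((script_id, str(exc)))
    return errors


# Shortened stand-in for the real page: one good block, one with a stray "},".
sample = (
    '<html><body>'
    '<script type="application/ld+json" id="ok">{"@type":"Property"}</script>'
    '<script type="application/ld+json" id="bad">{"domainIncludes":[{"@id":"x"},}</script>'
    '</body></html>'
)
for script_id, message in find_invalid_jsonld(sample):
    print(script_id, message)
```

This only checks JSON syntax; it says nothing about whether the HTML itself is well formed, so a separate HTML validator is still useful.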

jaygray0919 commented 6 years ago

Well that's embarrassing!

The file https://afdsi.org/test/gkellogg_files/STY.txt has been updated.

GSDTT will not process it because it exceeds their size limit. Therefore, we tested subsets to verify that the subsets are clean and that GSDTT properly mapped @Class to @Property domainIncludes. Then we recombined the subsets to produce the file above.

Per your suggestion, here is the HTML version: https://afdsi.org/test/gkellogg_files/STY.html

Using SDL: http://linter.structured-data.org/?url=https:%2F%2Fafdsi.org%2Ftest%2Fgkellogg_files%2FSTY.html the result is "No structured data detected."

Is there anything else we can do to prep this file for harvesting by SDL?

gkellogg commented 6 years ago

I think you're basically seeing the problem with parsing such large files, and might consider some alternative. Certainly, consolidating the content of the JSON-LD script elements into a single element with an array of the constituent objects would be more efficient.
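
As a rough sketch of that consolidation, the Python below merges the bodies of several JSON-LD script elements into one top-level array. The function names consolidate and as_script_element are hypothetical, and the sketch assumes each object keeps its own @context rather than sharing one.

```python
import json


def consolidate(script_texts):
    """Parse each JSON-LD script body and return a single JSON array string."""
    objects = []
    for text in script_texts:
        parsed = json.loads(text)  # raises json.JSONDecodeError on bad input
        if isinstance(parsed, list):
            # A script body may already be an array; flatten it.
            objects.extend(parsed)
        else:
            objects.append(parsed)
    return json.dumps(objects, ensure_ascii=False)


def as_script_element(script_texts):
    """Wrap the consolidated array in one script element."""
    return (
        '<script type="application/ld+json">'
        + consolidate(script_texts)
        + "</script>"
    )


bodies = [
    '{"@context":"http://schema.org/","@type":"Property","name":"A"}',
    '{"@context":"http://schema.org/","@type":"Property","name":"B"}',
]
print(as_script_element(bodies))
```

Because every body passes through json.loads first, a malformed block raises an error during consolidation instead of ending up in the output.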

Also, this is really too big for an online service (free, at that) such as the SDL to handle. Note that all components are open source, and you can get the same results using the command-line "rdf" script, which has sub-commands such as "lint", if you install the "linkeddata" gem; it does require a reasonably recent Ruby installation, but should work.

My method is really to just cut out the non-JSON-LD bits and then see what comes out. I used the W3C HTML validator, but the "rdf" command also provides options --validate and --debug, along with others that might be useful.

You can, of course, also install the "structured-data/linter" repository and run the linter locally, but that's really window dressing. Find some reasonably small repeatable example that illustrates a problem and I'm happy to spend more time on this.

jaygray0919 commented 6 years ago

Thanks for your attention and help here Greg.

We're gonna do two things: 1) find a meaningful subset of the above; 2) use a different smaller example.

We would like to avoid creating a named @graph. The reason is that our axioms are stored in a database and 'assembled' for a specific use case (e.g. the relationships among MeSH substances and STY properties).

But our goal remains the same: use schema.org @Class to specify hierarchies, and schema.org @Property to specify relationships otherwise not defined by schema.org.

SDL is one way for customers to visualize axioms, so we need to find working examples. The ontology visualizers are good (e.g. LODLive) but don't work for our applications, as we define instances of several ontologies and use the above database approach. Similarly, Roberto Garcia's 'RDF -> SVG' graph generator is good but only works on a single RDF/OWL document. So the 'axioms -> RDFa' generator in SDL, where the RDFa is the 'visualization' of an axiom-set, is unique in the market. Ivan Herman's RDFa visualizer works for HTML/RDFa but not for axiom-sets.

A highly desirable extension to SDL is an SVG-graph generator as used by RobertoG and Rhizomik. Then customers could see the 'sequence' of relationships in addition to the 'list' of relationships.

Be back to you with new/modified examples.

jaygray0919 commented 6 years ago

We've made changes. This file has been updated: https://afdsi.org/test/gkellogg_files/STY.html.

Per your advice we developed a method to convert <script type="application/ld+json"> items stored in our database to an @graph. That file is here: https://afdsi.org/test/gkellogg_files/STY.json

It's too big for SDL and GSDTT, but it is processed by JSON-LD Playground. D3 visualization is active, but Playground doesn't support zoom or have a vertical elevator bar. We have D3 templates and will experiment with processing the @graph.

We also have much smaller graphs that are processed by GSDTT but not by SDL. We can share those examples if they help your analysis.

Our goal is to use http://linter.structured-data.org/?url= on pages with schema.org/JSON-LD content so a reader can visualize the page (which is not possible with JSON-LD Playground AFAICT).

gkellogg commented 6 years ago

Please send me a link or gist to a minimal file that is failing.

gkellogg commented 6 years ago

Note that linting is a very expensive process, which involves using RDF entailment and other special rules to check that everything conforms to the vocabulary definitions, and that data values are all in range. As the size of the file grows, the complexity of the reasoning grows as well; what I believe you're dealing with is an HTTP timeout because the request is taking too long. As I mentioned previously, if you use the command-line RDF tools you get from gem install linkeddata, it should run to completion. I ran the JSON file you referenced using rdf lint STY.json (actually .jsonld, to get the appropriate reader invoked), and it completed with no messages, so your data seems good. It did take about 2 minutes 30 seconds to complete, though.

I'm closing, as this is not an issue with the linter, itself. Unfortunately, properly detecting this and providing reasonable user feedback is problematic with the current application architecture.