Parse JSON-LD in HTML

dantman commented 5 years ago

This library appears to support fetching JSON-LD over HTTP when the whole response is JSON-LD and the application/ld+json content type is used. However, in the real world a lot of the JSON-LD on the web is embedded in HTML pages as script tags. I think it would be worthwhile to support this type of linked data.

See https://developers.google.com/search/docs/guides/intro-structured-data for a code example.

For a real-world example, look at the source of https://www.apple.com/, where you'll find:

<script type="application/ld+json">
    {
        "@context": "http://schema.org",
        "@id": "https://www.apple.com/#organization",
        "@type": "Organization",
        "name": "Apple",
        "url": "https://www.apple.com/",
        "logo": "https://www.apple.com/ac/structured-data/images/knowledge_graph_logo.png?201809210816",
        "contactPoint": [
            {
                "@type": "ContactPoint",
                "telephone": "+1-800-692-7753",
                "contactType": "sales",
                "areaServed": "US"
            },
            {
                "@type": "ContactPoint",
                "telephone": "+1-800-275-2273",
                "contactType": "technical support",
                "areaServed": "US",
                "availableLanguage": ["EN", "ES"] 
            },
            {
                "@type": "ContactPoint",
                "telephone": "+1-800-275-2273",
                "contactType": "customer support",
                "areaServed": "US",
                "availableLanguage": ["EN", "ES"] 
            }
        ],
        "sameAs": [
            "http://www.wikidata.org/entity/Q312",
            "https://www.youtube.com/user/Apple",
            "https://www.linkedin.com/company/apple",
            "https://www.facebook.com/Apple",
            "https://www.twitter.com/Apple"
        ]
    }
</script>

Side note: trying to fetch https://www.apple.com dumps a bunch of "TypeError: callback is not a function" errors from N3Parser.tripleCallback.

timbl commented 5 years ago

Could be interesting to add it to the RDFa parser? Or a separate parser? Presumably one could parse inline Turtle as well.

dantman commented 5 years ago

I'd bet that the RDFa parser does not have a dependency on a JSON-LD parser. But rdflib does and already converts JSON-LD to the format we need.

My expectation is that the optimal approach, in the text/html condition, is this: in addition to passing the HTML to the RDFa parser, parse it into a basic DOM with a parser we already have, scan it for <script> tags, and parse the contents of any type="application/ld+json" scripts with the JSON-LD parser. This may involve parsing the HTML twice. If we want to avoid that, then instead of making the RDFa parser handle non-RDFa content, we should make it accept a pre-parsed DOM rather than only HTML strings.

This could, of course, be extended to inline Turtle or inline versions of any other format rdflib supports.
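
As a rough illustration of the scan described above, here is a minimal sketch in plain JavaScript. It is not rdflib.js code: the names extractInlineData, extractJsonLd and INLINE_RDF_TYPES are made up for this example, and a regular expression stands in for the DOM walk only so the snippet runs without dependencies. A real implementation would reuse an HTML parser, since a regex can be fooled by unusual markup (for example a ">" inside an attribute value).

```javascript
// Sketch only: hypothetical helper names, not part of rdflib.js.
// Media types of inline data blocks we want to pick up. Assumption: more
// types (for example "text/turtle") could be added to this set later.
const INLINE_RDF_TYPES = new Set(["application/ld+json"]);

// Return [{ type, text }] for every <script> whose type is in `types`.
// A regex stands in for a real DOM scan here (see the caveat above).
function extractInlineData(html, types = INLINE_RDF_TYPES) {
  const scriptRe = /<script\b([^>]*)>([\s\S]*?)<\/script\s*>/gi;
  const typeRe = /(?:^|\s)type\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+))/i;
  const blocks = [];
  let match;
  while ((match = scriptRe.exec(html)) !== null) {
    const typeMatch = typeRe.exec(match[1]);
    if (!typeMatch) continue; // no type attribute: ordinary script, skip it
    const raw = typeMatch[1] ?? typeMatch[2] ?? typeMatch[3];
    // Drop media type parameters such as ";profile=..." before comparing.
    const type = raw.split(";")[0].trim().toLowerCase();
    if (types.has(type)) blocks.push({ type, text: match[2].trim() });
  }
  return blocks;
}

// Parse each JSON-LD block as JSON; malformed blocks are skipped so that one
// bad script does not abort the whole page.
function extractJsonLd(html) {
  const docs = [];
  for (const { type, text } of extractInlineData(html)) {
    if (type !== "application/ld+json") continue;
    try {
      docs.push(JSON.parse(text));
    } catch (err) {
      // Ignore invalid JSON in this sketch.
    }
  }
  return docs;
}

const sample = `<html><head>
<script type="application/ld+json">
  { "@context": "http://schema.org", "@type": "Organization", "name": "Apple" }
</script>
</head><body></body></html>`;

console.log(extractJsonLd(sample));
```

Each object returned by extractJsonLd would then be handed to whatever JSON-LD parser the library already uses, with the page URL as the base IRI.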

csarven commented 5 years ago

I don't think the RDFa parser is particularly useful for extracting and parsing JSON-LD, N3, Turtle, TriG, etc. in HTML documents that embed them through the script extension mechanism. So I agree that this would require a second parse. Perhaps a flag could turn it on or off.
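
To make the flag idea concrete, here is a small sketch of how the text/html branch might decide which parsing passes to run. Everything in it is hypothetical: the option name parseInlineScripts, the function parsingPassesFor and the pass labels are invented for illustration and do not correspond to existing rdflib.js options.

```javascript
// Sketch only: hypothetical option and pass names, not rdflib.js API.
// Given a Content-Type header value, list the parsing passes to run.
function parsingPassesFor(contentType, { parseInlineScripts = true } = {}) {
  // Strip parameters such as "; charset=utf-8" and normalise case.
  const mediaType = contentType.split(";")[0].trim().toLowerCase();
  if (mediaType !== "text/html") {
    // Non-HTML responses go straight to the parser for their own type.
    return [mediaType];
  }
  const passes = ["rdfa"]; // HTML always gets the RDFa pass
  if (parseInlineScripts) {
    passes.push("inline-scripts"); // second pass over <script> data blocks
  }
  return passes;
}

console.log(parsingPassesFor("text/html; charset=utf-8"));
console.log(parsingPassesFor("text/html", { parseInlineScripts: false }));
```

Defaulting the flag to on is an assumption made for this sketch; defaulting it to off would keep current behaviour unchanged for existing users.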

dmitrizagidulin commented 5 years ago

+1, this would be a really helpful mechanism.

linkeddata / rdflib.js

Parse JSON-LD in HTML #344