w3c / WebID

https://www.w3.org/groups/cg/webid
MIT License

Data islands vs. RDFa for a human- and machine-readable format #42

Closed by jacoscaz 7 months ago

jacoscaz commented 8 months ago

/chair hat off

Hi everyone. Often, particularly when it comes to formats, the discussion touches upon whether RDF data islands can be a valid alternative to RDFa when picking a format readable by both humans and machines. Let's leave aside, for a moment, the fact that data islands are not a W3C REC and let's focus on the technical side of this issue.

Now, an obligatory disclaimer: nothing in this issue is an attempt at forcing such a format upon the WebID Spec, whatever form that takes. I am, however, interested in your opinion as to the pros and cons of each.

In my humble opinion, data islands are, indeed, much friendlier than RDFa but only insofar as they can be parsed out of HTML without a full-blown DOM/HTML5 parser. To that end, the following code demonstrates a way to do so:

const html_string = `
  <html>
  <body>
  <script type="application/ld+json">
    {
      "@id": "some document"
    }
  </script>
  </body>
  </html>
`;

for (const match of html_string.matchAll(/<script[^>]*?type="application\/ld\+json"[^>]*?>(.*?)<\/script>/sig)) {
  console.log(match[1]);
}

Granted, the above is a crude, inefficient quick hack and it is incapable of supporting edge cases such as a data island that contains a </script> within a JSON-LD string literal. Nonetheless, at least in my case, the above would be more than enough functionally to consider using JSON-LD data islands rather than RDFa.
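
To make that edge case concrete, here is a small sketch. The `extractIslands` helper is a hypothetical wrapper around the regex above, not part of any library. As far as I understand it, an HTML parser would also end the script element at a literal `</script>`, so authors need to escape it anyway, and JSON permits writing the slash as `\/`.

```javascript
// Hypothetical helper: wraps the regex from the snippet above.
function extractIslands(html) {
  const pattern = /<script[^>]*?type="application\/ld\+json"[^>]*?>(.*?)<\/script>/sig;
  return Array.from(html.matchAll(pattern), (match) => match[1]);
}

// A literal "</script>" inside a JSON string ends the match early.
const unescaped = '<script type="application/ld+json">{"name": "a </script> b"}</script>';

// Escaping the slash as "\/" (valid JSON) keeps the island intact.
const escaped = '<script type="application/ld+json">{"name": "a <\\/script> b"}</script>';

const truncated = extractIslands(unescaped)[0];
const intact = extractIslands(escaped)[0];

console.log(truncated);               // {"name": "a     (cut short, not valid JSON)
console.log(JSON.parse(intact).name); // a </script> b
```

So the regex only misbehaves on input that a browser would also cut short.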

I think a state machine could be made that would be capable of quickly getting to data islands while discarding everything else and still be orders of magnitude less complex than full DOM/HTML parsing.
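
A rough sketch of what such a scanner might look like, under a few assumptions: it only distinguishes comments, script start tags and raw script content, it ignores media type parameters such as `;profile=...`, and it is nowhere near a conforming HTML tokenizer. The names `extractDataIslands` and `findTagEnd` are made up for illustration.

```javascript
// Hypothetical sketch: a tiny scanner that jumps from "<" to "<" and only
// pays attention to comments and <script> elements.
function extractDataIslands(html) {
  const islands = [];
  const closeTag = /<\/script/gi;
  let pos = 0;
  while (pos < html.length) {
    const lt = html.indexOf('<', pos);
    if (lt === -1) break;
    // COMMENT state: skip everything up to the end of the comment.
    if (html.startsWith('<!--', lt)) {
      const end = html.indexOf('-->', lt + 4);
      if (end === -1) break;
      pos = end + 3;
      continue;
    }
    // Anything that is not a <script ...> start tag is discarded.
    const isScript = html.slice(lt + 1, lt + 7).toLowerCase() === 'script'
      && /[\s\/>]/.test(html.charAt(lt + 7));
    if (!isScript) { pos = lt + 1; continue; }
    // START TAG state: find the closing ">", honouring quoted attribute values.
    const tagEnd = findTagEnd(html, lt + 7);
    if (tagEnd === -1) break;
    // SCRIPT DATA state: raw text up to the next "</script".
    closeTag.lastIndex = tagEnd + 1;
    const close = closeTag.exec(html);
    if (close === null) break;
    const startTag = html.slice(lt, tagEnd + 1);
    if (/\stype\s*=\s*(["']?)application\/ld\+json\1[\s\/>]/i.test(startTag)) {
      islands.push(html.slice(tagEnd + 1, close.index));
    }
    pos = close.index + close[0].length;
  }
  return islands;
}

// Returns the index of the ">" that ends a start tag, or -1 if there is none.
function findTagEnd(html, from) {
  let quote = null;
  for (let i = from; i < html.length; i++) {
    const ch = html[i];
    if (quote !== null) {
      if (ch === quote) quote = null;
    } else if (ch === '"' || ch === "'") {
      quote = ch;
    } else if (ch === '>') {
      return i;
    }
  }
  return -1;
}

const page = `
<html><body>
<!-- <script type="application/ld+json">{"@id": "commented out"}</script> -->
<SCRIPT id="a" type='application/ld+json'>{"@id": "some document"}</SCRIPT>
<script>console.log("not a data island");</script>
</body></html>`;

const parsedIslands = extractDataIslands(page).map((text) => JSON.parse(text));
console.log(parsedIslands);
```

The only dependency left is the JSON(-LD) parser one would need anyway; the commented-out island and the plain script are skipped without building a DOM.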

Thoughts?

melvincarvalho commented 8 months ago

Data Islands very much are a W3C REC. Not only that, they represent the de facto semantic web in 2024, via schema.org.

https://www.w3.org/TR/json-ld11/#embedding-json-ld-in-html-documents

I have a stub of a similar library, getj here:

https://github.com/spux/getj

Demo:

https://spux.org/getj/test.html

melvincarvalho commented 8 months ago

IMHO RDFa (and XHTML) are technical debt that holds back projects that need to support these old, less popular formats. A good example being Solid. RDFa holds it back, developers don't want to join, and those that joined before walked away, because modern web devs want to use JSON.

VirginiaBalseiro commented 8 months ago

A good example being Solid. RDFa holds it back, developers don't want to join, and those that joined before walked away, because modern web devs want to use JSON.

Do you have data to back up this claim or is this just your opinion?

webr3 commented 7 months ago

If it helps any, I (heavily involved in RDFa WG, and RDFa API author) ripped out RDFa from many pages (100 million +) and moved our setups to json-ld in data islands (billions of pages).

For interest, they all utilize the data islands as data in JS also.

webr3 commented 7 months ago

for (const match of html_string.matchAll(/<script[^>]*?type="application\/ld\+json"[^>]*?>(.*?)<\/script>/sig)) { console.log(match[1]); }

    globalThis.di = Array.from(document.querySelectorAll('[type="application/ld+json"]'))
      .map(function (island) { return [island.id, JSON.parse(island.text)]; })
      .reduce(function (obj, item) {
        obj[item[0]] = item[1];
        return obj;
      }, {});

melvincarvalho commented 7 months ago

A good example being Solid. RDFa holds it back, developers don't want to join, and those that joined before walked away, because modern web devs want to use JSON.

Do you have data to back up this claim or is this just your opinion?

A bit of both. I founded the Solid Community Group and am in touch with many people there, and before it. I also have traffic statistics from Reddit. I created the biggest and most popular Solid Pod, and ran it for 1/4 of a decade until I got sick. I also follow the GitHub interest in Solid. While the project is extremely well funded, developer interest has waned from its peak. RDFa is hard to work with, and web developers like JSON. RDFa is also enormously buggy. Compare the triples on your own WebID in the RDFa with those in the Turtle: they are not the same, last I checked. I'm sure it will all get fixed eventually given the long runway that Solid has, but working with JSON allows other projects in the open (social) web to progress enormously fast. I helped onboard thousands of developers onto the open (social) web, and JSON is one of the big sellers. People will look at Solid and say "interesting" but then go and work on a JSON project.

melvincarvalho commented 7 months ago

Try this:

npx getj <uri_with_data_island>

for example

npx getj https://spux.org/getj/test.html

gives

{
  "@context": "http://schema.org",
  "@type": "WebPage",
  "url": "https://example.com",
  "name": "Example Web Page"
}

If there's interest, I can donate this npm library to the CG and we can collaborate on a function that will extract data islands from the command line, browser, or server.

jacoscaz commented 7 months ago

@webr3 @melvincarvalho both of your implementations rely on a full-blown DOM/HTML5 parser, though, as provided by either the browser or by dependencies. Ugly and hack-ish as it is, my code doesn't rely on anything but the obvious JSON-LD parser one would need anyway.

IMHO, compared to RDFa, which has its own media type and doesn't force a client to rely on heuristics, Data Islands (or Blocks, according to the JSON-LD spec) make sense only if they allow devs to dispense with the complexity of parsing HTML5 or, worse, of an in-memory DOM representation. Otherwise one would already be 80% there to RDFa support.

melvincarvalho commented 7 months ago

Ugly and hack-ish as it is, my code doesn't rely on anything but the obvious JSON-LD parser one would need anyway.

Mine was indeed an ugly hack too. But we could make a half decent library if we work together, I suspect.

jacoscaz commented 7 months ago

Would anyone object to converting this issue into a discussion?