vliz-be-opsci / py-sema

Overall parent of all packages involving semantic manipulation of RDF data.
MIT License

Fix LODhtmlparser to Discover Script Tags and Extract Data from JSON-LD and TTL References #121

Open cedricdcc opened 1 day ago

cedricdcc commented 1 day ago

Currently, the LODhtmlparser does not discover <script> tags within text/html content. Such tags may embed or reference application/ld+json or text/turtle data. To improve the parser's capability, it needs to detect these script tags and extract their data, particularly for tags that carry JSON-LD (ld+json) or Turtle (ttl).

This enhancement should enable the parser to:

A good example to test this functionality can be found in the head of the following URL:
https://www.rohub.org/046a10d6-e461-4811-acf2-309697ff34db?activetab=overview
This page includes a <script> tag in its head section where application/ld+json data is referenced. The parser should be able to identify this tag, retrieve the referenced data, and extract the RDF triples.

Expected Outcome:

Steps to Implement:

  1. Modify the LODhtmlparser to parse HTML content and detect <script> tags.
  2. Add logic to identify application/ld+json and text/turtle scripts.
  3. Implement data extraction and triple generation for the identified scripts.
  4. Create test cases to validate the parser's ability to handle these script tags, using the provided URL as a reference case.
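Steps 1 and 2 above could be sketched roughly as follows, using only the Python standard library. This is an illustration under assumptions, not the actual LODhtmlparser code: the names `RDFScriptFinder`, `find_rdf_scripts`, and `inline_jsonld` are hypothetical, and triple generation (step 3) is left to whatever RDF library the parser already uses.

```python
import json
from html.parser import HTMLParser

# MIME types named in this issue.
RDF_SCRIPT_TYPES = {"application/ld+json", "text/turtle"}


class RDFScriptFinder(HTMLParser):
    """Hypothetical helper: collects <script> tags that carry RDF data."""

    def __init__(self):
        super().__init__()
        self.found = []       # one dict per matching script tag
        self._current = None  # the matching script currently being read

    def handle_starttag(self, tag, attrs):
        if tag != "script":
            return
        attr_map = dict(attrs)
        # Drop any parameters such as "; charset=utf-8" and normalise case.
        mime = (attr_map.get("type") or "").split(";")[0].strip().lower()
        if mime in RDF_SCRIPT_TYPES:
            self._current = {"type": mime, "src": attr_map.get("src"), "data": ""}

    def handle_data(self, data):
        # Script content may arrive in several chunks, so concatenate.
        if self._current is not None:
            self._current["data"] += data

    def handle_endtag(self, tag):
        if tag == "script" and self._current is not None:
            self._current["data"] = self._current["data"].strip()
            self.found.append(self._current)
            self._current = None


def find_rdf_scripts(html):
    """Return every JSON-LD or Turtle script tag found in an HTML string."""
    finder = RDFScriptFinder()
    finder.feed(html)
    finder.close()
    return finder.found


def inline_jsonld(scripts):
    """Decode the inline JSON-LD payloads; tags that only have a src are skipped."""
    return [
        json.loads(s["data"])
        for s in scripts
        if s["type"] == "application/ld+json" and s["data"]
    ]
```

Each entry keeps both the inline `data` and the `src` attribute, since a tag may embed its payload or point to an external file. A `src` value would still need to be resolved against the page URL and fetched before it can be parsed.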

Test Case: The head of https://www.rohub.org/046a10d6-e461-4811-acf2-309697ff34db?activetab=overview contains a script tag with application/ld+json data. The parser should be able to detect this, extract the JSON-LD, and convert it into RDF triples for further processing.

Additional Notes:


Resolving this issue will improve the parser's ability to handle real-world data embedded in HTML pages and enhance its compatibility with modern web practices.

cedricdcc commented 1 day ago

This seems to be a problem on the side of the sites we visited: the site appears to inject the needed script tags dynamically, so they are absent from the HTML that a plain HTTP request returns, and we can never get them without faking a browser.

We still need to check whether this feature is tested. If the tests are already present in py-sema, this issue can be closed.
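One rough way to confirm the diagnosis above is to count the RDF script tags in the HTML that a plain HTTP request returns, without running any JavaScript. The sketch below is only an illustration: `count_static_rdf_scripts`, `fetch_static_html`, and the User-Agent string are hypothetical names, and the regular expression is a quick heuristic rather than a full HTML parser.

```python
import re
from urllib.request import Request, urlopen

# Heuristic: an opening <script> tag whose type is JSON-LD or Turtle.
_RDF_SCRIPT = re.compile(
    r"<script\b[^>]*\btype\s*=\s*[\"']?(application/ld\+json|text/turtle)\b",
    re.IGNORECASE,
)


def count_static_rdf_scripts(html: str) -> int:
    """Count RDF script tags present in the HTML as served."""
    return len(_RDF_SCRIPT.findall(html))


def fetch_static_html(url: str, timeout: float = 10.0) -> str:
    """Fetch a page without executing any JavaScript."""
    # The User-Agent value is a made-up placeholder.
    request = Request(url, headers={"User-Agent": "py-sema-check/0.1"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

If `count_static_rdf_scripts(fetch_static_html(url))` returns 0 for a page where the browser's inspector shows such a tag, the tag is most likely added after load, which would match the behaviour described in this comment.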