Fix LODhtmlparser to Discover Script Tags and Extract Data from JSON-LD and TTL References

Currently, the LODhtmlparser does not discover <script> tags within text/html content, even though such tags may embed or reference application/ld+json or text/turtle data. To close this gap, the parser needs to detect these script tags and extract data from them, particularly those embedding or referencing JSON-LD (ld+json) or Turtle (ttl) resources.

This enhancement should enable the parser to:

Identify <script> tags within HTML content.
Check for scripts that reference or embed data in application/ld+json and text/turtle formats.
Extract and process the RDF triples from these embedded or referenced resources.
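
As a rough illustration of the detection step only, the sketch below uses Python's standard-library html.parser to collect script tags of the two MIME types named above. The names ScriptTagCollector and find_rdf_scripts are hypothetical and are not part of py-sema; the real LODhtmlparser may well be structured differently.

```python
from html.parser import HTMLParser

# MIME types of interest, taken from the issue text.
RDF_SCRIPT_TYPES = {"application/ld+json", "text/turtle"}


class ScriptTagCollector(HTMLParser):
    """Collect <script> tags whose type is an RDF serialization (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.found = []       # one dict per match: {"type", "src", "content"}
        self._current = None  # the matching script tag currently being read, if any

    def handle_starttag(self, tag, attrs):
        if tag != "script":
            return
        attr_map = dict(attrs)
        # Drop parameters such as "; charset=utf-8" and normalise case.
        mime = (attr_map.get("type") or "").split(";")[0].strip().lower()
        if mime in RDF_SCRIPT_TYPES:
            self._current = {"type": mime, "src": attr_map.get("src"), "content": ""}

    def handle_data(self, data):
        # Only accumulate text while inside a matching script tag.
        if self._current is not None:
            self._current["content"] += data

    def handle_endtag(self, tag):
        if tag == "script" and self._current is not None:
            self._current["content"] = self._current["content"].strip()
            self.found.append(self._current)
            self._current = None


def find_rdf_scripts(html_text):
    """Return the embedded or referenced RDF script tags found in html_text."""
    collector = ScriptTagCollector()
    collector.feed(html_text)
    collector.close()
    return collector.found
```

An entry with a non-empty "src" would be a referenced resource that still has to be fetched, while an entry with non-empty "content" carries the data inline.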

A good example to test this functionality can be found in the head of the following URL:
https://www.rohub.org/046a10d6-e461-4811-acf2-309697ff34db?activetab=overview
This page includes a <script> tag in its head section where application/ld+json data is referenced. The parser should be able to identify this tag, retrieve the referenced data, and extract the RDF triples.

Expected Outcome:

The LODhtmlparser can detect <script> tags in HTML.
It can extract triples from application/ld+json and text/turtle data embedded in or referenced by these tags.
The extracted data is processed as RDF triples and handled appropriately by the parser.

Steps to Implement:

Modify the LODhtmlparser to parse HTML content and detect <script> tags.
Add logic to identify application/ld+json and text/turtle scripts.
Implement data extraction and triple generation for the identified scripts.
Create test cases to validate the parser's ability to handle these script tags, using the provided URL as a reference case.
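
One possible way to approach the identification and extraction steps is sketched below, assuming the discovered script tags are available as simple dicts with "type", "src" and "content" keys. That shape, and the function name plan_rdf_sources, are illustrative assumptions rather than the parser's actual data model. The sketch maps each MIME type to a format name and resolves relative references against the page URL.

```python
from urllib.parse import urljoin

# Format names as accepted by rdflib's Graph.parse(format=...).
# Using rdflib for the actual triple extraction is an assumption of this sketch.
MIME_TO_RDF_FORMAT = {
    "application/ld+json": "json-ld",
    "text/turtle": "turtle",
}


def plan_rdf_sources(scripts, page_url):
    """Turn discovered script descriptions into parse instructions.

    `scripts` is a list of dicts with keys "type", "src" and "content"
    (a hypothetical shape chosen for this example).
    """
    plans = []
    for script in scripts:
        rdf_format = MIME_TO_RDF_FORMAT.get(script.get("type"))
        if rdf_format is None:
            continue  # not an RDF serialization handled here
        if script.get("src"):
            # Referenced data: resolve relative URLs against the page URL.
            plans.append({
                "format": rdf_format,
                "location": urljoin(page_url, script["src"]),
            })
        elif script.get("content"):
            # Embedded data: keep the inline text and use the page URL as base.
            plans.append({
                "format": rdf_format,
                "data": script["content"],
                "base": page_url,
            })
    return plans
```

If rdflib is used, a plan with "location" could be passed to Graph.parse(location=..., format=...), and a plan with "data" to Graph.parse(data=..., format=..., publicID=...), so that both cases end up as triples in the same graph.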

Test Case: The head of https://www.rohub.org/046a10d6-e461-4811-acf2-309697ff34db?activetab=overview contains a script tag with application/ld+json data. The parser should be able to detect this, extract the JSON-LD, and convert it into RDF triples for further processing.

Additional Notes:

Ensure that any edge cases, such as malformed script tags or invalid JSON-LD, are handled gracefully.
Consider expanding support for other RDF serialization formats if needed in the future.
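
For the invalid JSON-LD edge case mentioned above, a minimal sketch of "graceful" handling could be to skip and log any block that does not parse, rather than failing the whole page. The function name load_jsonld_blocks is hypothetical, and the check only covers JSON well-formedness, not full JSON-LD validity.

```python
import json
import logging

logger = logging.getLogger(__name__)


def load_jsonld_blocks(raw_blocks):
    """Parse JSON-LD strings, skipping (and logging) any that are unusable."""
    documents = []
    for index, raw in enumerate(raw_blocks):
        try:
            doc = json.loads(raw)
        except json.JSONDecodeError as err:
            logger.warning("Skipping script block %d: invalid JSON (%s)", index, err)
            continue
        # A JSON-LD document is a JSON object or an array of objects;
        # bare scalars such as 42 or "text" are rejected here.
        if not isinstance(doc, (dict, list)):
            logger.warning("Skipping script block %d: not a JSON object or array", index)
            continue
        documents.append(doc)
    return documents
```

Skipping per block means one malformed script tag does not prevent triples from being extracted from the other tags on the same page.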

Resolving this issue will improve the parser's ability to handle real-world RDF data embedded in or referenced from HTML pages, and will bring it in line with common web publishing practices.

vliz-be-opsci / py-sema, issue #121