Open cedricdcc opened 1 day ago
This seems to be a problem on the side of the sites we visited: the site uses server-side rendering to inject the script tags we want, so we can never get them without faking a browser.
We still need to check whether this feature is tested. If the tests are present in py-sema, this issue can be closed.
Currently, the LODhtmlparser does not handle the discovery of `<script>` tags within text/HTML content, which may contain references to `application/ld+json` or `text/turtle` data. To improve the parser's capability, it needs to be able to detect and extract data from these script tags, particularly those referencing JSON-LD (`ld+json`) or Turtle (`ttl`) files.

This enhancement should enable the parser to:
- Detect `<script>` tags within HTML content.
- Extract data in the `application/ld+json` and `text/turtle` formats.

A good example to test this functionality can be found in the head of the following URL:
https://www.rohub.org/046a10d6-e461-4811-acf2-309697ff34db?activetab=overview
This page includes a `<script>` tag in its head section where `application/ld+json` data is referenced. The parser should be able to identify this tag, retrieve the referenced data, and extract the RDF triples.

Expected Outcome:
- The parser detects `<script>` tags in HTML.
- The parser extracts data in the `application/ld+json` and `text/turtle` formats referenced in these tags.

Steps to Implement:
- Add detection of `<script>` tags.
- Add handling of `application/ld+json` and `text/turtle` scripts.

Test Case: The head of https://www.rohub.org/046a10d6-e461-4811-acf2-309697ff34db?activetab=overview contains a script tag with `application/ld+json` data. The parser should be able to detect this, extract the JSON-LD, and convert it into RDF triples for further processing.

Additional Notes:
This issue will help improve the parser's ability to handle real-world data embedded in HTML pages and enhance its compatibility with modern web practices.
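
To make the request concrete, here is a minimal sketch of the detection step using only the Python standard library. It is not the LODhtmlparser implementation: `RdfScriptCollector`, `find_rdf_scripts`, and the sample HTML are hypothetical names and data made up for illustration. It collects every `<script>` element whose `type` is `application/ld+json` or `text/turtle`, keeping both the inline content and any `src` reference.

```python
import json
from html.parser import HTMLParser

# Media types this issue asks the parser to recognise.
RDF_SCRIPT_TYPES = {"application/ld+json", "text/turtle"}


class RdfScriptCollector(HTMLParser):
    """Hypothetical helper: collect <script> elements carrying RDF data."""

    def __init__(self):
        super().__init__()
        self.found = []       # one dict per matching script: type, src, data
        self._current = None  # the matching script currently being read

    def handle_starttag(self, tag, attrs):
        if tag != "script":
            return
        attributes = dict(attrs)
        # Drop parameters such as "; charset=utf-8" and normalise case.
        media_type = (attributes.get("type") or "").split(";")[0].strip().lower()
        if media_type in RDF_SCRIPT_TYPES:
            self._current = {
                "type": media_type,
                "src": attributes.get("src"),
                "data": "",
            }

    def handle_data(self, data):
        # Only accumulate text while inside a matching script element.
        if self._current is not None:
            self._current["data"] += data

    def handle_endtag(self, tag):
        if tag == "script" and self._current is not None:
            self._current["data"] = self._current["data"].strip()
            self.found.append(self._current)
            self._current = None


def find_rdf_scripts(html):
    """Return the JSON-LD and Turtle script elements found in an HTML string."""
    collector = RdfScriptCollector()
    collector.feed(html)
    collector.close()
    return collector.found


# Made-up sample page, not the content of the rohub.org URL above.
SAMPLE = """<html><head>
<script type="text/javascript">var x = 1;</script>
<script type="application/ld+json">{"@id": "http://example.org/a", "http://schema.org/name": "A"}</script>
<script type="text/turtle" src="data.ttl"></script>
</head><body></body></html>"""

scripts = find_rdf_scripts(SAMPLE)
first = json.loads(scripts[0]["data"])  # inline JSON-LD parses as plain JSON
```

The collected `data` strings could then be handed to an RDF library to obtain triples; with rdflib, for instance, `Graph().parse(data=..., format="json-ld")` or `format="turtle"` should do this, though that step is left out here to keep the sketch dependency-free. Entries with a `src` value would still need to be fetched separately. A sketch like this only sees tags present in the HTML that was actually fetched, so it does not address script tags that only appear when the page is loaded in a browser.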