Open pvgenuchten opened 2 months ago
In CORDIS, using the SPARQL endpoint, we can get hold of the URLs of the deliverables related to a project, e.g.
Query is based on: ?project a eurio:Project. ?project eurio:hasResult ?result. ?result a eurio:Result. ?result eurio:title ?title_res. ?result eurio:url ?url_res.
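For reference, the triple patterns above could be wrapped into a complete query roughly like this. This is only a sketch: the `eurio:` namespace IRI is my assumption and should be checked against the EURIO ontology, the endpoint URL is left as a parameter because it is not given here, and the function names are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

# Assumption: namespace IRI for the eurio: prefix; verify against the ontology.
EURIO_PREFIX = "http://data.europa.eu/s66#"


def build_deliverable_query(limit=10):
    """Return a SPARQL SELECT query built from the triple patterns above."""
    return f"""PREFIX eurio: <{EURIO_PREFIX}>
SELECT ?project ?title_res ?url_res WHERE {{
  ?project a eurio:Project .
  ?project eurio:hasResult ?result .
  ?result a eurio:Result .
  ?result eurio:title ?title_res .
  ?result eurio:url ?url_res .
}} LIMIT {int(limit)}"""


def parse_bindings(payload):
    """Flatten a SPARQL JSON result into a list of {variable: value} dicts."""
    return [
        {name: cell["value"] for name, cell in row.items()}
        for row in payload["results"]["bindings"]
    ]


def run_query(endpoint_url, query):
    """POST the query to a SPARQL endpoint and return the flattened rows."""
    data = urllib.parse.urlencode({"query": query}).encode("utf-8")
    request = urllib.request.Request(
        endpoint_url,
        data=data,
        headers={"Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return parse_bindings(json.load(response))
```

Calling `run_query(endpoint_url, build_deliverable_query(5))` with the real endpoint URL should then give a list of rows with `title_res` and `url_res` keys, assuming the endpoint supports the standard SPARQL JSON results format.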
These deliverables frequently seem to be reports containing knowledge (which we could extract, e.g. from the summary/abstract, with NLP techniques or an LLM). However, I cannot download these reports using a Python script. I expect this is because of some EU web interface in between that is used to download these documents. Does anyone have any experience with this?
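To narrow down where the download goes wrong, something like the following could show whether the URL returns the document itself or an intermediate HTML page. It is a diagnostic sketch, not a fix: `classify_payload`, `probe`, and the User-Agent string are hypothetical names I made up, and it only checks the Content-Type header and the `%PDF-` signature at the start of the body.

```python
import urllib.request


def classify_payload(content_type, first_bytes):
    """Guess whether a response body is a PDF, an HTML page, or something else."""
    head = first_bytes.lstrip()
    # PDF files start with the signature "%PDF-".
    if head.startswith(b"%PDF-"):
        return "pdf"
    # Trust the body over the header: an HTML page served with a wrong
    # Content-Type is still an HTML page.
    if head.lower().startswith((b"<!doctype html", b"<html")):
        return "html"
    if "html" in (content_type or "").lower():
        return "html"
    return "other"


def probe(url, user_agent="metadata-harvester/0.1 (hypothetical)"):
    """Fetch the first bytes of url and report what came back."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=30) as response:
        head = response.read(1024)
        content_type = response.headers.get("Content-Type")
        return {
            "status": response.status,
            "final_url": response.geturl(),  # differs from url after redirects
            "content_type": content_type,
            "kind": classify_payload(content_type, head),
        }
```

If `probe(url)["kind"]` comes back as `"html"` for a deliverable URL, that would support the suspicion that a web interface sits between the link and the actual file.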
The actual content (mostly of scientific articles) is useful to include as part of the metadata, to facilitate LLM-based and full-text search.
When harvesting a metadata record, identify which links are relevant to harvest, and define rules for making that decision:
To decide: whether to store the actual resource as a blob or to extract textual information from it.
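As a starting point for the discussion, the link-selection rules could look something like the sketch below. Everything here is an assumption for illustration: the record layout (a `links` list with `url` and optional `media_type`), the extension and media-type sets, and the function names. The actual rules still need to be agreed on.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical rule set: what counts as a document worth harvesting.
DOCUMENT_EXTENSIONS = {".pdf", ".docx", ".odt", ".txt"}
DOCUMENT_MEDIA_TYPES = {"application/pdf", "text/plain"}


def is_relevant_link(link):
    """Apply the rules to one link dict with 'url' and optional 'media_type'."""
    parsed = urlparse(link.get("url", ""))
    # Rule 1: only follow web links.
    if parsed.scheme not in ("http", "https"):
        return False
    # Rule 2: a declared document media type is enough.
    media_type = (link.get("media_type") or "").split(";")[0].strip().lower()
    if media_type in DOCUMENT_MEDIA_TYPES:
        return True
    # Rule 3: otherwise fall back to the file extension in the URL path.
    extension = posixpath.splitext(parsed.path)[1].lower()
    return extension in DOCUMENT_EXTENSIONS


def select_links(record):
    """Return the URLs in a metadata record that pass the rules."""
    return [
        link["url"]
        for link in record.get("links", [])
        if is_relevant_link(link)
    ]
```

Keeping the rules as small, separate checks should make it easy to add or drop one once we decide what is actually relevant.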