soilwise-he / harvesters

MIT License

fetch and store described resource as part of harvest #12

Open pvgenuchten opened 2 months ago

pvgenuchten commented 2 months ago

The actual content of a described resource (mostly scientific articles) is useful to include alongside the metadata, because it facilitates LLM use and full-text search.

When harvesting a metadata record, identify which of its links are relevant to harvest, and define rules for making that decision.

To decide: whether to store the actual resource as a blob or to extract the textual information from it.
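
As a starting point for the link-selection rules mentioned above, here is a minimal sketch in Python. It assumes each harvested record exposes its links as dicts with a `url` key and an optional `type` key holding a media type. Those field names, the lists of accepted media types and extensions, and the function names `is_relevant` and `select_links` are all hypothetical and would need to be agreed on.

```python
from urllib.parse import urlparse

# Hypothetical rule set: media types and extensions treated as full-text candidates.
FULLTEXT_MEDIA_TYPES = {"application/pdf", "text/html", "text/plain"}
FULLTEXT_EXTENSIONS = (".pdf", ".txt", ".html", ".htm")


def is_relevant(link: dict) -> bool:
    """Return True if a link looks like it points at the described resource itself."""
    parsed = urlparse(link.get("url", ""))
    if parsed.scheme not in ("http", "https"):
        return False
    media_type = (link.get("type") or "").lower()
    if media_type:
        # A declared media type takes precedence over the file extension.
        return media_type in FULLTEXT_MEDIA_TYPES
    # No declared media type: fall back to the file extension.
    return parsed.path.lower().endswith(FULLTEXT_EXTENSIONS)


def select_links(links: list[dict]) -> list[str]:
    """Return the URLs of all links that pass the relevance rules."""
    return [link["url"] for link in links if is_relevant(link)]
```

Keeping the rules in plain data (the two constants) would make it easy to tune them per source without touching the harvester logic.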

BerkvensNick commented 2 months ago

In CORDIS, using the SPARQL endpoint, we can get hold of the URLs of the deliverables related to a project, e.g.

https://ec.europa.eu/research/participants/documents/downloadPublic?documentIds=080166e5c11c5b87&appId=PPGMS

The query is based on these triple patterns:

?project a eurio:Project .
?project eurio:hasResult ?result .
?result a eurio:Result .
?result eurio:title ?title_res .
?result eurio:url ?url_res .
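
For reference, the patterns above could be wrapped into a complete query and its results read as follows. This is only a sketch: the `eurio:` namespace IRI, the `SELECT` variable list, and the `LIMIT` are assumptions added for illustration and should be checked against the endpoint. The helper `result_urls` is a hypothetical name; it only relies on the standard SPARQL 1.1 JSON results layout.

```python
import json

# Assumption: EURIO namespace IRI; verify it against the endpoint before use.
EURIO_PREFIX = "PREFIX eurio: <http://data.europa.eu/s66#>"

# The SELECT clause and LIMIT are illustrative additions around the original patterns.
QUERY = f"""{EURIO_PREFIX}
SELECT ?project ?title_res ?url_res WHERE {{
  ?project a eurio:Project .
  ?project eurio:hasResult ?result .
  ?result a eurio:Result .
  ?result eurio:title ?title_res .
  ?result eurio:url ?url_res .
}}
LIMIT 10
"""


def result_urls(sparql_json: str) -> list[tuple[str, str]]:
    """Extract (title, url) pairs from a SPARQL 1.1 JSON results document."""
    bindings = json.loads(sparql_json)["results"]["bindings"]
    return [(b["title_res"]["value"], b["url_res"]["value"]) for b in bindings]
```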

These deliverables frequently seem to be reports containing knowledge, which we could extract, e.g. from the summary or abstract, with NLP techniques or an LLM. However, I cannot download these reports using a Python script. I suspect this is because some EU web interface sits in between and is used to serve these documents. Does anyone have any experience with this?
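
One way to narrow the problem down might be to check what the server actually returns to a script: a PDF, or an HTML page. The sketch below uses only the Python standard library. The function names `classify_response` and `probe` are hypothetical, and sending a browser-like `User-Agent` is a guess at a possible cause, not a confirmed fix.

```python
import urllib.request


def classify_response(content_type: str, body_start: bytes) -> str:
    """Classify a response as 'pdf', 'html', or 'other' from its first bytes and header."""
    head = body_start.lstrip()[:64].lower()
    if head.startswith(b"%pdf-"):
        # PDF files begin with the magic bytes %PDF-.
        return "pdf"
    if "html" in content_type.lower() or head.startswith((b"<!doctype html", b"<html")):
        return "html"
    return "other"


def probe(url: str, timeout: float = 30.0) -> str:
    """Fetch the first bytes of a URL and report what kind of content came back."""
    # Assumption: a browser-like User-Agent may matter; this is unverified.
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        content_type = response.headers.get("Content-Type", "")
        return classify_response(content_type, response.read(1024))
```

If `probe` reports `html` for a deliverable URL, the script is probably receiving an intermediate page rather than the document itself, which would at least confirm where the download goes wrong.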