vliz-be-opsci / py-trav-harv

Python module that allows an end user to perform link traversal on a triple store.

Optimize resource retrieval when asserting paths by making a cache #57

Open cedricdcc opened 1 month ago

cedricdcc commented 1 month ago

Currently travharv retrieves a resource every time it needs it. It doesn't check whether the resource (URI) was already retrieved earlier. As a result the same resource is retrieved multiple times, which leads to long waiting times for tasks that have many assertion paths to traverse and harvest.

A possible solution is to look at the execution report and collect all resources that were already harvested, together with their date of harvest and mimetype, to make sure that all the different mimetypes were harvested.

With this a cache can be made that travharv can use.
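
A rough sketch of what such a cache could look like, just to make the idea concrete. None of these names exist in travharv: `HarvestCache`, `fetch_if_needed` and the `fetch` callback are hypothetical, and how the entries would be read from the execution report is left open. The assumption is that a resource counts as "already harvested" only per (URI, mimetype) pair, optionally limited by a maximum age.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class HarvestCache:
    """Hypothetical cache of harvested resources, keyed by (URI, mimetype)."""

    # Entries older than max_age count as stale; None means they never expire.
    max_age: Optional[timedelta] = None
    _seen: Dict[Tuple[str, str], datetime] = field(default_factory=dict)

    def record(self, uri: str, mimetype: str, harvested_at: datetime) -> None:
        """Remember that `uri` was harvested as `mimetype` at `harvested_at`."""
        self._seen[(uri, mimetype)] = harvested_at

    def is_fresh(self, uri: str, mimetype: str,
                 now: Optional[datetime] = None) -> bool:
        """Return True if this (URI, mimetype) pair does not need a new fetch."""
        harvested_at = self._seen.get((uri, mimetype))
        if harvested_at is None:
            return False
        if self.max_age is None:
            return True
        now = now or datetime.now(timezone.utc)
        return now - harvested_at <= self.max_age


def fetch_if_needed(cache: HarvestCache, uri: str, mimetypes: List[str],
                    fetch: Callable[[str, str], None]) -> List[str]:
    """Call `fetch` only for mimetypes of `uri` that are missing or stale.

    Returns the list of mimetypes that were actually fetched.
    """
    fetched = []
    for mimetype in mimetypes:
        if cache.is_fresh(uri, mimetype):
            continue  # already harvested in this form, skip the request
        fetch(uri, mimetype)
        cache.record(uri, mimetype, datetime.now(timezone.utc))
        fetched.append(mimetype)
    return fetched
```

Seeding the cache from the execution report would then come down to calling `record(uri, mimetype, harvested_at)` once for every harvested entry found in the report, before the assertion paths are processed.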