lblod / app-lblod-harvester

Harvesting Self Service
MIT License

Feature/split files #58

Closed (nbittich closed 7 months ago)

nbittich commented 9 months ago

I've opened this as a draft pull request because it involves a lot of changes everywhere; it would be better to let it run on dev for a while before merging or releasing.

On reflection, the diff might be less accurate with this change. For example, if a URL is removed from one of the HTML pages, we won't be able to add it to "to-remove.ttl".
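To illustrate the concern, here is a minimal sketch of one possible reading: it assumes the diff is computed file by file, keyed on the page URL, which this PR does not confirm. `diff_per_page`, the `Triple` alias and the example URLs are hypothetical names, not part of the harvester.

```python
from typing import Dict, Set, Tuple

Triple = Tuple[str, str, str]


def diff_per_page(
    previous: Dict[str, Set[Triple]],
    current: Dict[str, Set[Triple]],
) -> Tuple[Set[Triple], Set[Triple]]:
    """Diff only the pages present in the current harvest (assumed behaviour)."""
    to_insert: Set[Triple] = set()
    to_remove: Set[Triple] = set()
    for url, triples in current.items():
        old = previous.get(url, set())
        to_insert |= triples - old
        to_remove |= old - triples
    # Pages that were harvested before but are no longer reachable never
    # enter the loop, so their triples are never added to to_remove.
    return to_insert, to_remove


# Hypothetical data: a.html used to link to b.html, and the link was removed.
link = ("a.html", "ex:linksTo", "b.html")
title_a = ("a.html", "ex:title", "A")
title_b = ("b.html", "ex:title", "B")

previous = {"a.html": {title_a, link}, "b.html": {title_b}}
current = {"a.html": {title_a}}

to_insert, to_remove = diff_per_page(previous, current)
```

Under these assumptions the removed link itself ends up in `to_remove`, but the triples harvested from `b.html` do not, because no current file exists to compare them against.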

On my machine, one job takes 25 minutes to finish (4 minutes without the split), probably because of the extra I/O.

I started working on pagination, but after checking this project's virtuoso.ini file I saw that the limit is set to 1 million, so I stopped. Pagination is currently implemented only for the import, validation, and diff services. If we want it everywhere, please let me know.
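For reference, LIMIT/OFFSET pagination of a SPARQL query can be sketched roughly as follows. This is not the code from this PR: `paginate`, `make_fake_endpoint` and the page size are hypothetical, and the in-memory endpoint only stands in for a real triple store so the sketch can run on its own.

```python
import re
from typing import Callable, Dict, Iterator, List

Row = Dict[str, int]


def paginate(
    run_query: Callable[[str], List[Row]],
    base_query: str,
    page_size: int = 10_000,
) -> Iterator[Row]:
    """Yield all rows of base_query, fetching page_size rows per request."""
    offset = 0
    while True:
        page = run_query(f"{base_query} LIMIT {page_size} OFFSET {offset}")
        yield from page
        if len(page) < page_size:
            # A short (or empty) page means there is nothing left to fetch.
            break
        offset += page_size


def make_fake_endpoint(rows: List[Row]) -> Callable[[str], List[Row]]:
    """In-memory stand-in for a SPARQL endpoint; it only honours LIMIT/OFFSET."""

    def run_query(query: str) -> List[Row]:
        match = re.search(r"LIMIT (\d+) OFFSET (\d+)$", query)
        limit, offset = int(match.group(1)), int(match.group(2))
        return rows[offset:offset + limit]

    return run_query


rows = [{"s": i} for i in range(25)]
query = "SELECT ?s WHERE { ?s ?p ?o } ORDER BY ?s"
fetched = list(paginate(make_fake_endpoint(rows), query, page_size=10))
```

The `ORDER BY` in the base query matters: without a stable ordering, the store is free to return rows in a different order for each request, so pages could overlap or skip results.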

nbittich commented 7 months ago

Rebased in #63.