Open satyamtg opened 4 years ago
To me, for the moment, such a feature would be better placed in python-scraperlib (or any higher-level library) because:
Also, as discussed with @kelson42, articles have no checksum in the ZIM. I was led to think they did based on zimcheck's duplicates output, but zimcheck calculates those checksums itself.
What we could do is have a helper in scraperlib that calculates checksums, stores them and compares them to adjust behavior (create redirects?). That would be extra and should be enabled on a subset of articles via some filtering pattern. The main use case would be zimit, where the scraper has no control over the content. In that case, if zimcheck reports duplicates, we could enable this mechanism in the recipe by specifying the filtering patterns.
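To make the idea concrete, here is a minimal sketch of what such a helper might look like. It uses only the Python standard library and is not an existing scraperlib API: `DuplicateTracker`, `check` and `plan_entries` are hypothetical names, the `images/*` pattern is just an example, and SHA-256 is an arbitrary choice of checksum. What to do with a detected duplicate (redirect, alias, skip) is left to the caller.

```python
import fnmatch
import hashlib
from typing import Dict, Iterable, List, Optional, Tuple


class DuplicateTracker:
    """Hypothetical helper: remembers checksums of entries matching patterns."""

    def __init__(self, patterns: Iterable[str]):
        # Only paths matching one of these glob patterns are checksummed,
        # to limit the CPU/RAM cost to a subset of entries.
        self.patterns = list(patterns)
        # Maps a content digest to the first path seen with that content.
        self.seen: Dict[bytes, str] = {}

    def is_tracked(self, path: str) -> bool:
        return any(fnmatch.fnmatchcase(path, p) for p in self.patterns)

    def check(self, path: str, content: bytes) -> Optional[str]:
        """Return the path of an earlier identical entry, or None."""
        if not self.is_tracked(path):
            return None
        digest = hashlib.sha256(content).digest()
        original = self.seen.get(digest)
        if original is None:
            self.seen[digest] = path
        return original


def plan_entries(
    tracker: DuplicateTracker, entries: Iterable[Tuple[str, bytes]]
) -> List[Tuple[str, str, Optional[str]]]:
    """Decide, per entry, whether to add it or redirect it to an earlier copy."""
    actions = []
    for path, content in entries:
        original = tracker.check(path, content)
        if original is None:
            actions.append(("add", path, None))
        else:
            actions.append(("redirect", path, original))
    return actions


entries = [
    ("images/logo.png", b"same-bytes"),
    ("images/logo-copy.png", b"same-bytes"),
    ("pages/index.html", b"same-bytes"),  # not tracked: pattern does not match
]
actions = plan_entries(DuplicateTracker(["images/*"]), entries)
```

In a real scraper the `add` and `redirect` actions would map onto whatever the ZIM creator offers for adding items and redirects. Keeping one digest per tracked entry in memory is where the RAM cost comes from, which is why the filtering pattern matters.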
This feature could have a HUGE impact on resources (CPU, RAM, potentially IO), so its goal would be to clear duplicates only in cases where that cannot be done in the scraper. Non-generic scrapers should take care of duplicates themselves.
This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.
I will start working on an implementation of this issue and will open a PR once I have something ready to review. I will try to follow the advice mentioned above.
Maybe we should use aliases instead?
That doesn't solve anything. We still don't know, before adding the entry, that it's a duplicate. If we did, we'd probably do things differently depending on the scraper: not include the resource, use an alias, or use a redirect.
As discussed in https://github.com/openzim/sotoki/pull/162#issuecomment-660452579, it actually seems a bit odd to handle duplicate files in the scrapers. We could instead have a system that keeps a single copy of a resource and creates redirects when it is duplicated (or fails intelligently so we can handle it).