Automatically redirect to articles with same checksum

openzim / python-scraperlib

Collection of Python code to re-use across Python-based scrapers

GNU General Public License v3.0


Automatically redirect to articles with same checksum #33

Open satyamtg opened 4 years ago

satyamtg commented 4 years ago

As discussed in https://github.com/openzim/sotoki/pull/162#issuecomment-660452579, it actually seems a bit odd to handle duplicate files in the scrapers. We could instead have a system that keeps a single copy of a resource and creates redirects when that resource is duplicated (or fails intelligently so we can handle it).

kelson42 commented 4 years ago

To me, for the moment, such a feature would be better placed in python-scraperlib (or any higher-level library) because:

This is too smart to be done in libzim
I believe the scraper (not libzim) should basically be able to do things in a clean manner
I understand that, under certain special conditions, a high-level scraper might prefer to rely on a lower-level smart feature like this for a certain range of articles

rgaudin commented 3 years ago

Also, as discussed with @kelson42, articles have no checksum in the ZIM. I was led to think they did by zimcheck's duplicates output, but zimcheck computes those checksums itself.

What we could do is have a helper in scraperlib that calculates checksums, stores them, and compares them to adjust behavior (create redirects?). That would be an extra, enabled only on a subset of articles via some filtering pattern. The main use case would be zimit, where the scraper has no control over the content. In that case, if zimcheck reports duplicates, we could enable this mechanism in the recipe by specifying the filtering patterns.
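A minimal sketch of what such a helper might look like, to make the idea concrete. Everything here is an assumption for illustration: `DuplicateTracker`, its methods, the glob-style patterns, and the choice of SHA-256 are hypothetical and not part of scraperlib.

```python
import fnmatch
import hashlib
from typing import Dict, Iterable, Optional


class DuplicateTracker:
    """Hypothetical helper: remembers content digests for filtered paths."""

    def __init__(self, patterns: Iterable[str]):
        # Only paths matching one of these glob patterns are checksummed,
        # which limits the CPU and RAM cost to a subset of entries.
        self.patterns = list(patterns)
        # digest -> path of the first entry seen with that content
        self._seen: Dict[bytes, str] = {}

    def is_tracked(self, path: str) -> bool:
        return any(fnmatch.fnmatchcase(path, p) for p in self.patterns)

    def check(self, path: str, content: bytes) -> Optional[str]:
        """Return the path of an earlier identical entry, else None."""
        if not self.is_tracked(path):
            return None
        digest = hashlib.sha256(content).digest()
        first = self._seen.setdefault(digest, path)
        return None if first == path else first
```

A scraper would call `check()` before adding each entry: a `None` result means the entry is added normally, and any other result is the path a redirect could point to. Only digests are kept in memory, not the content itself.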

This feature could have a HUGE impact on resources (CPU, RAM, potentially IO), so its goal would be to clear duplicates only in cases where that cannot be done in the scraper. Non-generic scrapers should take care of duplicates themselves.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

benoit74 commented 2 years ago

I will start working on an implementation for this issue and will open a PR once I have something ready to review. I will try to follow the advice given above.

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

kelson42 commented 7 months ago

Maybe we should better use aliases?

rgaudin commented 7 months ago

> Maybe we should better use aliases?

That doesn't solve anything. We still don't know, before adding the entry, that it's a duplicate. If we did, we'd probably do things differently depending on the scraper: not include the resource, use an alias, or use a redirect.
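To illustrate the point, here is a rough sketch in which the duplicate check happens before the entry is added and the scraper picks what to do with a duplicate. The names `Action` and `plan_entry` are hypothetical, and no real libzim or scraperlib call is shown; the caller would map the returned action onto whatever its ZIM creator offers.

```python
import hashlib
from enum import Enum
from typing import Dict, Optional, Tuple


class Action(Enum):
    ADD = "add"            # first time this content is seen
    SKIP = "skip"          # do not include the duplicate at all
    ALIAS = "alias"        # second path pointing at the same content
    REDIRECT = "redirect"  # redirect entry to the original path


def plan_entry(
    seen: Dict[str, str],
    path: str,
    content: bytes,
    on_duplicate: Action = Action.REDIRECT,
) -> Tuple[Action, Optional[str]]:
    """Decide what to do with an entry before it is added.

    `seen` maps a content digest to the path of the first entry
    carrying that content; it is updated in place.
    """
    digest = hashlib.sha256(content).hexdigest()
    original = seen.get(digest)
    if original is None:
        seen[digest] = path
        return Action.ADD, None
    # Duplicate: the scraper-specific policy decides the outcome.
    return on_duplicate, original
```

The policy is a parameter because the right answer depends on the scraper: one may drop the resource entirely while another needs the duplicate path to keep resolving.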