openzim / warc2zim

Command line tool to convert a file in the WARC format to a file in the ZIM format
https://pypi.org/project/warc2zim/
GNU General Public License v3.0

Zimit2: Allow deduplication of entries #199

Open benoit74 opened 7 months ago

benoit74 commented 7 months ago

It looks like Zimcheck is complaining about quality issues in most (all?) Zimit2 files.

It already did so for Zimit1, but maybe it is time to address the problems.

The first obvious problem is that lots of content is duplicated inside the ZIM due to different URLs leading to the same content. I think this could be pretty easily addressed (even if it clearly means additional processing to deduplicate).

{
    "check": "redundant",
    "level": "WARNING",
    "message": "solar.lowtechmagazine.com/2011/01/aerial-ropeways-automatic-cargo-transport-for-a-bargain/images/dithers/aerial-ropeway-colour-2_dithered.png and solar.lowtechmagazine.com/fr/2011/01/aerial-ropeways-automatic-cargo-transport-for-a-bargain/images/dithers/aerial-ropeway-colour-2_dithered.png",
    "path1": "solar.lowtechmagazine.com/2011/01/aerial-ropeways-automatic-cargo-transport-for-a-bargain/images/dithers/aerial-ropeway-colour-2_dithered.png",
    "path2": "solar.lowtechmagazine.com/fr/2011/01/aerial-ropeways-automatic-cargo-transport-for-a-bargain/images/dithers/aerial-ropeway-colour-2_dithered.png"
},
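A minimal sketch of how such redundant entries could be detected, assuming each entry is available as a `(path, content)` pair before it is written to the ZIM. The `deduplicate` helper and the example paths are hypothetical and are not part of warc2zim or scraperlib; the idea is only to hash the payload and keep the first path seen for each digest, recording later paths as aliases of it.

```python
import hashlib


def deduplicate(entries):
    """Split (path, content) pairs into stored entries and aliases.

    Hypothetical helper: the first path seen for a given payload is stored,
    and every later path with identical bytes is recorded as an alias of it.
    """
    first_path_by_digest = {}
    stored = {}
    aliases = {}
    for path, content in entries:
        digest = hashlib.sha256(content).hexdigest()
        original = first_path_by_digest.get(digest)
        if original is None:
            # First time this payload is seen: keep the bytes.
            first_path_by_digest[digest] = path
            stored[path] = content
        else:
            # Same bytes already stored under another path: only point to it.
            aliases[path] = original
    return stored, aliases


# Made-up entries mimicking the same image served under two language prefixes.
entries = [
    ("example.com/2011/01/article/images/photo.png", b"fake image bytes"),
    ("example.com/fr/2011/01/article/images/photo.png", b"fake image bytes"),
    ("example.com/2011/01/article/index.html", b"<html>EN</html>"),
]
stored, aliases = deduplicate(entries)
```

The cost is one hash per entry plus a dictionary of digests, which is the "additional processing" mentioned above.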

For a website like solar.lowtechmagazine.com, which is available in multiple languages, this could even make a significant difference in final file size. I am not sure whether compression manages to cancel out duplicated content like this; at least some people say it is not possible, e.g. https://superuser.com/a/479083.
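As an illustration of why compression alone may not remove such duplicates, here is a small experiment with Python's `zlib`, whose DEFLATE window is 32 KiB. This is an assumption-laden stand-in: ZIM files are compressed per cluster with a different codec, so the snippet only shows the general limitation of window-based compressors and says nothing definitive about ZIM itself.

```python
import random
import zlib

# Deterministic pseudo-random bytes stand in for an already-compressed image.
rng = random.Random(0)
block = bytes(rng.getrandbits(8) for _ in range(100_000))

# Two copies 100 kB apart: the second copy lies outside the 32 KiB window,
# so the compressor cannot refer back to the first one.
far_single = len(zlib.compress(block))
far_doubled = len(zlib.compress(block + block))

# Two copies 10 kB apart: the second copy is within the window and can be
# encoded as back-references to the first.
small = block[:10_000]
near_doubled = len(zlib.compress(small + small))

print(far_single, far_doubled, near_doubled)
```

When the copies are far apart, the doubled input compresses to roughly twice the single size; when they are close together, the second copy costs very little.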

rgaudin commented 7 months ago

The new alias might be of help

kelson42 commented 7 months ago

To me, Zimcheck "warnings" are not a priority, particularly at the moment. IMO this should be a feature request and descoped from the "Zimit2" project.

A solution for this deduplication feature was proposed years ago at the scraperlib level.

benoit74 commented 7 months ago

Treating all Zimcheck "warnings" may not be a priority, but from my PoV avoiding artificially large ZIMs deserves consideration. I do not mind if we de-scope this.

I don't know why someone proposed a PR to fix https://github.com/openzim/python-scraperlib/issues/33 but never finished the job!

I'm joking of course; I was probably very tired or angry at someone else that day. I intend to finish this PR to fix this Zimit2 issue; it was not that far from being OK.