openzim / warc2zim

Command line tool to convert a file in the WARC format to a file in the ZIM format
https://pypi.org/project/warc2zim/
GNU General Public License v3.0

Fuzzy-rule for cheatography.com JS #342

Open benoit74 opened 4 days ago

benoit74 commented 4 days ago

cheatography.com has a JS file which is requested with a timestamp in the query string, so the ZIM ends up with many duplicate entries that have exactly the same content.

Samples to match (there are 69575 such records in the last task):

cheatography.com/scripts/useful.min.js?v=2&q=1719438924
cheatography.com/scripts/useful.min.js?v=2&q=1719438930
cheatography.com/scripts/useful.min.js?v=2&q=1719438936
cheatography.com/scripts/useful.min.js?v=2&q=1719438943
cheatography.com/scripts/useful.min.js?v=2&q=1719438950

We need a fuzzy rule to remove the timestamp and save space in the ZIM (the file is 17K, so this is 1.18G uncompressed; not sure what share of the final ZIM size this is once compressed).
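
To make the request concrete, here is a minimal sketch in plain Python of what such a rule could do. It is not warc2zim's actual rule format or code: the regex, the `normalize` helper, and the choice to keep `v=2` while dropping only `q=<digits>` are assumptions made for illustration. It also rechecks the size estimate above, assuming 17K means 17,000 bytes.

```python
import re

# Hypothetical pattern (an assumption, not an existing warc2zim rule):
# capture everything up to the version parameter, drop the "q" timestamp.
TIMESTAMP_RULE = re.compile(
    r"^((?:https?://)?cheatography\.com/scripts/useful\.min\.js\?v=\d+)&q=\d+$"
)


def normalize(url: str) -> str:
    """Return the URL without its timestamp; non-matching URLs are unchanged."""
    return TIMESTAMP_RULE.sub(r"\1", url)


# Sample URLs taken from the records listed above.
samples = [
    "cheatography.com/scripts/useful.min.js?v=2&q=1719438924",
    "cheatography.com/scripts/useful.min.js?v=2&q=1719438930",
    "cheatography.com/scripts/useful.min.js?v=2&q=1719438936",
    "cheatography.com/scripts/useful.min.js?v=2&q=1719438943",
    "cheatography.com/scripts/useful.min.js?v=2&q=1719438950",
]

# All samples should collapse to a single entry once the timestamp is gone.
normalized = {normalize(url) for url in samples}

# Rough size check: 69575 records of about 17 KB each, in decimal GB.
record_count = 69575
file_size_kb = 17
estimated_gb = record_count * file_size_kb / 1_000_000

print(normalized)
print(f"{estimated_gb:.2f} GB uncompressed")
```

With a rule along these lines, the five samples map to one entry, so only one copy of the script would need to be stored.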