openzim / python-scraperlib

Collection of Python code to re-use across Python-based scrapers
GNU General Public License v3.0
18 stars 16 forks source link

Automatically redirect to articles with same checksum #86

Closed benoit74 closed 2 years ago

benoit74 commented 2 years ago

Fix #33

This is clearly a first draft / proposal, open to review / fixes.

This change adds a method add_autodedup_filter to specify which paths should be checked for duplicates (checksums are only computed on this path, and checked for duplicates on this path, in order to limit the footprint).

This change modify the method add_item_for to first compute the SHA256 (to be sure to avoid collision) checksum of content or content at fpath; if duplicate is found, an add_redirect is done instead ; if no duplicate, the checksum is added to a dictionary of checksum => item_path.

codecov[bot] commented 2 years ago

Codecov Report

Merging #86 (a3ad5c5) into master (fe0b5fa) will not change coverage. The diff coverage is 100.00%.

@@            Coverage Diff            @@
##            master       #86   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           31        31           
  Lines         1159      1186   +27     
=========================================
+ Hits          1159      1186   +27     
Impacted Files Coverage Δ
src/zimscraperlib/zim/creator.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update fe0b5fa...a3ad5c5. Read the comment docs.

benoit74 commented 2 years ago

@rgaudin @kelson42 I consider this code ready for review now, feel free to comment

kelson42 commented 2 years ago

@rgaudin Could you please review this code?

rgaudin commented 2 years ago

@benoit74 if you have reasons to believe that md5 will generate many exceptions (@kelson42 tells me you mentioned it somewhere), you may take a look at http://cyan4973.github.io/xxHash/ (from https://github.com/openzim/libzim/issues/614)

benoit74 commented 2 years ago

Hi @rgaudin, thank you for the review.

To be honest, I'm embarrassed about it.

First because I'm quite sure to not have the competencies / time to do the requested modifications, second because I'm not sure to really agree about your arguments / PoV.

I won't start a detailed discussion about it here due to the combination of these both issues.

In addition, since you will be the one(s) to support the long term maintenance of this library, I would be more than embarrassed to "force" you to do something that could be proven to be wrong / costly without supporting these consequences myself.

I will hence close this PR and live with my own non-standard implementation in my scrapper.

No worries anyway, I'm not upset at all, I just have my own agenda / priorities and you have yours, quite normal.

rgaudin commented 2 years ago

@benoit74, I'm glad you're not upset ; definitely not my intent but I am a bit disappointed myself. Not willing to do a change or nor having time to is 100% OK. You're volunteering time on this project and we're thankful for it.

Yet, not sharing your disagreements is meh (can't find a better word). We're all doing this in the open so we can all benefit from other's point of views and experience.

We may find time to implement this at some point, or another volunteer could or maybe my assumptions are incorrect and the PR should be merged… but we'd need to know why we're wrong or why you think we are.

I do know how time and resource consuming it can be to argument or demonstrate a point. You think it's not worth it and that's probably true from your POV but that's what collaboration is about.

We understand that time is precious which is why we're OK with pointers, approximations or hunches ; it's all over our repositories. Let us know what you think is wrong and we'll have a chance to think about/research it. I'm not saying we'll do it either way but a criticism is a chance to improve. Thanks again for your contribution. 🙏