Closed: benoit74 closed this pull request 2 years ago
Merging #86 (a3ad5c5) into master (fe0b5fa) will not change coverage. The diff coverage is 100.00%.
```
@@           Coverage Diff           @@
##            master       #86   +/-  ##
=========================================
  Coverage   100.00%   100.00%
=========================================
  Files           31        31
  Lines         1159      1186     +27
=========================================
+ Hits         1159      1186     +27
```

| Impacted Files | Coverage Δ |
|---|---|
| src/zimscraperlib/zim/creator.py | 100.00% <100.00%> (ø) |
Continue to review full report at Codecov.
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fe0b5fa...a3ad5c5. Read the comment docs.
@rgaudin @kelson42 I consider this code ready for review now, feel free to comment
@rgaudin Could you please review this code?
@benoit74 if you have reasons to believe that md5 will generate many exceptions (@kelson42 tells me you mentioned it somewhere), you may take a look at http://cyan4973.github.io/xxHash/ (from https://github.com/openzim/libzim/issues/614)
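To illustrate the trade-off being discussed: xxHash is a fast non-cryptographic hash (available in Python via the third-party `xxhash` package), whereas md5 and sha256 ship with the standard library. A minimal sketch using only `hashlib`, with a hypothetical payload, shows what swapping digest algorithms looks like in practice:

```python
import hashlib

# Hypothetical payload; in the dedup use case this would be the item content.
payload = b"<html><body>Hello</body></html>"

# md5 is fast but has known collision attacks; sha256 avoids them at some CPU cost.
# xxHash (third-party `xxhash` package) would be faster still, but offers no
# cryptographic collision resistance.
md5_digest = hashlib.md5(payload).hexdigest()
sha256_digest = hashlib.sha256(payload).hexdigest()

print(len(md5_digest))     # 32 hex chars (128-bit digest)
print(len(sha256_digest))  # 64 hex chars (256-bit digest)
```

Since the checksums here only guard deduplication (not security), a fast hash with a low accidental-collision rate may well be sufficient.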
Hi @rgaudin, thank you for the review.
To be honest, I'm embarrassed about it.
First, because I'm quite sure I don't have the competencies / time to make the requested modifications; second, because I'm not sure I really agree with your arguments / PoV.
I won't start a detailed discussion about it here due to the combination of both issues.
In addition, since you will be the one(s) supporting the long-term maintenance of this library, I would be more than embarrassed to "force" you into something that could prove to be wrong / costly without bearing those consequences myself.
I will hence close this PR and live with my own non-standard implementation in my scraper.
No worries anyway, I'm not upset at all, I just have my own agenda / priorities and you have yours, quite normal.
@benoit74, I'm glad you're not upset; that was definitely not my intent, but I am a bit disappointed myself. Not being willing to make a change, or not having time to, is 100% OK. You're volunteering time on this project and we're thankful for it.
Yet, not sharing your disagreements is meh (can't find a better word). We're all doing this in the open so we can all benefit from others' points of view and experience.
We may find time to implement this at some point, or another volunteer could or maybe my assumptions are incorrect and the PR should be merged… but we'd need to know why we're wrong or why you think we are.
I do know how time- and resource-consuming it can be to argue or demonstrate a point. You think it's not worth it, and that's probably true from your POV, but that's what collaboration is about.
We understand that time is precious, which is why we're OK with pointers, approximations, or hunches; it's all over our repositories. Let us know what you think is wrong and we'll have a chance to think about / research it. I'm not saying we'll do it either way, but criticism is a chance to improve. Thanks again for your contribution. 🙏
Fix #33

This is clearly a first draft / proposal, open to review / fixes.

This change adds a method `add_autodedup_filter` to specify which paths should be checked for duplicates (checksums are only computed on these paths, and only checked for duplicates on these paths, in order to limit the footprint). This change modifies the method `add_item_for` to first compute the SHA256 checksum (to be sure to avoid collisions) of `content`, or of the content at `fpath`; if a duplicate is found, an `add_redirect` is done instead; if no duplicate is found, the checksum is added to a dictionary of checksum => item_path.