openzim / warc2zim

Command line tool to convert a file in the WARC format to a file in the ZIM format
https://pypi.org/project/warc2zim/
GNU General Public License v3.0

Fix fuzzy rule for Youtube thumbnails in JS #285

benoit74 closed 1 month ago

benoit74 commented 1 month ago

While the fuzzy rule works well in Python, the characters after the `?` of the query string are not removed in JavaScript, so the fuzzy rewriting is incorrect.

E.g. `i.ytimg.com/vi/-KpLmsAR23I/maxresdefault.jpg?sqp=-oaymwEmCIAKENAF8quKqQMa8AEB-AH-CYAC0AWKAgwIABABGHIgTyg-MA8=&rs=AOn4CLDr-FmDmP3aCsD84l48ygBmkwHg-g` is transformed into `i.ytimg.com.fuzzy.replayweb.page/vi/-KpLmsAR23I/thumbnail.jpgsqp=-oaymwEmCIAKENAF8quKqQMa8AEB-AH-CYAC0AWKAgwIABABGHIgTyg-MA8=&rs=AOn4CLDr-FmDmP3aCsD84l48ygBmkwHg-g` instead of `i.ytimg.com.fuzzy.replayweb.page/vi/-KpLmsAR23I/thumbnail.jpg`.

This PR fixes the problem by updating the fuzzy rules and adding a minimal test set in JavaScript.
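
For illustration, the symptom described above can be reproduced with a minimal sketch. The two regular expressions and the `applyRule` helper are hypothetical and are not the actual warc2zim fuzzy rules. They only assume that a rule whose pattern consumes the `?` but not the rest of the query string leaves that remainder in place, because `String.prototype.replace` substitutes only the matched part.

```javascript
// Hypothetical rule shapes for illustration only (not the real warc2zim rules).
const HOST_SUFFIX = ".fuzzy.replayweb.page";

// Consumes the "?" but not what follows it, so the rest of the query survives.
const looseRule = /^(i\.ytimg\.com)(\/vi\/[^/]+\/)[^/?]+\.jpg\??/;

// Consumes the whole optional query string up to the end of the URL.
const strictRule = /^(i\.ytimg\.com)(\/vi\/[^/]+\/)[^/?]+\.jpg(\?.*)?$/;

// Rewrites the matched part of the URL to its fuzzy form.
function applyRule(rule, url) {
  return url.replace(
    rule,
    (_match, host, path) => host + HOST_SUFFIX + path + "thumbnail.jpg"
  );
}

const url =
  "i.ytimg.com/vi/-KpLmsAR23I/maxresdefault.jpg" +
  "?sqp=-oaymwEmCIAKENAF8quKqQMa8AEB-AH-CYAC0AWKAgwIABABGHIgTyg-MA8=" +
  "&rs=AOn4CLDr-FmDmP3aCsD84l48ygBmkwHg-g";

const looseResult = applyRule(looseRule, url);   // query remainder is kept
const strictResult = applyRule(strictRule, url); // query is dropped entirely
```

With these assumed patterns, `looseResult` keeps the `sqp=...` remainder glued to `thumbnail.jpg`, while `strictResult` ends at `thumbnail.jpg`.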

The long-term solution for testing all fuzzy rules in JS is described in #284.

codecov[bot] commented 1 month ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 83.38%. Comparing base (1cffd0a) to head (a8c5232).

:exclamation: Current head a8c5232 differs from pull request most recent head f946842

Please upload reports for the commit f946842 to get more accurate results.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main     #285      +/-   ##
==========================================
- Coverage   83.72%   83.38%   -0.35%
==========================================
  Files          13       13
  Lines        1223     1216       -7
  Branches      232      230       -2
==========================================
- Hits         1024     1014      -10
- Misses        153      155       +2
- Partials       46       47       +1
```


rgaudin commented 1 month ago

I don't think I can be of much help here, but the problem description is too vague IMO.

> trailing characters are not removed in Javascript

benoit74 commented 1 month ago

Good point, I added an example in the first comment.