openzim / warc2zim

Command line tool to convert a file in the WARC format to a file in the ZIM format
https://pypi.org/project/warc2zim/
GNU General Public License v3.0

Use mimetype to selectively rewrite only html documents #315

Closed benoit74 closed 2 weeks ago

benoit74 commented 2 weeks ago

Fix #313

Note: there is no additional test because I failed to reproduce the issue with the crawler. In general, the crawler retrieves PDFs with Direct Fetch, and in that case we do not have the WARC-Resource-Type header.

This also explains why I did not encounter this situation during my tests before the 2.0.1 release.

Under some conditions the crawler does not use Direct Fetch, and we evidently encountered that situation in production.
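
For illustration only, here is a minimal sketch of what "use mimetype to selectively rewrite only html documents" could look like. It is not the code from this PR: the function names, the `HTML_MIMETYPES` set, and the choice to read the mimetype from the record's `Content-Type` value are all assumptions made for the example.

```python
from typing import Optional

# Assumption for this sketch: these mimetypes are the ones treated as HTML.
HTML_MIMETYPES = {"text/html", "application/xhtml+xml"}


def normalize_mimetype(content_type: Optional[str]) -> str:
    """Strip parameters such as '; charset=utf-8' and lowercase the mimetype."""
    if not content_type:
        return ""
    return content_type.split(";", 1)[0].strip().lower()


def should_rewrite_as_html(content_type: Optional[str]) -> bool:
    """Hypothetical helper: decide on HTML rewriting from the mimetype alone.

    The point of the sketch is that a record is only rewritten as HTML when
    its mimetype says it is HTML, whatever other record-level hints say.
    """
    return normalize_mimetype(content_type) in HTML_MIMETYPES
```

With a check of this kind, a PDF record would be left untouched whether or not it was retrieved with Direct Fetch, because `should_rewrite_as_html("application/pdf")` is false while `should_rewrite_as_html("text/html; charset=utf-8")` is true.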

codecov[bot] commented 2 weeks ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 84.06%. Comparing base (f85c8d8) to head (5cb3a75).

Additional details and impacted files

```diff
@@           Coverage Diff           @@
##             main     #315   +/-   ##
=======================================
  Coverage   84.06%   84.06%
=======================================
  Files          14       14
  Lines        1268     1268
  Branches      249      249
=======================================
  Hits         1066     1066
  Misses        155      155
  Partials       47       47
```
