Closed benoit74 closed 6 months ago
Attention: Patch coverage is 16.66667% with 5 lines in your changes missing coverage. Please review.
Project coverage is 14.91%. Comparing base (857ae56) to head (5c71674). Report is 1 commit behind head on zimit2.
Files | Patch % | Lines |
---|---|---|
src/zimit/zimit.py | 16.66% | 5 Missing :warning: |
"Luckily", tests are failing due to https://github.com/openzim/warc2zim/pull/198 (but even once this is merged, we still need to wait for https://github.com/openzim/warc2zim/pull/196)
@mgautierfr I did not ask you for a formal review of this since, as far as I understand, you are less experienced with zimit, but do not hesitate to have a look and comment as well
I had to fix the tests by updating the number of expected WARC records from 8 to 7, because the "weird / unexpected" https://dict.brave.com/edgedl/chrome/dict/en-us-10-1.bdic is no longer in the WARC (item #3 below)
Before:
After:
Review welcome again: changing a test "to make it work" probably needs to be confirmed as OK 🤣
Done, commit updated.
👍
Fix #256 Fix #284 Fix #166
This PR adopts browsertrix crawler ~~1.0.0-beta5~~ 1.0.0-beta.6. Among other things, this release now handles redirects nicely (https://github.com/webrecorder/browsertrix-crawler/pull/476).
We hence have to remove the redirect handling we previously did on our side, which caused issues (#256). We keep only the cleaning of the URL (removal of default ports 443 and 80).
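For readers unfamiliar with what "cleaning of the URL" means here, the following is a minimal sketch of dropping a default port from a URL using only the Python standard library. The function name `strip_default_port` and the `DEFAULT_PORTS` mapping are hypothetical and are not taken from zimit's code; the actual `check_url` implementation may differ.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption for this sketch: only these two scheme/port pairs count as "default".
DEFAULT_PORTS = {"http": 80, "https": 443}


def strip_default_port(url: str) -> str:
    """Return url without its port when the port is the default for its scheme.

    Hypothetical helper, for illustration only.
    """
    parts = urlsplit(url)
    # Leave the URL untouched when there is no explicit port,
    # or when the port is not the default one for this scheme.
    if parts.port is None or DEFAULT_PORTS.get(parts.scheme) != parts.port:
        return url
    # Preserve any "user:password@" prefix present in the network location.
    userinfo, at, _ = parts.netloc.rpartition("@")
    host = parts.hostname or ""
    if ":" in host:
        # IPv6 literals must keep their brackets once the port is removed.
        host = f"[{host}]"
    return urlunsplit(parts._replace(netloc=f"{userinfo}{at}{host}"))
```

Note that a port is only removed when it matches the scheme, so `http://example.com:443/` would be left unchanged in this sketch.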
As a side-effect, this will also solve #166, since browsertrix crawler is already permissive about SSL certificate issues. The only SSL issues which will continue to block are the ones where the browser cannot establish the connection at all, like https://panzer-war.com/ where the browser has no cipher in common with the server
Redirect handling has been tested with https://metafilter.com:
Handling of insecure connection with https://www.moneyinstructor.com (which still fails without the simplification of check_url):
This PR should not be merged before https://github.com/openzim/warc2zim/pull/196