openzim / mindtouch

libretexts.org to ZIM scraper
GNU General Public License v3.0

Many fixes for reliability of the scraper #78

Closed benoit74 closed 4 days ago

benoit74 commented 1 week ago

Fix #74 Fix #76 Fix #77 (and glossary had the same problem)

Workaround for #71 (real solution postponed to "later") and many other likely situations where we encounter an "unknown" src/href/srcset (inline JS and CSS, ...)

Changes: see list of commits

Some remarks:

codecov[bot] commented 1 week ago

Codecov Report

Attention: Patch coverage is 13.23529% with 118 lines in your changes missing coverage. Please review.

Project coverage is 43.14%. Comparing base (5236501) to head (ad99067). Report is 10 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| scraper/src/mindtouch2zim/processor.py | 1.04% | 95 Missing :warning: |
| scraper/src/mindtouch2zim/asset.py | 23.07% | 20 Missing :warning: |
| scraper/src/mindtouch2zim/entrypoint.py | 0.00% | 1 Missing :warning: |
| scraper/src/mindtouch2zim/html_rewriting.py | 88.88% | 1 Missing :warning: |
| scraper/src/mindtouch2zim/utils.py | 0.00% | 1 Missing :warning: |
Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main      #78      +/-   ##
==========================================
- Coverage   43.85%   43.14%   -0.72%
==========================================
  Files          15       15
  Lines         969      978       +9
  Branches      133      133
==========================================
- Hits          425      422       -3
- Misses        529      545      +16
+ Partials       15       11       -4
```
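
As a sanity check on the figures in this report, the sketch below recomputes the project and patch coverage from the reported hits, lines, and missing counts. Two things here are assumptions rather than anything the report states: the helper name `truncated_pct` is made up for illustration, and truncating (rather than rounding) to two decimals is only inferred from the displayed values.

```python
def truncated_pct(hits: int, lines: int) -> float:
    """Coverage percentage truncated to two decimals (assumed display rule)."""
    return (hits * 10000 // lines) / 100


# Project totals taken from the coverage diff: (hits, lines).
base = truncated_pct(425, 969)  # main
head = truncated_pct(422, 978)  # this PR

# Patch coverage: sum the per-file "Missing" counts from the table.
patch_missing = 95 + 20 + 1 + 1 + 1

# Back out the patch size from the reported 13.23529% patch coverage.
patch_lines = round(patch_missing / (1 - 13.23529 / 100))
patch_hits = patch_lines - patch_missing

print(base, head)                              # 43.85 43.14
print(patch_missing, patch_lines, patch_hits)  # 118 136 18
```

The per-file rows account for 133 changed lines, so the remaining 3 presumably sit in files with no missing lines, which the table does not list.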


benoit74 commented 1 week ago

For the record, I abused the dev Docker image by building it from this branch as well, just to be able to run the scraper ASAP in Zimfarm. Since we have not released 0.1 yet, who cares.

benoit74 commented 4 days ago

I've opened https://github.com/openzim/mindtouch/issues/91 for the Exception; there are many more than the ones you're mentioning here. But good point indeed.