openzim / openedx

Open edX (to zim) scraper
GNU General Public License v3.0

Use pylibzim to create ZIM #70

Closed satyamtg closed 4 years ago

satyamtg commented 4 years ago

This fixes #53 by using pylibzim to create ZIMs. It currently relies on https://github.com/openzim/python_scraperlib/pull/34, so the requirement points at that branch for now. It also fixes #24, which was necessary to make pylibzim work.

Open edX instances have many root-relative links. We now rewrite each of them: if the page it points to is present in the ZIM, the link becomes a plain relative link (no longer root-relative); otherwise it becomes an external URL, built by adding the instance netloc.
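To illustrate the rule above, here is a minimal sketch of that kind of rewrite. It is not the code from scraper.py: the function name `rewrite_root_relative`, its parameters, the `https` default and the example host are all assumptions made for illustration, and it only uses the standard library.

```python
import posixpath
from urllib.parse import urlsplit


def rewrite_root_relative(href, current_page, zim_paths, instance_netloc, scheme="https"):
    """Rewrite a root-relative href found on the page stored at `current_page`.

    Hypothetical helper: `zim_paths` is assumed to be a set of paths
    (without leading slash) of the pages that are present in the ZIM.
    """
    parts = urlsplit(href)
    # Only handle root-relative links: leading "/" with no scheme and no host.
    if parts.scheme or parts.netloc or not parts.path.startswith("/"):
        return href

    target = parts.path.lstrip("/")
    if target in zim_paths:
        # Target is in the ZIM: make it relative to the current page's folder.
        base_dir = posixpath.dirname(current_page) or "."
        relative = posixpath.relpath(target, start=base_dir)
        return relative + ("#" + parts.fragment if parts.fragment else "")

    # Target is not in the ZIM: point to the online instance instead.
    return f"{scheme}://{instance_netloc}{href}"


pages = {"course/unit2/index.html"}
print(rewrite_root_relative("/course/unit2/index.html", "course/unit1/index.html",
                            pages, "courses.example.org"))
print(rewrite_root_relative("/wiki/some-page", "course/unit1/index.html",
                            pages, "courses.example.org"))
```

The first call yields a link relative to the current page, the second an absolute URL on the (made-up) instance host. Links that already carry a scheme or host are returned untouched.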

The following changes are made in scraper.py related to link rewriting -

Note: this depends on a future release of zimscraperlib.

satyamtg commented 4 years ago

Codefactor may complain about complexity (my editor didn't, but its config is different). In that case we'd move those methods back.

Codefactor doesn't complain, so I kept them there. However, the change broke link fixing: links starting with path_prefix were never fixed, because the absolute links had already been fixed before them. That is fixed now.

I also added another module, html_processor.py, and moved all HTML parsing, dependency downloading and link fixing there, because scraper.py had grown too long. I renamed dl_dependencies to dl_dependencies_and_fix_links and added docstrings for these methods. The wiki and forum checks no longer test for the attribute; they now check the value of self.wiki or self.forum.
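For context on the last point, a small sketch of the difference between checking for an attribute and checking its value. The class below is hypothetical and far simpler than the real scraper; only the attribute names `wiki` and `forum` come from this PR.

```python
class Instance:
    """Hypothetical stand-in for the scraper object."""

    def __init__(self, add_wiki=False, add_forum=False):
        # Both attributes always exist; None means "not requested".
        self.wiki = "wiki-handler" if add_wiki else None
        self.forum = "forum-handler" if add_forum else None

    def extras_by_attribute(self):
        # Attribute check: true even when the value is None.
        return [name for name in ("wiki", "forum") if hasattr(self, name)]

    def extras_by_value(self):
        # Value check: only counts extras that were actually set up.
        extras = []
        if self.wiki:
            extras.append("wiki")
        if self.forum:
            extras.append("forum")
        return extras


instance = Instance(add_wiki=True)
print(instance.extras_by_attribute())
print(instance.extras_by_value())
```

With attributes that are always defined, the attribute check reports both extras while the value check reports only the wiki, which is the behaviour wanted here.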