Closed satyamtg closed 4 years ago
Codefactor may complain about complexity (my editor didn't, but its config is different). In that case we'd move those methods back.
Codefactor doesn't complain, so I kept it there. Also, the change broke link fixing: links starting with path_prefix were never fixed, because the absolute links were fixed first. I fixed that.
Also made another module, html_processor.py, and moved all HTML parsing, dependency downloading and link fixing there. I did this because scraper.py had grown too long. Also renamed dl_dependencies to dl_dependencies_and_fix_links and added docstrings for them. Changed the attribute checking for wiki and forum: we now check the value of self.wiki or self.forum.
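As a rough illustration of checking the value of self.wiki or self.forum, here is a minimal sketch. The class name, constructor arguments and the enabled_extras() helper are hypothetical and not taken from the scraper; the only assumption carried over from the comment above is that the attributes always exist and are tested by value.

```python
class CourseScraper:
    """Hypothetical stand-in for the scraper class (not the real one)."""

    def __init__(self, wiki=None, forum=None):
        # Both attributes are always set, so callers test their value
        # rather than testing whether the attribute exists.
        self.wiki = wiki
        self.forum = forum

    def enabled_extras(self):
        """Return the names of the extras whose value is truthy."""
        return [name for name in ("wiki", "forum") if getattr(self, name)]
```

With this shape, `if self.wiki:` is enough at the call site, and a missing wiki is represented by None rather than by a missing attribute.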
This fixes #53 and uses pylibzim to create ZIMs. It currently relies on https://github.com/openzim/python_scraperlib/pull/34, so its requirements point to that branch. It also fixes #24, which was necessary to make pylibzim work.
Openedx instances have many root-relative links. We now fix them correctly: if the page a link points to is present in the ZIM, the link is made relative instead of root-relative; otherwise it is turned into an external URL by adding the instance netloc.
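The rule for root-relative links can be sketched as follows. This is only an illustration under stated assumptions: fix_root_relative(), zim_paths and the example instance URL are hypothetical names, not the functions in scraper.py, and the sketch ignores query strings and fragments when the target is inside the ZIM.

```python
import posixpath
from urllib.parse import urljoin, urlsplit


def fix_root_relative(href, current_page, zim_paths, instance_url):
    """Rewrite a root-relative href (hypothetical helper).

    href          -- link found in the page, e.g. "/courses/x/about"
    current_page  -- ZIM path of the page containing the link
    zim_paths     -- set of paths (no leading slash) stored in the ZIM
    instance_url  -- base URL of the instance, e.g. "https://edx.example.org"
    """
    # Only root-relative links are handled here; "//host/..." is
    # scheme-relative and left alone.
    if not href.startswith("/") or href.startswith("//"):
        return href
    target = urlsplit(href).path.strip("/")
    if target in zim_paths:
        # Target is offline: make the link relative to the current page.
        start = posixpath.dirname(current_page) or "."
        return posixpath.relpath(target, start=start)
    # Target is not in the ZIM: point to the live instance instead.
    return urljoin(instance_url, href)
```

For example, a link to /courses/x/about found in courses/x/home/index.html becomes ../about when courses/x/about is in the ZIM, while /static/img.png becomes a full URL on the instance when it is not.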
The following changes are made in scraper.py related to link rewriting:
get_course_tabs() now only gets the course tabs, and the new annex() method actually downloads the content. get_course_tabs() is reused in rewrite_internal_links(), as we do not offline all tabs.
handle_jump_to_path() takes a jump_to type URL, finds the xblock with that URL in the list of xblock_extractor objects, checks whether the xblock is a vertical or a course, and returns the modified link. As only course and vertical xblocks have HTML pages, we also look at their descendants for linkable xblocks here.
relative_dots() prepares a path of backward jumps, according to the number of parts in the path.
update_root_relative_path() ensures that no root-relative URLs are left behind, by putting the instance_url in place of the netloc.
rewrite_internal_links() is the main manager method; it calls the other functions. For jump_to links, if the first try does not yield a path, we retry with the parent, as the link may point to an xblock that the vertical xblock is made of.
Note: this depends on a future release of zimscraperlib.
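Two of the helpers listed above lend themselves to a short sketch: the "backward jumps" prefix and the retry-with-parent step for jump_to links. The signatures below are assumptions made for illustration, and the paths and parents dictionaries are hypothetical stand-ins for the xblock_extractor objects; the real methods in scraper.py may differ.

```python
from pathlib import PurePosixPath


def relative_dots(page_path):
    """Return one "../" per directory level of page_path (assumed signature)."""
    depth = len(PurePosixPath(page_path).parent.parts)
    return "../" * depth


def resolve_jump_to(block_id, paths, parents):
    """Find the ZIM path for a jump_to target (hypothetical helper).

    paths   -- maps block ids that have an HTML page to their ZIM path
    parents -- maps a block id to the id of its parent block
    """
    # First try: the block itself has a page.
    path = paths.get(block_id)
    if path is None:
        # Second try: the block may only be a part of a vertical,
        # so fall back to the parent's page.
        parent_id = parents.get(block_id)
        if parent_id is not None:
            path = paths.get(parent_id)
    return path
```

A page at course/tab/index.html gets the prefix ../../, and a jump_to link to a block without its own page resolves to the page of its parent, or to None if neither is known.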