Change some variable names - many variable names are changed to sound a bit clearer. For example, path has been changed to relative_path, where applicable, book_list_list has been changed to book_lists, head is changed to head_course_xblock, top is changed to course_tabs (as this basically holds links in the nav menu at the top). There are of course several other small variable name changes (mostly in scraper.py and the xblock_extractor objects)
Use ZimInfo to build ZIM - We use zimscraperlib function to build ZIMs in a similar way we do for TED
Use pathlib instead of os.path - I've tried to implement all paths using pathlib
Refactor xblocks_extractractors - The various objects shared a lot of code. So, I created a base class. There's a bit more we can do here. But I didn't want to combine everything into this PR as it's already quite a lot of changes. Will open tickets for this after the merge.
Use proper logging from zimscraperlib - We now use logger from zimscraperlib to have better logs
Removed unused imports - There were many unused imports. I've tried to remove all of them. I've also made the import ordering better.
Use a more sensible folder structure - The folders inside the course directories had a folder in them with the same name which held the index.html . I've removed this and this makes it a bit simpler to understand the folder structure.
Please note that utils.py is not fully refactored in this as adding other features/ fixing issues would involve changing that file largely. Also. currently the language is set to "eng" for the ZIM. I'll change that while fixing #49
This shall close the following - #45, #46, #43
The argument structure is now modified as follows -
usage: openedx2zim [-h] --course-url COURSE_URL --email EMAIL [--password PASSWORD] --name NAME
[--title TITLE] [--description DESCRIPTION] [--creator CREATOR]
[--publisher PUBLISHER] [--tags TAGS] [--convert-in-webm]
[--ignore-missing-xblocks] [--lang LANG] [--add-wiki] [--add-forum]
[--output OUTPUT_DIR] [--tmp-dir TMP_DIR] [--zim-file FNAME]
[--no-fulltext-index] [--no-zim] [--keep] [--debug] [--version]
Scraper to create ZIM files MOOCs on openedx instances
optional arguments:
-h, --help show this help message and exit
--course-url COURSE_URL
URL of the course you wnat to scrape
--email EMAIL Your registered e-mail ID on the platform. Used for authentication
--password PASSWORD The password to your registered account on the platform. If you don't
provide one here, you'll be asked for it later
--name NAME ZIM name. Used as identifier and filename (date will be appended)
--title TITLE Custom title for your ZIM. Based on MOOC otherwise.
--description DESCRIPTION
Custom description for your ZIM. Based on MOOC otherwise.
--creator CREATOR Name of content creator
--publisher PUBLISHER
Custom publisher name (ZIM metadata)
--tags TAGS List of comma-separated Tags for the ZIM file. category:openedx, openedx,
and _videos:yes (if present) added automatically
--convert-in-webm Re-encode videos to WebM
--ignore-missing-xblocks
Ignore unsupported content (xblock)
--lang LANG Default language of the interface and the ZIM content (ISO-639-1 codes)
--add-wiki Add wiki (if available) to the ZIM
--add-forum Add forum (if available) to the ZIM
--output OUTPUT_DIR Output folder for ZIM file
--tmp-dir TMP_DIR Path to create temp folder in. Used for building ZIM file. Receives all
data
--zim-file FNAME ZIM file name (based on --name if not provided)
--no-fulltext-index Don't index the scraped content in the ZIM
--no-zim Don't produce a ZIM file, create build folder only.
--keep Don't remove build folder on start (for debug/devel)
--debug Enable verbose output
--version Display scraper version and exit
Here's a test ZIM that I created -
test_2020-07.zip (Please ignore the random strings that I put for title and description)
I've matched this with what the original scraper without any changes did and they seem identical. There is a scope of improvements but we're not looking at the frontend part now. Maybe sometime later.
This basically does the following refactors -
Please note that utils.py is not fully refactored in this as adding other features/ fixing issues would involve changing that file largely. Also. currently the language is set to "eng" for the ZIM. I'll change that while fixing #49
This shall close the following - #45, #46, #43
The argument structure is now modified as follows -
Here's a test ZIM that I created - test_2020-07.zip (Please ignore the random strings that I put for title and description)
I've matched this with what the original scraper without any changes did and they seem identical. There is a scope of improvements but we're not looking at the frontend part now. Maybe sometime later.