106 was due to iframes, and scraping iframes can be very tricky since they can contain any arbitrary kind of content. However, since the iframes appear to be part of the course, this change makes the scraper attempt to fetch iframes recursively (though it cannot always retrieve all of their content). For example, in one of the iframes everything lives inside a script tag, which makes it difficult to parse and download every asset.
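As a rough illustration of the recursive approach (the actual logic lives in dl_dependencies_and_fix_links()), here is a minimal sketch assuming BeautifulSoup and requests are available; the function name and helpers below are hypothetical:

```python
# Minimal sketch only: the real logic lives in dl_dependencies_and_fix_links();
# scrape_iframes() and its parameters are hypothetical illustrations.
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def scrape_iframes(html, base_url, depth=0, max_depth=3):
    """Recursively fetch the documents referenced by <iframe> tags."""
    soup = BeautifulSoup(html, "html.parser")
    if depth >= max_depth:
        return soup
    for iframe in soup.find_all("iframe", src=True):
        # Resolve relative iframe sources against the page that embeds them.
        iframe_url = urljoin(base_url, iframe["src"])
        try:
            inner_html = requests.get(iframe_url, timeout=30).text
        except requests.RequestException:
            # Iframes can contain any kind of content; skip what cannot be fetched.
            continue
        # Recurse so that iframes nested inside iframes are processed too.
        # Content generated purely by <script> tags still cannot be captured.
        scrape_iframes(inner_html, iframe_url, depth + 1, max_depth)
    return soup
```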
This makes the following changes:

- Introduce two new optional arguments to dl_dependencies_and_fix_links(), namely netloc and path_on_server, which allow running the HTML parsing and asset downloading on URLs other than the instance URL (as is needed for iframes).
- Modify prepare_url() in utils.py to take the current working path on the server into account so that assets are resolved more reliably (sketched below).
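To make the second item concrete, here is a minimal sketch of how a prepare_url() that receives the current netloc and working path might resolve the different kinds of src values found in HTML; the real function in utils.py likely differs, and the exact signature shown here is an assumption:

```python
# Sketch only: the real prepare_url() in utils.py may have a different
# signature and handle more cases; this just shows the role of netloc and
# path_on_server in resolving asset URLs.
from urllib.parse import urljoin


def prepare_url(src, netloc, path_on_server=""):
    """Turn a src attribute into an absolute URL.

    netloc is the host the current document came from (the instance by
    default, or the iframe's host when parsing iframe content), and
    path_on_server is the directory of the current document so that plain
    relative paths such as "img/logo.png" resolve correctly.
    """
    if src.startswith(("http://", "https://")):
        return src
    if src.startswith("//"):
        # Protocol-relative URL: it already names its own host.
        return "https:" + src
    if src.startswith("/"):
        # Host-relative path: resolve against the current netloc.
        return "https://" + netloc + src
    # Plain relative path: resolve against the current working path.
    prefix = path_on_server.strip("/")
    base = "https://" + netloc + "/" + (prefix + "/" if prefix else "")
    return urljoin(base, src)
```

For example (hypothetical host), an asset referenced as "static/logo.png" inside an iframe served from example.org/course/unit1/ would resolve to https://example.org/course/unit1/static/logo.png instead of being looked up on the instance root.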
Also, an unrelated fix:

- Return the original root path string from get_root_from_asset() if path_from_html is empty.
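The fix itself is a simple guard; a minimal, hypothetical sketch of the shape of get_root_from_asset() follows, where everything beyond the early return is an assumption about how the relative root is normally computed:

```python
# Hypothetical sketch: only the early return for an empty path_from_html is
# what this change is about; the rest of the body is an assumed illustration
# of computing a relative prefix back to the root.
def get_root_from_asset(path_from_html, root_path):
    if not path_from_html:
        # Nothing to make relative: return the original root path string.
        return root_path
    # Assumed: climb one level per directory component of the asset path.
    depth = path_from_html.count("/")
    return "/".join([".."] * depth + [root_path.strip("/")])
```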
It is important to note that although this makes the scraper attempt to fetch iframes, it does not guarantee that iframes will always be scraped completely; a fully generic scraper would be needed for that.