Refactored file downloading

openzim / openedx

Open edX (to zim) scraper

GNU General Public License v3.0


Refactored file downloading #38

Closed satyamtg closed 4 years ago

satyamtg commented 4 years ago

This introduces the following changes:

- Use save_large_file() from zimscraperlib.download for downloading instead of custom implementation
- Use python-magic for filetype identification instead of headers

This fixes #36

~~Will be opened for review once scraperlib is updated~~ ~~Depends on https://github.com/openzim/python_scraperlib/pull/28~~ This now uses HEAD requests

rgaudin commented 4 years ago

Quick comment: magic is a slow process. We usually want to use it only when we have no other way to get the file type, or when the chance of erroneous information is high and the consequences are significant.
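The trade-off described here can be sketched roughly as follows: trust the cheap Content-Type header when it is present and specific, and call a slower content sniffer only as a fallback. This is a minimal illustration, not the scraper's actual code. The names `mime_from_headers`, `identify_filetype`, and `GENERIC_TYPES`, as well as the choice of which types count as "too generic", are assumptions made for this example.

```python
from typing import Callable, Mapping, Optional

# Header values treated as too generic to trust (an assumption for this sketch).
GENERIC_TYPES = {"", "application/octet-stream", "binary/octet-stream"}


def mime_from_headers(headers: Mapping[str, str]) -> Optional[str]:
    """Return the bare MIME type from a Content-Type header, or None if unusable."""
    for name, value in headers.items():
        if name.lower() == "content-type":
            # Drop parameters such as "; charset=utf-8" and normalise case.
            mime = value.split(";", 1)[0].strip().lower()
            return None if mime in GENERIC_TYPES else mime
    return None


def identify_filetype(
    headers: Mapping[str, str], path: str, sniff: Callable[[str], str]
) -> str:
    """Prefer the header value; run the slow sniffer only when it is missing."""
    return mime_from_headers(headers) or sniff(path)
```

With python-magic installed, the `sniff` argument could be something like `lambda p: magic.from_file(p, mime=True)`, so the slow path runs only for files whose headers are missing or uninformative.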

satyamtg commented 4 years ago

> Quick comment: magic is a slow process. We usually want to use it only when we have no other way to get the file type, or when the chance of erroneous information is high and the consequences are significant.

Okay. I know magic is slow, but I used it because downloading with save_large_file() from zimscraperlib.download means we do not get the response headers, and making another request would mean a slightly longer wait. Maybe we should refactor save_large_file() in zimscraperlib to also return the headers when required. I have opened this PR to address that: https://github.com/openzim/python_scraperlib/pull/28. I will revert to the original way of filetype checking for now.

satyamtg commented 4 years ago

So, this now uses save_large_file() from zimscraperlib.download. I've also made the following changes:

- Use a HEAD request to get the headers
- Make prepare_url() independent of download(), since it only builds a downloadable URL from the parts of the supplied URL
- Introduce get_headers() to perform the HEAD request and return the headers
- Pin the version of mistune
- Remove unnecessary try-except blocks around downloads (subprocess.CalledProcessError is already caught in the download function)
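For illustration, a HEAD-based helper along the lines of get_headers() could look like the sketch below. It is not the code from this PR: it uses only the standard library's urllib, while the scraper may well use a different HTTP client. The `timeout` parameter and the lower-casing of header names are assumptions added for this example.

```python
import urllib.request
from typing import Dict


def get_headers(url: str, timeout: float = 10.0) -> Dict[str, str]:
    """Issue a HEAD request and return the response headers.

    Header names are lower-cased so callers can look them up consistently.
    Only headers are transferred; the body is never downloaded.
    """
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return {name.lower(): value for name, value in response.headers.items()}
```

A caller would then read something like `get_headers(url).get("content-type")` before handing the same URL to the downloader, at the cost of one extra round trip per file.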