openzim / python-scraperlib

Collection of Python code to re-use across Python-based scrapers
GNU General Public License v3.0
18 stars 16 forks source link

Return headers in save_file #28

Closed satyamtg closed 4 years ago

satyamtg commented 4 years ago

This fixes #27 by adding with_headers option to return the response headers instead of None in save_file() and save_large_file()

Contains the following changes -

These changes are backwards compatible but add new feature to the current API

codecov[bot] commented 4 years ago

Codecov Report

Merging #28 into master will not change coverage. The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff            @@
##            master       #28   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           19        19           
  Lines          743       749    +6     
=========================================
+ Hits           743       749    +6     
Impacted Files Coverage Δ
src/zimscraperlib/download.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 9c2eaab...20c76e3. Read the comment docs.

rgaudin commented 4 years ago

Hum, I don't like it much.

Btw wget may be removed in the future. It was introduced when moving initial stuff to this lib but may not be the best candidate for this task. We should think about the best way to have a robust, efficient, manageable and lightweight downloader in the scraperlib.

satyamtg commented 4 years ago
  • What's the use case? Most of the time, you'll make a a HEAD request prior to downloading stuff, so you can check whether you need to download the file at all. So you'll get the Content-Type there.

I agree. There's not much use cases for this except special ones like openedx (as discussed over slack)

  • We could return the headers from download_file unconditionally as it has zero cost (data's already in the request object) and I don't like params that changes the return type. Sometimes it's required but should be avoided as much as possible.

Did that

  • don't change req to res here. You must have noticed we use this req everywhere in the scrapers. I agree it's not ideal but what's res ? I already see multiple candidates for what this could mean. As this holds a Response object I propose we use resp which seems clear to me. What do you think? In that case we shall update the test as well.

resp sounds better

  • we definitely don't want to parse wget's output. We should encourage users to do a HEAD before running a large download.

Yup. That may not be required for large files, so removed that.