satyamtg commented 4 years ago

This fixes #27 by adding with_headers option to return the response headers instead of None in save_file() and save_large_file()

Contains the following changes -

~~with_headers option to return the response headers instead of None in save_file() and save_large_file()~~ We return headers by default in save_file only
req renamed to ~~res~~ resp in save_file() as that's essentially a response instead of a request
updated tests to test the headers (via assert_headers())
Retries and timeout is added in save_file()

These changes are backwards compatible but add new feature to the current API

codecov[bot] commented 4 years ago

Codecov Report

Merging #28 into master will not change coverage. The diff coverage is 100.00%.

@@            Coverage Diff            @@
##            master       #28   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           19        19           
  Lines          743       749    +6     
=========================================
+ Hits           743       749    +6

Impacted Files	Coverage Δ
src/zimscraperlib/download.py	`100.00% <100.00%> (ø)`

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 9c2eaab...20c76e3. Read the comment docs.

rgaudin commented 4 years ago

Hum, I don't like it much.

What's the use case? Most of the time, you'll make a a HEAD request prior to downloading stuff, so you can check whether you need to download the file at all. So you'll get the Content-Type there.
We could return the headers from download_file unconditionally as it has zero cost (data's already in the request object) and I don't like params that changes the return type. Sometimes it's required but should be avoided as much as possible.
don't change req to res here. You must have noticed we use this req everywhere in the scrapers. I agree it's not ideal but what's res ? I already see multiple candidates for what this could mean. As this holds a Response object I propose we use resp which seems clear to me. What do you think? In that case we shall update the test as well.
we definitely don't want to parse wget's output. We should encourage users to do a HEAD before running a large download.

Btw wget may be removed in the future. It was introduced when moving initial stuff to this lib but may not be the best candidate for this task. We should think about the best way to have a robust, efficient, manageable and lightweight downloader in the scraperlib.

satyamtg commented 4 years ago

What's the use case? Most of the time, you'll make a a HEAD request prior to downloading stuff, so you can check whether you need to download the file at all. So you'll get the Content-Type there.

I agree. There's not much use cases for this except special ones like openedx (as discussed over slack)

We could return the headers from download_file unconditionally as it has zero cost (data's already in the request object) and I don't like params that changes the return type. Sometimes it's required but should be avoided as much as possible.

Did that

don't change req to res here. You must have noticed we use this req everywhere in the scrapers. I agree it's not ideal but what's res ? I already see multiple candidates for what this could mean. As this holds a Response object I propose we use resp which seems clear to me. What do you think? In that case we shall update the test as well.

resp sounds better

we definitely don't want to parse wget's output. We should encourage users to do a HEAD before running a large download.

Yup. That may not be required for large files, so removed that.

openzim / python-scraperlib

Return headers in save_file #28

Codecov Report