lukasschwab / arxiv.py

Python wrapper for the arXiv API
MIT License
1.07k stars 120 forks source link

Modified "download" so it can download source .tar.gz file #37

Closed Miguel-ASM closed 4 years ago

Miguel-ASM commented 4 years ago

Description

Modified "download" so it can download source .tar.gz file

To download source, the keyword argument "source" must be passed with True value. Otherwise it downloads the .pdf.

Relevant issues

download preprint source files #36

Checklist

lukasschwab commented 4 years ago

Thanks for the PR! I'll aim to review it tonight.

Miguel-ASM commented 4 years ago

Thanks for the review!

I am kind of busy at work right now, but I will try to tackle everything.

Cheers!

El jue., 20 feb. 2020 a las 6:10, Lukas Schwab (notifications@github.com) escribió:

@lukasschwab requested changes on this pull request.

Requested some changes below––most significantly, we need to check the content type header to select a download path.

Also, a few more requests:

  1. README.md needs to be updated to reflect the API change (the new optional parameter).
  2. Two cases should be added to tests/test_download.py:
    1. A source download when a tarfile source is available.
    2. A source download when a tarfile source is unavailable, which should produce a PDF.

Can use https://arxiv.org/pdf/2002.07816 as a tarfile example and https://arxiv.org/pdf/1707.08567v1 as a PDF example.

Feel free to tackle this and tag me for a re-review!

If I've outlined too much work, let me know––I can tag this as a feature request and implement it myself (probably in the next week or so). I think the functionality is certainly useful.

In arxiv/arxiv.py https://github.com/lukasschwab/arxiv.py/pull/37#discussion_r381749530:

@@ -233,12 +233,22 @@ def slugify(obj):

 return filename

-def download(obj, dirpath='./', slugify=slugify):

+def download(obj, dirpath='./', slugify=slugify, source=False):

  • """

  • Download the .pdf ob object. If source==True download the

Please remove the extra ob here.

In arxiv/arxiv.py https://github.com/lukasschwab/arxiv.py/pull/37#discussion_r381755861:

 if not obj.get('pdf_url', ''):
     print("Object has no PDF URL.")

     return

 if dirpath[-1] != '/':

     dirpath += '/'
  • path = dirpath + slugify(obj) + '.pdf'

  • urlretrieve(obj['pdf_url'], path)

  • if source:

  • url = re.sub(r'/pdf/', r'/e-print/',obj['pdf_url'])

  • path = dirpath + slugify(obj) + '.tar.gz'

Major issue: this is not guaranteed to point to a tarfile; indeed, in many cases it will just be a PDF. From arXiv's stock /format descriptions: https://arxiv.org/format/2001.10010

Source Delivered as a gzipped tar (.tar.gz) file if there are multiple files, otherwise as a PDF file, or a gzipped TeX, DVI, PostScript or HTML (.gz, .dvi.gz, .ps.gz or .html.gz) file depending on submission format.

For example, https://arxiv.org/e-print/2002.07816 is a tarfile; https://arxiv.org/e-print/1707.08567v1 is a PDF.

The correct behavior here would be to select a filename based on the Content-Type of the e-print address. Two ways of doing this:

1.

Easy but slow: first make a HEAD request to the e-print URL; if application/x-eprint-tar, proceed for a tarfile; else, proceed for a PDF. 2.

Fast, but with filesystem complexity: urlretrieve the e-print URL without setting the destination path, which puts the file in a temporary directory. Look up the Content-Type header in the returned tuple, construct a content-type-dependent path, and move the file from its temporary location to that path.


In arxiv/arxiv.py https://github.com/lukasschwab/arxiv.py/pull/37#discussion_r381756646:

@@ -233,12 +233,22 @@ def slugify(obj):

 return filename

-def download(obj, dirpath='./', slugify=slugify):

+def download(obj, dirpath='./', slugify=slugify, source=False):

I'd prefer a more descriptive argument name here; source has multiple meanings in the context of a download, and (as discussed below) this might still yield a PDF download.

Suggestion: prefer_source_tarfile=False.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/lukasschwab/arxiv.py/pull/37?email_source=notifications&email_token=AEXTT4YYOA5UVBIESA7IF3LRDYGFFA5CNFSM4KX3LHT2YY3PNVWWK3TUL52HS4DFWFIHK3DMKJSXC5LFON2FEZLWNFSXPKTDN5WW2ZLOORPWSZGOCWHAY5A#pullrequestreview-361630836, or unsubscribe https://github.com/notifications/unsubscribe-auth/AEXTT45AFXKVXXE3J6L7CXDRDYGFFANCNFSM4KX3LHTQ .

-- Miguel Ángel Simón Martínez, MSc

Quantum Information, Science and Technology Group Departamento de Química Física Universidad del País Vasco (UPV/EHU) miguelangel.simon@ehu.eus

Miguel-ASM commented 4 years ago

@lukasschwab I addressed the changes. However, I found that to be sure that what is being downloaded is a tarfile, I just changed e-print to /src in the URIs. See the following from arXiv mediatypes:

Source files arXiv supports two different URI patterns for accessing source files, /e-print/ID and /src/ID. The /e-print/ID form delivers a single file with the appropriate content type for single-file submission, or a gzipped tar file for multiple source files (most TeX s ubmissions or multi-file PS). The /src/ID form always delivers a gzipped tar file.

The new test cases I provide only check that the downloaded file is created correctly, as in the other tests of the class, since it is now guaranteed to be a gzipped file.

lukasschwab commented 4 years ago

Thanks for the updates! I will re-review tonight.

At first glance, the /src/ work-around seems great.

lukasschwab commented 4 years ago

Looks good! I'll merge, make some minor fixups, and then roll PyPI release (v0.5.3).

@Miguel-ASM thank you for this contribution! You'll appear in the "Contributors" image in the README automatically when the service that generates it invalidates their cache.

Miguel-ASM commented 4 years ago

@lukasschwab My pleasure to help. I hope to contribute again any time soon to your project .