'papers extract' results in a call with nonsensical arguments to pdftotext

andr-agus commented 10 months ago

Hi,

I've been using 'papers' for quite a while now and this is the first time I've seen this issue. I am trying to extract the bibliographic info of this article* from its PDF. The program throws this exception:

Command Line Error: Wrong page range given: the first page (2) can not be after the last page (1).
Traceback (most recent call last):
  File "/usr/bin/papers", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/lib/python3.11/site-packages/papers/main.py", line 1091, in main
    extractcmd(subp, o)
  File "/usr/lib/python3.11/site-packages/papers/main.py", line 546, in extractcmd
    print(extract_pdf_metadata(o.pdf, search_doi=not o.fulltext, search_fulltext=True, scholar=o.scholar, minwords=o.word_count, max_query_words=o.word_count, image=o.image))
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/papers/extract.py", line 208, in extract_pdf_metadata
    txt = pdfhead(pdf, maxpages, minwords, image=image)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/papers/extract.py", line 134, in pdfhead
    txt += readpdf(pdf, first=i, last=i)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/papers/extract.py", line 41, in readpdf
    sp.check_call(cmd)
  File "/usr/lib/python3.11/subprocess.py", line 413, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['pdftotext', '-f', '2', '-l', '2', 'paper.pdf', '/tmp/tmpaq14gv5.txt']' returned non-zero exit status 99.

Apparently, 'papers' is calling 'pdftotext' with arguments that make no sense. What is making 'papers' get confused about those arguments?

(Have I mentioned how much I like this program? Cheers!)

*https://www.nature.com/articles/s41567-020-0990-x

perrette commented 10 months ago

Hi @andr-agus,

may I ask which version of papers, pdftotext, operating system etc. you use? For me the paper you link works fine.

> pip install -U papers-cli
...
> papers --version
2.4
> pdftotext -h
pdftotext version 22.02.0
...
> papers extract s41567-020-0990-x.pdf
@article{Bong_2020,
    doi = {10.1038/s41567-020-0990-x},
    url = {https://doi.org/10.1038%2Fs41567-020-0990-x},
    year = 2020,
    month = {aug},
    publisher = {Springer Science and Business Media {LLC}},
    volume = {16},
    number = {12},
    pages = {1199--1205},
    author = {Kok-Wei Bong and An{\'{\i}}bal Utreras-Alarc{\'{o}}n and Farzad Ghafari and Yeong-Cherng Liang and Nora Tischler and Eric G. Cavalcanti and Geoff J. Pryde and Howard M. Wiseman},
    title = {A strong no-go theorem on the Wigner's friend paradox},
    journal = {Nature Physics}
}

Thanks for the good vibes. Mahé

perrette commented 10 months ago

PS:

> papers extract s41567-020-0990-x.pdf --debug
DEBUG:papers:read pdf page: 1
INFO:papers:pdftotext -f 1 -l 1 s41567-020-0990-x.pdf /tmp/tmp_fgh87__.txt
...
> pdftotext -f 1 -l 1 s41567-020-0990-x.pdf out1.txt  
... all fine ...
> pdftotext -f 2 -l 2 s41567-020-0990-x.pdf out2.txt
... all fine ... (this is the command from your log)

So I assume the issue is with your version of pdftotext. Is it too old or too new or ???
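
To narrow this down outside of papers, the same call can be reproduced in isolation while capturing pdftotext's own error message. The following is only a minimal sketch: `build_pdftotext_cmd` and `read_page` are hypothetical helper names, not functions from papers, and the only thing taken from the thread is the `pdftotext -f N -l N input output` command shape seen in the logs above.

```python
import subprocess
import tempfile
from pathlib import Path


def build_pdftotext_cmd(pdf, first, last, out):
    """Return the argument list for one pdftotext call over pages first..last.

    Mirrors the command shape seen in the traceback; hypothetical helper.
    """
    return ["pdftotext", "-f", str(first), "-l", str(last), str(pdf), str(out)]


def read_page(pdf, page):
    """Extract a single page and return (text, error).

    Exactly one of the two values is None. Unlike check_call, a failing
    pdftotext run does not raise here; its stderr is returned instead so
    the page-range complaint can be inspected directly.
    """
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "page.txt"
        cmd = build_pdftotext_cmd(pdf, page, page, out)
        proc = subprocess.run(cmd, capture_output=True, text=True)
        if proc.returncode != 0:
            return None, proc.stderr.strip()
        return out.read_text(errors="replace"), None
```

Calling `read_page("paper.pdf", 1)` and then `read_page("paper.pdf", 2)` on the affected machine should show which page request pdftotext rejects and with what message, independent of the papers version installed.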

perrette / papers

'papers extract' results in a call with nonsensical arguments to pdftotext #63