metachris / pdfx

Extract text, metadata and references (pdf, url, doi, arxiv) from PDF. Optionally download all referenced PDFs.
http://www.metachris.com/pdfx
Apache License 2.0

Detect metadata from Arxiv Documents #52

Open dufferzafar opened 2 years ago

dufferzafar commented 2 years ago

PDFs from arXiv don't carry title, author, or similar metadata in their document info:

➜ pdfx https://arxiv.org/pdf/1911.02782.pdf
Document infos:
- CreationDate = D:20200708010812Z
- Creator = LaTeX with hyperref package
- ModDate = D:20200708010812Z
- PTEX.Fullbanner = This is pdfTeX, Version 3.14159265-2.6-1.40.17 (TeX Live 2016) kpathsea version 6.2.2
- Pages = 15
- Producer = pdfTeX-1.40.17
- Trapped = False

References: 77
- URL: 71
- ARXIV: 4
- PDF: 2

PDF References:
- http://www.lrec-conf.org/proceedings/lrec2008/pdf/445_paper.pdf
- http://ceur-ws.org/Vol-2345/paper2.pdf

Perhaps we could use arxiv.py to query arXiv and fill in the missing metadata?
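A rough sketch of what that lookup might look like. It is not existing pdfx code: it talks to the public arXiv API Atom endpoint directly with the standard library instead of going through arxiv.py, which would wrap a similar query. The helper names (`arxiv_id_from_url`, `parse_atom_entry`, `fetch_arxiv_metadata`) and the `Title` / `Author` keys are made up for illustration, and the regex only handles new-style IDs such as 1911.02782.

```python
import re
import urllib.request
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
ARXIV_API = "http://export.arxiv.org/api/query?id_list={}"

# New-style arXiv IDs (e.g. 1911.02782), optionally followed by a version like v2.
ARXIV_ID_RE = re.compile(r"arxiv\.org/(?:abs|pdf)/(\d{4}\.\d{4,5})(?:v\d+)?", re.I)


def arxiv_id_from_url(url):
    """Return the arXiv ID embedded in an abs/pdf URL, or None."""
    match = ARXIV_ID_RE.search(url)
    return match.group(1) if match else None


def parse_atom_entry(atom_xml):
    """Pull title and authors out of an arXiv API Atom response."""
    root = ET.fromstring(atom_xml)
    entry = root.find(ATOM + "entry")
    if entry is None:
        return {}
    # Titles in the feed may be wrapped across lines; collapse the whitespace.
    title = " ".join((entry.findtext(ATOM + "title") or "").split())
    names = [a.findtext(ATOM + "name") for a in entry.findall(ATOM + "author")]
    return {"Title": title, "Author": ", ".join(n for n in names if n)}


def fetch_arxiv_metadata(url):
    """Hypothetical helper: query the arXiv API for the paper behind `url`."""
    arxiv_id = arxiv_id_from_url(url)
    if arxiv_id is None:
        return {}
    with urllib.request.urlopen(ARXIV_API.format(arxiv_id), timeout=10) as resp:
        return parse_atom_entry(resp.read())
```

Calling `fetch_arxiv_metadata("https://arxiv.org/pdf/1911.02782.pdf")` would need network access. The returned dict could then be merged into the "Document infos" section only for keys the PDF itself does not provide, so embedded metadata still wins when it exists.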