pdf-extract instead of pdfreader

inukshuk / anystyle-cli

AnyStyle Command Line Interface

BSD 2-Clause "Simplified" License

pdf-extract instead of pdfreader #6

Open retrography opened 5 years ago

retrography commented 5 years ago

I don't know if you have noticed this project or not: https://github.com/blusquare/pdfextract

It is an abandoned CrossRef project, but this fork still works well for extracting references. The gem does structural analysis on the PDF file, and thus needs literally no input from the user in order to detect the references. Probably a better match than using the raw pdfreader.

The project is MIT-licensed, and doesn't impose restrictions on derivative work.

pdf-extract extract --references glas.pdf > glas.xml
sed -E -e 's/<[^<>]+>//g' -e 's/^ +//' glas.xml > glas.txt
anystyle -f bib find glas.txt

inukshuk commented 5 years ago

I was not aware of that project, no. You're right: the finder component currently works on plain text, so we're losing a lot of valuable information (font styles, metrics, exact positioning). Using rich text information was supposed to be a next step (if necessary at all). I'd have to look at pdfextract more closely but, yes, it may be a great fit. Meanwhile, it's cool that you can still plug it into the parser module as above.

retrography commented 5 years ago

The naming is a little bit confusing, but this one is actually pdf-extract, not pdfextract (I mean the gem name, not the repo name). It doesn't give you rich text, but it automatically detects the reference pages and eliminates the margins with no user intervention. The output is an XML file, with each reference enclosed as plain text in an XML tag. AnyStyle really likes the output!
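
For anyone who wants to reuse the tag-stripping step, here is a minimal sketch of it as a shell filter. It assumes, as described above, that each reference sits on its own line inside a single XML tag; that layout has not been checked against pdf-extract's actual output. The function name xml_to_refs is made up for illustration. Compared with the one-liner earlier in the thread, it also decodes the three most common XML entities and drops the blank lines left behind by structural tags.

```shell
#!/bin/sh
# Sketch only: turn XML like pdf-extract's output into one plain-text
# reference per line. Assumes every reference is on a single line inside
# one XML tag (an assumption taken from the description in this thread).

xml_to_refs() {
  # 1. remove every <tag> (including the <?xml ...?> declaration)
  # 2. decode &lt; &gt; &amp; (ampersand last, so "&amp;lt;" stays "&lt;")
  # 3. trim leading and trailing whitespace
  # 4. delete lines that are now empty
  sed -E \
    -e 's/<[^<>]+>//g' \
    -e 's/&lt;/</g' \
    -e 's/&gt;/>/g' \
    -e 's/&amp;/\&/g' \
    -e 's/^[[:space:]]+//' \
    -e 's/[[:space:]]+$//' \
    -e '/^$/d'
}

# Hypothetical usage, mirroring the commands posted above:
#   pdf-extract extract --references glas.pdf | xml_to_refs > glas.txt
#   anystyle -f bib find glas.txt
```

This is still a regex filter rather than a real XML parser, so a reference that wraps across several lines would be split into several lines of text; a proper parser would be the safer choice if that turns out to happen.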