heuristicus / paper-utils

Utilities for document similarity and reference extraction for research papers
MIT License

paper-utils

Both utilities expect plain-text files as input. If you have a directory of PDF files, you can convert them with pdftotext on Linux. Use the -raw switch so that text set in two columns is not garbled. To convert every PDF file under the current directory (find also descends into subdirectories) to text, writing each output next to its source with .txt in place of the .pdf extension, use

find . -type f -name '*.pdf' -exec pdftotext -raw {} \;
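
If you would rather drive the conversion from a script, the same step can be sketched in Python. This is only an illustration and not part of paper-utils: the names pdftotext_command and convert_directory are made up for this example, and it assumes pdftotext is on your PATH. Unlike the find command, it only looks at the top level of the given directory.

```python
import subprocess
from pathlib import Path


def pdftotext_command(pdf_path):
    """Build the pdftotext call for one file (hypothetical helper).

    With a single file argument, pdftotext writes its output next to the
    input, replacing the .pdf extension with .txt.
    """
    return ["pdftotext", "-raw", str(pdf_path)]


def convert_directory(directory=".", run=subprocess.run):
    """Convert every *.pdf directly inside `directory`; return the PDFs handled."""
    converted = []
    for pdf_path in sorted(Path(directory).glob("*.pdf")):
        # check=True raises CalledProcessError if pdftotext fails on a file.
        run(pdftotext_command(pdf_path), check=True)
        converted.append(pdf_path)
    return converted
```

Calling convert_directory(".") converts the PDFs in the current directory. The run parameter exists only so the function can be exercised without pdftotext installed.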

Alternatively, you can use pdfminer, a Python utility that should give similar results.
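
A rough sketch of the pdfminer route follows. It assumes the maintained pdfminer.six fork is installed (pip install pdfminer.six) and uses its extract_text helper from pdfminer.high_level. The function name pdfs_to_text is hypothetical and not part of paper-utils. Output for two-column papers may differ from pdftotext -raw, so it is worth checking a few files by hand.

```python
from pathlib import Path


def pdfs_to_text(directory=".", extract=None):
    """Write a .txt file next to each *.pdf directly inside `directory`.

    `extract` is any callable taking a PDF path and returning its text.
    It defaults to pdfminer.six's extract_text (assumed to be installed).
    """
    if extract is None:
        # Imported lazily so this module loads even without pdfminer.six.
        from pdfminer.high_level import extract_text as extract

    written = []
    for pdf_path in sorted(Path(directory).glob("*.pdf")):
        txt_path = pdf_path.with_suffix(".txt")
        txt_path.write_text(extract(str(pdf_path)), encoding="utf-8")
        written.append(txt_path)
    return written
```

Calling pdfs_to_text(".") would then leave paper.txt beside paper.pdf, matching the naming that pdftotext uses.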