An open-source PDF sanitizer / compressor / processor / info extractor / metadata creator

KOLANICH commented 6 years ago

Project description

I have a lot of PDFs. Most of them are full of garbage like embedded fonts, JavaScript, whitespace in XMP, etc. That's why they take up too much space on my HDD and take too long to load.

Some of them have bad filenames. The original file names of scientific papers are usually gibberish, so I have to rename them manually.
Sometimes the compression used is not optimal.
Some of them lack metadata completely, which makes them hard to use with tools like JabRef.
Sometimes the publisher's website contains a BibTeX reference sufficient to populate the metadata. Other metadata may be extracted from the text.
Some of them don't have built-in bookmarks, so it's hard to navigate them.
Some of them have bookmarks, but these change the page display mode.
Sometimes the text is in the wrong codepage: it looks right because of an over/underlay, but when you copy it, it's gibberish.
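To illustrate the codepage problem, here is a minimal sketch. It assumes the gibberish comes from bytes in one codepage being decoded with another (cp1251 read as latin-1 is just an example pair). Real PDFs may instead have broken font-to-Unicode mappings, which this does not handle. The function name `repair_mojibake` is hypothetical.

```python
def repair_mojibake(garbled, wrong="latin-1", right="cp1251"):
    """Undo a wrong decode: re-encode with the wrong codec, decode with the right one.

    Both codec names are assumptions; a real tool would have to guess them,
    for example with a charset detector.
    """
    try:
        return garbled.encode(wrong).decode(right)
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text does not fit the assumed codec pair; leave it unchanged.
        return garbled


original = "Привет"
# Simulate what copying from a badly encoded PDF might yield.
garbled = original.encode("cp1251").decode("latin-1")
repaired = repair_mojibake(garbled)
```

The round trip only works when the wrong decode lost no bytes, which holds for latin-1 because it maps every byte value to a character.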

So we need a tool:

1. parsing PDF files in the folder
2. removing all the unneeded elements like JavaScript or redundant whitespace in XMP
4. removing all the fonts which are usually installed on systems, such as Times New Roman or Calibri
5. keeping a database of all the font files. Detection of different versions of the same font, combining them into a single font file installed into the system and removing the font from the files
6. fixing all the links in the documents according to HTTPSEverywhere rules
7. fixing codepage issues
8. extraction of metadata from the PDF content, like authors, date, publisher, DOI, arXiv ID, and adding it to the metadata
9. visiting the page by the DOI link or, in the absence of a DOI link, another link classified as a link to the paper's page on the publisher's website, extracting BibTeX info from it and embedding it into the PDF metadata
10. getting metadata from other sources like Google Scholar
11. merging and normalizing the metadata from different sources smartly
12. detection of headers in the PDF text. For example, they usually have a larger font than the majority of the text
13. detection and parsing of the table of contents in the text
14. matching the parsed contents to headers and creating bookmarks
15. detecting hierarchy in bookmarks, for example, by numbers
16. merging bookmark trees: the generated one and the already present one
17. detecting page numbers in headers/footers and making them equal to the ones used in the PDF
18. compressing the contents with the best compression available
19. detecting named entities like chemical compounds and creating highlighting annotations for them
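Two of these steps can be sketched with the standard library alone: pulling a DOI and an arXiv ID out of already extracted text (step 8) and deriving bookmark depth from heading numbers (step 15). This is only a rough sketch. The regular expressions and the function names `extract_ids` and `bookmark_levels` are assumptions for illustration, and real papers will need more patterns.

```python
import re

# Assumed patterns; real DOIs and arXiv IDs come in more variants than these.
DOI_RE = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")
ARXIV_RE = re.compile(r"\barXiv:\s*(\d{4}\.\d{4,5}(?:v\d+)?)", re.IGNORECASE)
# A heading such as "2.1 Related work" or "3. Methods".
HEADING_RE = re.compile(r"^(\d+(?:\.\d+)*)\.?\s+(\S.*)$")


def extract_ids(text):
    """Return the first DOI and arXiv ID found in the text, or None for each."""
    doi = DOI_RE.search(text)
    arxiv = ARXIV_RE.search(text)
    return {
        # Strip punctuation that usually belongs to the sentence, not the DOI.
        "doi": doi.group(0).rstrip(".,;)") if doi else None,
        "arxiv": arxiv.group(1) if arxiv else None,
    }


def bookmark_levels(headings):
    """Map numbered headings to (depth, title) pairs; unnumbered lines are skipped."""
    levels = []
    for line in headings:
        match = HEADING_RE.match(line.strip())
        if match:
            depth = match.group(1).count(".") + 1
            levels.append((depth, match.group(2)))
    return levels


ids = extract_ids("See https://doi.org/10.1000/xyz123. Also arXiv:1706.03762v5")
levels = bookmark_levels(["1 Introduction", "1.1 Background", "2. Methods", "References"])
```

Numbering alone is a weak signal: a body line starting with a number would also match, so a real tool would presumably combine this with the font-size heuristic from step 12.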

Relevant Technology

https://github.com/kermitt2/grobid - an awesome lib!
python
https://github.com/mstamy2/PyPDF2
https://github.com/KOLANICH/pdfminer.six (a fork of https://github.com/pdfminer/pdfminer.six, there are some problems with accepting my PRs)
chardet
machine-learning frameworks: xgboost, keras, tensorflow
requests and bs4
Docear has some code retrieving metadata from Google Scholar
Natasha
https://github.com/pskryuchkov/ilovescience - some parsers of metadata from PDF contents
https://blog.didierstevens.com/programs/pdf-tools/ and https://cincan.io/tools/pdf-tools/ (thanks to @efx for that)

Who is this for

Any Python dev who is willing to spend their own priceless, irrecoverable time on such a large project.

Bisaloo commented 6 years ago

Everything related to metadata and file names is handled quite well by Zotero.

KOLANICH commented 6 years ago

I have already tried Zotero and was deeply dissatisfied with it.

koppor commented 6 years ago

In JabRef we have the basic helper classes for that ready. The linked Docear is a fork of an old JabRef version and is currently unmaintained. Some of our developers don't like Java, either. However, it is IMHO hard to port the functionality to another language, so I would propose diving into the code.

Among the developers, we had a hard discussion about point 8. I wrote code to parse author data from the title page. It worked only for IEEE and LNCS papers. For now, we have decided to use GROBID in JabRef, as it provides the highest recognition rate (refs https://github.com/koppor/jabref/issues/327).

wanghaisheng commented 6 years ago

pls do try pdf.js and pdfplumber

KOLANICH commented 6 years ago

Thank you for informing me about pdfplumber; it should definitely be helpful.

ghost commented 5 years ago

A couple of extra links for mining PDFs from a security perspective:

Didier Stevens's excellent pdf-parser.py for low-level probing. Some folks are building automated workflows for malware analysis of PDFs using his tools
Though not Python, Phil Harvey's exiftool is the most comprehensive metadata extractor I've seen

KOLANICH commented 1 year ago

@ProfessorNavigator
