open-source-ideas / ideas


An open-source pdf sanitizer / compressor / processor / info extractor / metadata creator #46

Open KOLANICH opened 6 years ago

KOLANICH commented 6 years ago

Project description

I have a lot of PDFs. Most of them are full of garbage like embedded fonts, JavaScript, redundant whitespace in XMP, etc. That's why they take up too much space on my HDD and take too long to load.

So we need a tool that does the following:

  1. Parses the PDF files in a folder.
  2. Removes all unneeded elements, such as JavaScript or redundant whitespace in XMP.
  4. Removes all fonts that are usually installed on systems, such as Times New Roman or Calibri.
  5. Keeps a database of all font files: detects different versions of the same font, combines them into a single font file installed on the system, and removes the font from the PDF files.
  6. Fixes all links in the documents according to HTTPS Everywhere rules.
  7. Fixes codepage issues.
  8. Extracts metadata from the PDF content (authors, date, publisher, DOI, arXiv ID) and adds it to the PDF metadata.
  9. Visits the page behind the DOI link or, in the absence of a DOI link, another link classified as pointing to the paper's page on the publisher's website, extracts BibTeX info from it, and embeds it into the PDF metadata.
  10. Gets metadata from other sources, such as Google Scholar.
  11. Merges and normalizes the metadata from the different sources smartly.
  12. Detects headers in the PDF text. For example, they usually use a larger font than the majority of the text.
  13. Detects and parses the table of contents in the text.
  14. Matches the parsed contents to headers and creates bookmarks.
  15. Detects the hierarchy in the bookmarks, for example by section numbers.
  16. Merges the bookmark trees: the generated one and the one already present.
  17. Detects page numbers in headers/footers and makes them equal to the ones used in the PDF.
  18. Compresses the contents with the best compression available.
  19. Detects named entities, such as chemical compounds, and creates highlighting annotations for them.
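To make the bookmark hierarchy point (15) a bit more concrete, here is a rough sketch of how nesting could be inferred from section numbers. It is only an illustration: the `Bookmark` class, `heading_depth` and `build_outline` are hypothetical names, and it assumes headings arrive as `(title, page)` pairs and that a prefix like `2.1.3` means depth 3. Unnumbered headings are treated as top-level, which will be wrong for some documents.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field

# Leading section number such as "2", "2.1" or "2.1.3." followed by the title.
# Assumption: dotted decimal numbering; other schemes (A.1, IV, ...) are ignored.
_NUMBER = re.compile(r"^\s*(\d+(?:\.\d+)*)\.?\s+(.*)$")


@dataclass
class Bookmark:
    """Hypothetical outline node: a title, a target page and nested children."""
    title: str
    page: int
    children: list[Bookmark] = field(default_factory=list)


def heading_depth(title: str) -> int:
    """Depth is the number of dotted components; unnumbered titles get depth 1."""
    m = _NUMBER.match(title)
    return len(m.group(1).split(".")) if m else 1


def build_outline(headings):
    """Turn a flat list of (title, page) pairs into a tree of Bookmark nodes."""
    roots = []
    stack = []  # (depth, node) for the chain of currently open ancestors
    for title, page in headings:
        depth = heading_depth(title)
        node = Bookmark(title.strip(), page)
        # Close every open heading that is at the same depth or deeper.
        while stack and stack[-1][0] >= depth:
            stack.pop()
        if stack:
            stack[-1][1].children.append(node)
        else:
            roots.append(node)
        stack.append((depth, node))
    return roots
```

A heading that skips a level (say `2.1.1` directly under `2`) is attached to the nearest shallower ancestor rather than rejected, which seems like the more forgiving choice for scanned or badly structured papers.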

Relevant Technology

Who is this for

Any Python dev who is willing to spend their own priceless, irrecoverable time on such a large project.

Bisaloo commented 6 years ago

Everything related to metadata and file names is handled quite well by Zotero.

KOLANICH commented 6 years ago

I have already tried Zotero and was deeply dissatisfied with it.

koppor commented 6 years ago

In JabRef, we already have the basic helper classes for that. The linked Docear is a fork of an old JabRef version and is currently unmaintained. Some of our developers don't like Java, either. However, it is IMHO hard to port the functionality to another language. So I would suggest diving into the code.

Among the developers, we had a hard discussion on point 8. I wrote code to parse author data from the title page. It worked only for IEEE and LNCS papers. We have now decided to use GROBID in JabRef, as it provides the highest recognition rate (refs https://github.com/koppor/jabref/issues/327).
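Identifiers are the easier part of point 8, since DOIs and arXiv IDs are far more regular than author names. A minimal sketch of pulling them out of already-extracted first-page text follows. This is not what JabRef or GROBID do; `extract_identifiers` and both patterns are assumptions made for illustration, and the arXiv pattern only covers the newer `YYMM.NNNNN` scheme.

```python
import re

# Assumed patterns: "10.<registrant>/<suffix>" for DOIs and new-style arXiv IDs.
DOI_RE = re.compile(r"\b(10\.\d{4,9}/[-._;()/:A-Za-z0-9]+)")
ARXIV_RE = re.compile(r"arXiv:\s*(\d{4}\.\d{4,5})(v\d+)?", re.IGNORECASE)


def extract_identifiers(text):
    """Return a dict with the first DOI and/or arXiv ID found in the text."""
    ids = {}
    doi = DOI_RE.search(text)
    if doi:
        # Sentence punctuation often sticks to the match; strip it.
        # Caveat: this would also damage a DOI that really ends in ")".
        ids["doi"] = doi.group(1).rstrip(".,;)")
    arxiv = ARXIV_RE.search(text)
    if arxiv:
        ids["arxiv"] = arxiv.group(1)  # version suffix (v1, v2, ...) is dropped
    return ids
```

For example, `extract_identifiers("See https://doi.org/10.1000/xyz123.")` yields `{"doi": "10.1000/xyz123"}`. Once a DOI is known, points 9 to 11 have a stable key to look up and merge metadata on.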

wanghaisheng commented 6 years ago

pls do try pdf.js and pdfplumber

KOLANICH commented 6 years ago

Thank you for telling me about pdfplumber; it should definitely be helpful.
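For the header detection idea (point 12), a font-size heuristic could look roughly like the sketch below. It works on plain per-character dictionaries with `text`, `size`, `top` and `x0` keys, which is the shape pdfplumber is assumed to expose through `page.chars`. The function name `find_heading_lines` and the `ratio` threshold of 1.2 are made up for illustration and would need tuning on real papers.

```python
from collections import Counter, defaultdict


def find_heading_lines(chars, ratio=1.2):
    """Return the text of lines whose font size is clearly above body size.

    `chars` is a list of dicts with "text", "size", "top" and "x0" keys
    (assumed to match the character dicts produced by pdfplumber).
    """
    if not chars:
        return []
    # The most common size (rounded to 0.1 pt) is taken to be the body text.
    body_size = Counter(round(c["size"], 1) for c in chars).most_common(1)[0][0]

    # Group characters into lines by their rounded vertical position.
    lines = defaultdict(list)
    for c in chars:
        lines[round(c["top"])].append(c)

    headings = []
    for top in sorted(lines):
        line = sorted(lines[top], key=lambda c: c["x0"])
        # Use the dominant size in the line so one odd glyph does not decide.
        line_size = Counter(round(c["size"], 1) for c in line).most_common(1)[0][0]
        if line_size >= body_size * ratio:
            headings.append("".join(c["text"] for c in line).strip())
    return headings
```

With pdfplumber installed, the input would presumably come from something like `pdfplumber.open(path).pages[0].chars`. Bold headings set at body size will be missed by this heuristic, so font names would be the next signal to look at.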

ghost commented 5 years ago

Couple extra links for mining PDFs from a security perspective:

  1. Didier Stevens's excellent pdf-parser.py for low-level probing.
     i. Some folks are building automated workflows for malware analysis of PDFs using them.
  2. Though not Python, Phil Harvey's exiftool is the most comprehensive metadata extractor I've seen.
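
As a very rough first pass before reaching for those tools, the raw bytes of a PDF can be scanned for name objects that usually indicate active content, which also ties back to the "remove JavaScript" step of the proposal. This sketch is not pdf-parser.py and does not parse the file at all; `count_suspicious_names` and the list of names are assumptions, and it will miss names hidden in compressed object streams or written with `#xx` hex escapes.

```python
import re

# Names that commonly accompany scripts, auto-run actions or attachments.
SUSPICIOUS_NAMES = (
    b"/JavaScript",
    b"/JS",
    b"/OpenAction",
    b"/AA",
    b"/Launch",
    b"/EmbeddedFile",
)


def count_suspicious_names(data: bytes) -> dict:
    """Count occurrences of each suspicious name in raw, unparsed PDF bytes."""
    counts = {}
    for name in SUSPICIOUS_NAMES:
        # Require a non-alphanumeric character (or end of data) after the name,
        # so that "/JS" does not also match a longer name such as "/JSON".
        pattern = re.escape(name) + rb"(?![A-Za-z0-9])"
        counts[name.decode("ascii")] = len(re.findall(pattern, data))
    return counts
```

A file where every count is zero is not proven clean, but a non-zero `/JS` or `/OpenAction` count is a cheap signal that the file deserves a closer look with a real parser.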

KOLANICH commented 1 year ago

@ProfessorNavigator