KOLANICH opened this issue 6 years ago
Everything related to metadata and file names is handled quite well by Zotero.
I have already tried Zotero and was deeply dissatisfied with it.
In JabRef, we have the basic helper classes for that ready. The linked Docear is a fork of an old JabRef version and is currently unmaintained. Some of our developers don't like Java either. However, it is IMHO hard to port the functionality to another language, so I would propose diving into the code.
Among the developers, we had a hard discussion about point 8. I wrote code to parse author data from the title page; it worked only for IEEE and LNCS papers. Currently, we have decided to use GROBID in JabRef, as it provides the highest recognition rate (refs https://github.com/koppor/jabref/issues/327).
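For context, GROBID runs as a REST service and returns the parsed header as TEI XML. Below is a minimal stdlib sketch of lifting the title and authors out of such a response; the element paths reflect what `processHeaderDocument` typically emits and may need adjusting for real responses:

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

def parse_grobid_header(tei_xml: str) -> dict:
    """Pull the title and author names out of a GROBID TEI header response."""
    root = ET.fromstring(tei_xml)
    title = root.findtext(".//tei:titleStmt/tei:title", default="", namespaces=TEI_NS)
    authors = []
    for pers in root.findall(".//tei:sourceDesc//tei:author/tei:persName", TEI_NS):
        forename = pers.findtext("tei:forename", default="", namespaces=TEI_NS)
        surname = pers.findtext("tei:surname", default="", namespaces=TEI_NS)
        authors.append(" ".join(part for part in (forename, surname) if part))
    return {"title": title.strip(), "authors": authors}
```

The result could then be merged into the PDF's XMP metadata in a later step.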
Please do try pdf.js and pdfplumber.
Thank you for informing me about pdfplumber; it should definitely be helpful.
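pdfplumber in particular exposes per-word font sizes, which is enough for a first stab at the font-size-based header detection idea in the project description. A hedged sketch operating on word dicts like those returned by `page.extract_words(extra_attrs=["size"])` — the 1.2× threshold is an arbitrary assumption, not a tested value:

```python
from statistics import median

def guess_headings(words, ratio=1.2):
    """Pick out words set notably larger than the body text.

    `words` is a list of dicts with "text" and "size" keys, e.g. what
    pdfplumber's page.extract_words(extra_attrs=["size"]) returns.
    """
    if not words:
        return []
    # Body text dominates a typical page, so the median size approximates it.
    body_size = median(w["size"] for w in words)
    return [w["text"] for w in words if w["size"] > body_size * ratio]
```

A real implementation would group words into lines first and also weigh boldness and position, but the median-size trick is a cheap baseline.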
A couple of extra links for mining PDFs from a security perspective:
@ProfessorNavigator
Project description
I have a lot of PDFs. Most of them are full of garbage like embedded fonts, JavaScript, whitespace in XMP, etc. That's why they take up too much space on my HDD and take too long to load.
So we need a tool:

1. parsing PDF files in a folder
2. removing all the unneeded elements, like JavaScript or redundant whitespace in XMP
4. removing all the fonts which are usually installed in systems, such as Times New Roman or Calibri
5. keeping a database of all the font files; detecting different versions of the same font, combining them into a single font file installed into the system, and removing the font from the files
6. fixing all the links in the documents according to HTTPSEverywhere rules
7. fixing codepage issues
8. extracting metadata, like authors, date, publisher, DOI, and arXiv ID, from the PDF content and adding it to the file's metadata
9. visiting the page behind the DOI link (or, in the absence of a DOI link, another link classified as pointing to the paper's page on the publisher's website), extracting BibTeX info from it, and embedding it into the PDF metadata
10. getting metadata from other sources, like Google Scholar
11. merging and normalizing the metadata from different sources smartly
12. detecting headers in the PDF text; for example, they usually have a larger font than the majority of the text
13. detecting and parsing the table of contents in the text
14. matching the parsed table of contents to headers and creating bookmarks
15. detecting hierarchy in bookmarks, for example by section numbers
16. merging bookmark trees: the generated one and the one already present
17. detecting page numbers in headers/footers and making them equal to the ones used in the PDF
18. compressing the contents with the best compression available
19. detecting named entities, like chemical compounds, and creating highlighting annotations for them
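Points 13–15 are mostly plain string work. A hedged sketch of inferring bookmark hierarchy from section numbers, assuming headings shaped like `2.1.3 Title`; real documents would need more patterns (Roman numerals, "Chapter 3", appendix letters, etc.):

```python
import re

# Matches a leading dotted section number, e.g. "2", "2.1", "2.1.3",
# optionally followed by a trailing dot, then the heading title.
SECTION_RE = re.compile(r"^(\d+(?:\.\d+)*)\.?\s+(.*)")

def bookmark_depth(heading: str):
    """Return (depth, title) for a numbered heading, or (None, heading) if unnumbered.

    Depth is the number of components in the section number:
    "2 Intro" -> 1, "2.1 Details" -> 2, "2.1.3 Edge cases" -> 3.
    """
    m = SECTION_RE.match(heading.strip())
    if not m:
        return None, heading.strip()
    number, title = m.groups()
    return number.count(".") + 1, title
```

With depths in hand, nesting the flat heading list into a bookmark tree is a simple stack walk.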
Relevant Technology
Who is this for
Any Python dev who is willing to spend their own priceless, irrecoverable time on such a large project.