PDF Scrapper: Look-up existing .bib files during "BibTeX mode"

j-steinbach commented 3 years ago

At the moment, the PDF Scrapper looks up extracted cite-keys after it has finished its process and then sorts them into in-roam and in-bib.

It makes sense to compare the "BibTeX mode" references with the existing bibliographic files during the extraction process.

Ideally it would compare the extracted keys with existing keys, so that the user can immediately identify erroneous or duplicate keys. This is very helpful if the user relies on an external reference management tool (Zotero, papis, ...) that auto-generates keys in a different way.

Example (split pane view)

(extracted)               |    (global .bib file)
adam2003eve               |    adam2002eva
                          |    adam2003event
bert2004egon              |  
charles2997manson         |   
dagmar1002duck            |    dagmar1002duck_tales

For a more long-term perspective, there should also be ways to insert those references into the (global) .bib file(s).

myshevchuk commented 3 years ago

Have you checked the orb-autokey functionality? It lets you configure key generation to your liking and should be able to cover the format of the keys listed in the right pane. The keys initially presented in the buffer are generated by AnyStyle, not ORB, so there is no control over them. You can then press C-c C-u to generate keys according to orb-autokey-format. This can of course be automated, so that the generation happens immediately after the entries have been extracted. I will add an option for that.

Similarity search is a bit trickier than simple exact matching of the keys. I will need to check whether a general Elisp library for text similarity search exists. If not, implementing it from scratch wouldn't be viable, at least not now.
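
To illustrate the kind of lookup meant here, below is a minimal language-neutral sketch in Python, not ORB code. It uses the standard library's difflib; the function name `similar_keys`, the cutoff of 0.6 and the sample library keys are assumptions chosen for illustration only.

```python
from difflib import get_close_matches


def similar_keys(extracted_key, library_keys, cutoff=0.6, limit=3):
    """Return up to `limit` library keys that resemble `extracted_key`.

    `cutoff` is the minimum difflib similarity ratio (0..1); the value
    0.6 is an assumption, not a recommendation from this thread.
    """
    return get_close_matches(extracted_key, library_keys, n=limit, cutoff=cutoff)


# Keys taken from the split-pane example earlier in this issue.
library = ["adam2002eva", "adam2003event", "dagmar1002duck_tales"]

for key in ["adam2003eve", "bert2004egon", "dagmar1002duck"]:
    print(key, "->", similar_keys(key, library))
```

A key with no sufficiently similar counterpart simply yields an empty list, which is the case where the entry is probably new to the library.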

The split pane view should probably be a separate buffer invoked with its own command where those keys are listed in a table. This will also need to wait.

What can be done quickly and reliably is inserting a comment above the BibTeX entry with the key matched from the library.
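
A rough Python sketch of that idea, again only an illustration and not what ORB does: the helper name `annotate_matches` and the `% in library:` comment format are assumptions. It relies on exact key matches and on BibTeX ignoring text outside of entries.

```python
import re

# Matches the start of an entry such as "@article{adam2003event," and
# captures the cite key.
ENTRY_START = re.compile(r"^@\w+\{\s*([^,\s]+)\s*,", re.MULTILINE)


def annotate_matches(bibtex_text, library_keys):
    """Put a comment line above every entry whose key exists in the library."""
    known = set(library_keys)

    def add_comment(match):
        key = match.group(1)
        if key in known:
            # Hypothetical comment format; text outside entries is ignored by BibTeX.
            return f"% in library: {key}\n{match.group(0)}"
        return match.group(0)

    return ENTRY_START.sub(add_comment, bibtex_text)


sample = (
    "@article{adam2003event,\n  title = {X}\n}\n\n"
    "@book{bert2004egon,\n  title = {Y}\n}\n"
)
print(annotate_matches(sample, ["adam2003event"]))
```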

In any case, thank you very much for your interest and ideas!

j-steinbach commented 3 years ago

Yes, I am using (setq orb-autokey-format "%a%y%t") in ORB and [auth:lower][year][shorttitle1_1:lower] in Zotero (BBT), but sometimes they produce different titles. My guess: they "ignore" different words when creating the short title.

I know (and use) the C-c C-u generation. It does not always work; in those cases I have to "create" a title manually. Here it would be very helpful to know whether that paper is already in my database. Note: even if I know that a title is a duplicate/already exists, I still generate the key, so that I have a local, note-only list of extracted cite keys.

Would it not be possible to re-use the "fuzzy search" features from (I don't know the name; Helm? Ivy?)?

Also, I am not sure how much of a "similarity" search is needed. The first word (usually the author) should do the job. In my example, it lists the keys/entries from the global .bib file and puts them at the same "height" as the extracted key whose first letters match. There is no need to list the whole bibliography on the right. I had the "git compare sources" view in mind when I wrote double-pane. Maybe that helps.
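
To make the idea concrete, here is a small Python sketch of the alignment I have in mind. It is only an illustration, not a proposal for the actual implementation; the names `author_prefix`, `side_by_side` and `render` are made up, and it assumes the author part is the run of letters before the first digit of the key.

```python
import re


def author_prefix(key):
    """Leading letters of a cite key, e.g. 'adam' for 'adam2003eve'."""
    match = re.match(r"[^\W\d_]+", key)
    return match.group(0).lower() if match else ""


def side_by_side(extracted_keys, library_keys):
    """Pair each extracted key with the library keys sharing its author prefix."""
    rows = []
    for key in extracted_keys:
        prefix = author_prefix(key)
        candidates = [k for k in library_keys if prefix and author_prefix(k) == prefix]
        # The first candidate sits on the same row; further ones go below it.
        rows.append((key, candidates[0] if candidates else ""))
        rows.extend(("", extra) for extra in candidates[1:])
    return rows


def render(rows, width=26):
    """Format the rows as a two-column, plain-text split view."""
    return "\n".join(f"{left:<{width}}|    {right}".rstrip() for left, right in rows)


extracted = ["adam2003eve", "bert2004egon", "charles2997manson", "dagmar1002duck"]
library = ["adam2002eva", "adam2003event", "dagmar1002duck_tales", "zeta1999other"]
print(render(side_by_side(extracted, library)))
```

Library keys whose author matches nothing in the extracted list (the made-up `zeta1999other` here) are left out, so the right-hand side never grows to the whole bibliography.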

And I am enthusiastic about the project because I currently have to read lots of papers. The PDF Scrapper fits perfectly into my workflow and saves me lots of time - time I "give back" by reporting every minor annoyance :)

myshevchuk commented 3 years ago

> Yes, I am using (setq orb-autokey-format "%a%y%t") in ORB and [auth:lower][year][shorttitle1_1:lower]

I haven't used Zotero for years now; what exactly does shorttitle1_1 do? Is it the first two words separated by an underscore? If yes, then you can achieve the same in ORB with %t[2][][_]. See also orb-autokey-titlewords-ignore for a list of words in titles to be ignored in autokey generation. You can add words to or remove words from this variable to match BBT's behaviour.

I will look at how Helm and Ivy implement fuzzy search, and check what else is available in the Emacs ecosystem. There definitely should be something. The split pane view will require a major effort, so I can't promise it will be done quickly.

> And I am enthusiastic about the project because I currently have to read lots of papers. The PDF Scrapper fits perfectly into my workflow and saves me lots of time - time I "give back" by reporting every minor annoyance :)

Great! I'm glad you find it useful, and your input is very much appreciated.

j-steinbach commented 3 years ago

From the BBT docs it is simply

shorttitleN_M: The first N (default: 3) words of the title, apply capitalization to first M (default: 0) of those

Here are two keys I observed to get generated differently:


author: Ryan
year: 2000
title: Self-Determination Theory and the Facilitation of Intrinsic Motivation, Social Development, and Well-Being

gets turned into

Zotero/BBT: ryan2000selfdetermination
ORB Autokey: ryan2000self

and

author: Gartner
year: 2013
title: Gartner’s 2013 hype cycle for emerging technologies

gets turned into

Zotero/BBT: gartner2013gartner
ORB Autokey: gartner2013

myshevchuk commented 3 years ago

> gets turned into
>
> Zotero/BBT: ryan2000selfdetermination
> ORB Autokey: ryan2000self
>
> and
>
> author: Gartner
> year: 2013
> title: Gartner’s 2013 hype cycle for emerging technologies
>
> gets turned into
>
> Zotero/BBT: gartner2013gartner
> ORB Autokey: gartner2013

This can be considered a bug; I filed a new issue for it.

org-roam / org-roam-bibtex, issue #144