tmalsburg / helm-bibtex

Search and manage bibliographies in Emacs
GNU General Public License v2.0

support "Papers" storage format? #90

Open jowens opened 8 years ago

jowens commented 8 years ago

Delighted to see helm-bibtex aim for working with some papers management systems (the readme notes "JabRef, Zotero, and Mendeley"). http://papersapp.com/ is a common/popular application. It does not store all PDFs in a single folder but instead in a bunch of subfolders: Library.papers3/Files/??/foo.pdf, where ?? is a two-digit hex number and foo is what looks like a unique ID (e.g., "0008E889-3F87-4606-8842-7ABDD13A6AC7.pdf") (where the first two characters of the PDF name are ??).
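The layout described above can be sketched in a few lines. This is only an illustration of the observed convention (a subfolder named after the first two characters of the PDF's file name); the function name `papers_pdf_path` is hypothetical, and nothing here is taken from Papers itself.

```python
from pathlib import PurePosixPath


def papers_pdf_path(library_root: str, pdf_name: str) -> PurePosixPath:
    """Build <library_root>/Files/<first two chars of name>/<pdf_name>.

    Assumption: the subfolder is always the first two characters of the
    PDF's file name, as observed in the example in this thread.
    """
    if len(pdf_name) < 2:
        raise ValueError("PDF name is too short to derive a subfolder")
    subfolder = pdf_name[:2]
    return PurePosixPath(library_root) / "Files" / subfolder / pdf_name


print(papers_pdf_path("Library.papers3",
                      "0008E889-3F87-4606-8842-7ABDD13A6AC7.pdf"))
```

Note that this only maps a known PDF name to its location. It does not say how to get from a BibTeX entry to the PDF name, which is the open question here.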

First, it would be valuable if helm-bibtex-library-path was able to look two layers deep instead of one.

Second, if we could understand Papers's filenaming convention, helm-bibtex could perhaps map between bib entries and citekeys automatically, even without having to explicitly store the filename in the bibtex file.

@alexandergriekspoor @extracts @PapersGenius can you help us out here with the second?

tmalsburg commented 8 years ago

In principle I'm happy to add features that make it easier to use third-party software. For example, I implemented support for the file field although I personally don't need it. However, I must say I have very little interest in investing any amount of time in features that only support commercial, closed-source software, especially when that software locks in the data by using cryptic naming schemes. I also don't feel comfortable about adding features that benefit only individual bibliography managers. There are so many bibliography managers and we can hardly support all their quirks.

If anyone else is interested in working on support for Papers, I would consider merging a PR, but note that I would only include code that does not impair the experience for other users. The biggest concern will be the speed of loading the library. Some people use helm-bibtex with rather big bibliographies (e.g., crypto.bib) and reading those libraries is already a bit slow. If we have to do additional work to find PDFs, the time for loading those libraries might become unbearable and helm-bibtex would be useless.

Another thought: I think this should rather be a feature request for the developers of Papers. Other bibliography managers have converged on using the file field to link PDFs to entries and Papers should ideally do the same. (If Papers used the file field, the directory structure and naming conventions would be a non-issue.)

jowens commented 8 years ago

Hi @tmalsburg, thank you for your kind (and quick) reply. I am sympathetic to your views on closed-source software, and agree with your high-level goals.

Papers's bibliographies do contain a file field, e.g.:

file = {{9B1D4DB3-509B-4A1E-9C60-E908FA901731.pdf:/Users/jowens/Dropbox/Library.papers3/Files/9B/9B1D4DB3-509B-4A1E-9C60-E908FA901731.pdf:application/pdf}}
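For reference, the value above appears to follow a `description:path:mimetype` layout wrapped in braces. A minimal parsing sketch, assuming exactly one attachment per field and that neither the description nor the MIME type contains a colon (`FileField` and `parse_file_field` are hypothetical names):

```python
from typing import NamedTuple


class FileField(NamedTuple):
    description: str
    path: str
    mime_type: str


def parse_file_field(value: str) -> FileField:
    """Split a '{{description:path:mimetype}}' value into its parts.

    Assumption: a single attachment per field. The first and the last
    colon are used as separators, so colons inside the path survive.
    """
    inner = value.strip().strip("{}")
    description, rest = inner.split(":", 1)
    path, mime_type = rest.rsplit(":", 1)
    return FileField(description, path, mime_type)


field = parse_file_field(
    "{{9B1D4DB3-509B-4A1E-9C60-E908FA901731.pdf:"
    "/Users/jowens/Dropbox/Library.papers3/Files/9B/"
    "9B1D4DB3-509B-4A1E-9C60-E908FA901731.pdf:application/pdf}}"
)
print(field.path)
```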

but (a) I usually manage my bibliography in emacs as opposed to letting Papers own it, and (b) since it's a shared bibliography among my whole research group, I'd rather not put a bunch of me-specific fields in the bibliography; it would seem preferable from a management view if the filename was derivable from the normal bibliography fields (e.g., like using the citekey).

Again, thanks.

tmalsburg commented 8 years ago

I'm not sure if I fully understand your workflow and requirements but it seems that the file field is not going to be useful if your group maintains a shared BibTeX file because the content of the file field will only be correct on one user's machine.

Perhaps the best solution would be if everyone in your group would name PDFs following the scheme bibtexkey.pdf. Helm-bibtex will then find the PDFs and more importantly this naming is completely transparent such that you can always fall back to using a file manager for accessing the PDF. Of course, this is not a satisfying solution for people using other bibliography managers which may not find the PDFs, but I don't see a better alternative.

Perhaps have a look at Mendeley, which has support for shared libraries. We used it in one of my previous research groups but it wasn't very successful because the shared bib got very messy with lots of duplicates and incomplete or incorrect entries. Personally, I have given up on the idea of a shared library. The only successful example of a shared library that I'm aware of is CryptoBib which is very clean. However, it is evident that building this bibliography was a major effort, something for which I wouldn't have the time.

If there is a way to generate the Papers path to the PDF from the BibTeX fields, we can discuss possibilities to support that in helm-bibtex, but, honestly, I doubt that it's possible. This naming scheme rather looks as if it was designed to prevent precisely that. Also, it seems that you would need support for several ways to find PDFs because not all entries in your shared library might follow the Papers scheme. If correct, we will have the problem that loading of the library might be slow because all schemes have to be tried for every entry in order to make sure that PDFs are reliably found.

jowens commented 8 years ago

@tmalsburg Your suggestions are very much appreciated and spot on. I do think you understand our workflow. :) The proximate issue where your help might benefit us is helm-bibtex-library-path. Currently this is a list of directories, where each directory contains PDFs. Papers instead uses a directory structure that is two layers deep, and adding 256 directories to helm-bibtex-library-path is a little ... kludgey. If helm-bibtex-library-path supported either a) arbitrary recursive descent for PDFs or b) two-layer descent, that would more properly match what Papers uses (and also match the PDF schemes of other researchers that might have a two-directory-deep scheme like papers/year/paper.pdf or papers/author/paper.pdf).
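To make the two options concrete, here is a rough sketch of a depth-limited search for PDFs. It is not how helm-bibtex works; `find_pdfs` and its `max_depth` parameter are hypothetical and only illustrate the difference between "two-layer descent" (`max_depth=1`) and "arbitrary recursive descent" (`max_depth=None`).

```python
import os
from typing import Dict, Optional


def find_pdfs(root: str, max_depth: Optional[int] = None) -> Dict[str, str]:
    """Map PDF base names (without '.pdf') to their full paths under root.

    max_depth counts subdirectory levels below root: 0 searches root only,
    1 also searches its immediate subdirectories, None means no limit.
    If a base name occurs more than once, the first one found wins.
    """
    root = os.path.abspath(root)
    found: Dict[str, str] = {}
    for dirpath, dirnames, filenames in os.walk(root):
        if dirpath == root:
            depth = 0
        else:
            depth = os.path.relpath(dirpath, root).count(os.sep) + 1
        if max_depth is not None and depth >= max_depth:
            dirnames[:] = []  # prune: do not descend below this level
        for name in filenames:
            stem, ext = os.path.splitext(name)
            if ext.lower() == ".pdf":
                found.setdefault(stem, os.path.join(dirpath, name))
    return found
```

For a Papers library, calling `find_pdfs` on the `Files` directory with `max_depth=1` would cover the 256 subfolders without listing each one. Walking the tree on every load is exactly the kind of extra work that could slow down large libraries, so caching the result would probably be needed.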

I will investigate the naming scheme from the Papers end (either being able to generate the filenames automatically from bib entries or using another deterministic naming scheme like citekey); or to produce or generate a flat directory of PDFs named using the citekey that are softlinks to the actual papers.

Again, vielen Dank for your hard work and consideration!

tmalsburg commented 8 years ago

> and adding 256 directories to helm-bibtex-library-path is a little ... kludgey.

You could do this automatically with two or three lines of elisp.

> I will investigate the naming scheme from the Papers

Looking forward to seeing what you find out about this.

jowens commented 8 years ago

OK. Didn't know the impact on helm-bibtex if I put 256 directories in helm-bibtex-library-path. Assuming from your response it won't be an issue. I'll look into it!

tmalsburg commented 8 years ago

I haven't tried it but it should be ok.

tmalsburg commented 8 years ago

One thing I don't understand is this: If you make a new entry with Papers, you end up with a PDF with a cryptic name but you will also have a file entry in the BibTeX entry 🡒 No problem. If you make an entry manually, you may not have a file field but you will also not have a cryptic file name for the PDF 🡒 Again no problem. Your problem seems to be that you have Papers's cryptic file names but no file fields, but I don't understand how this situation can arise.

Perhaps you do have file fields but they point to the PDFs on other people's machines (since the bibliography is shared)? If that's the problem, the easiest solution would be a simple search-and-replace to fix the paths in the file fields, e.g. /Users/jdoe/Dropbox/Library.papers3/Files/9B/9B1D4DB3-509B-4A1E-9C60-E908FA901731.pdf:application/pdf 🡒 /Users/jowens/Dropbox/Library.papers3/Files/9B/9B1D4DB3-509B-4A1E-9C60-E908FA901731.pdf:application/pdf (s/jdoe/jowens/g).
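A slightly more careful variant of that search-and-replace could restrict the substitution to file fields, so that a user name occurring elsewhere (in a title or note, say) is left alone. The sketch below assumes each file field sits on a single line; `rewrite_file_paths` is a hypothetical helper, not part of helm-bibtex.

```python
import re


def rewrite_file_paths(bibtex_text: str, old_prefix: str, new_prefix: str) -> str:
    """Replace old_prefix with new_prefix on lines that start a file field.

    Assumption: every 'file = {...}' field fits on one line.
    """
    file_line = re.compile(r"^\s*file\s*=", re.IGNORECASE)
    out = []
    for line in bibtex_text.splitlines(keepends=True):
        if file_line.match(line):
            line = line.replace(old_prefix, new_prefix)
        out.append(line)
    return "".join(out)
```

Matching on a longer prefix such as `/Users/jdoe/` rather than the bare `jdoe` reduces the chance of touching unrelated text.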

jowens commented 8 years ago

What @extracts and I are working on is a set of scripts that does the following, which I think satisfies my needs and also requires no work from you:

Then I can pass that single directory to helm-bibtex in helm-bibtex-library-path, the papers are all named using my citekey format, and helm-bibtex works without modification, except for the translation issue in https://github.com/tmalsburg/helm-bibtex/issues/91 :) .

tmalsburg commented 8 years ago

Elaborating my comment above: For example, we could allow the user to specify custom transformations that are applied to the BibTeX file before it is parsed by helm-bibtex. The benefit is that this facility can be used not just to solve the present problem but perhaps also all kinds of other problems. Another benefit is that we would need no Papers-specific changes in helm-core but could factor this stuff out into customizations.
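As a language-neutral illustration of the idea (helm-bibtex itself is written in Emacs Lisp, and no such facility exists in it as far as this thread shows), the transformations could be a user-supplied list of text-to-text functions applied in order before parsing. All names below are hypothetical.

```python
from typing import Callable, Iterable

# A transformation takes the raw BibTeX text and returns modified text.
Transform = Callable[[str], str]


def apply_transforms(bibtex_text: str, transforms: Iterable[Transform]) -> str:
    """Run each user-supplied transformation in order and return the result."""
    for transform in transforms:
        bibtex_text = transform(bibtex_text)
    return bibtex_text


def fix_home(text: str) -> str:
    """Example user transformation: point shared paths at the local home."""
    return text.replace("/Users/jdoe/", "/Users/jowens/")


raw = "file = {{a.pdf:/Users/jdoe/Dropbox/a.pdf:application/pdf}}"
print(apply_transforms(raw, [fix_home]))
```

Because the transformations only change the text handed to the parser, the BibTeX file on disk would stay untouched.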

tmalsburg commented 8 years ago

Re your comment: That sounds like a good solution from the helm-bibtex perspective.

tmalsburg commented 8 years ago

But note that Dropbox doesn't understand softlinks. Last time I tried it, Dropbox replaced softlinks by hard copies of the files. Since your group seems to rely on Dropbox, this may be inconvenient because your bibliography would occupy twice as much space in the Dropbox as necessary.

jowens commented 8 years ago

I like the idea of custom-transformations. It's not entirely clear to me how visible these transformations are. My use case is:

If your custom-transformation idea could let me see/use my own citekey format but under the hood choose files using the derivable citekey format, I am happy.

In terms of softlinks + Dropbox: I absolutely have to figure this out, thank you for pointing it out. I'll work with @extracts on this. Another option is another cloud service (box.com or the like). But yeah, we definitely don't want to burn 2x the storage (that would exceed my Dropbox cap for sure).

I appreciate your quick responses and ideas! I hope our dialog will make helm-bibtex better.

jowens commented 8 years ago

box.com ignores symlinks, it appears. But I think I can neatly solve this issue by leaving my actual papers in Dropbox ("owned" by Papers, with unique names), putting the single directory with citekey-named papers outside of Dropbox, and setting helm-bibtex-library-path to this single directory. That seems cleaner. (I don't need to share this single directory with anyone else. If a collaborator needs it, he/she can generate it him/herself.)
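A minimal sketch of generating such a directory, assuming a mapping from citekeys to the Papers-managed PDF paths is already available (for example, extracted from a Papers export; how that mapping is obtained is not settled in this thread). `build_citekey_links` is a hypothetical name.

```python
import os
from typing import Dict, List


def build_citekey_links(key_to_pdf: Dict[str, str], link_dir: str) -> List[str]:
    """Create <link_dir>/<citekey>.pdf symlinks pointing at the real PDFs.

    Existing symlinks are refreshed; regular files with the same name are
    left untouched. Returns the paths of the links that were (re)created.
    """
    os.makedirs(link_dir, exist_ok=True)
    created = []
    for citekey, pdf_path in key_to_pdf.items():
        link = os.path.join(link_dir, citekey + ".pdf")
        if os.path.islink(link):
            os.remove(link)  # replace a stale or outdated link
        elif os.path.exists(link):
            continue  # a real file lives here; do not overwrite it
        os.symlink(os.path.abspath(pdf_path), link)
        created.append(link)
    return created
```

The link directory would live outside Dropbox, with the link targets inside it, so the sync client never sees the symlinks.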

tmalsburg commented 8 years ago

Re other cloud services: they may very well have the same issues with softlinks. The underlying problem is that not all file systems support softlinks (e.g., FAT32) and cloud services can therefore not rely on them.

jowens commented 8 years ago

Yeah, but if I'm linking into a Dropbox share, that should be OK. (Actual files live inside Dropbox. Pointed-to files, labeled with citekeys, live outside Dropbox.)