tmalsburg / helm-bibtex

Search and manage bibliographies in Emacs
GNU General Public License v2.0

slow parsing #159

Closed mclearc closed 7 years ago

mclearc commented 7 years ago

This may not be a bug, but I have a moderately long bib file (about 4200 entries), which takes several minutes to parse on initial start-up of helm-bibtex. Any suggestions about how this might get sped up? It may be relevant that I'm using Zotero and Better Bib(La)TeX to generate the bib file. Thanks.

tmalsburg commented 7 years ago

This can definitely be considered a bug. My bib has ~1200 entries and parsing takes a fraction of a second. Perhaps it's related to finding PDFs. Could you please post a representative entry from your bib file?

mclearc commented 7 years ago

Sure, here are three fairly representative entries. Helm-bibtex does eventually parse the file, and after the parsing things are fine until I have to restart emacs.

@article{fricker2016,
  title = {What's the {{Point}} of {{Blame}}? {{A Paradigm Based Explanation}}: {{What}}'s the {{Point}} of {{Blame}}},
  volume = {50},
  timestamp = {2016-10-14T18:44:40Z},
  number = {1},
  journaltitle = {Noûs},
  author = {Fricker, Miranda},
  date = {2016},
  pages = {165--183},
  file = {fricker2016_what's_the_point_of_blame.pdf:/MasterLib/fricker2016_what's_the_point_of_blame.pdf:application/pdf}
}

@article{valaris2016,
  title = {What {{Reasoning Might Be}}},
  abstract = {The philosophical literature on reasoning is dominated by the assumption that reasoning is essentially a matter of following rules. This paper challenges this view, by arguing that the rule-following model of reasoning, by arguing that it misrepresents the nature of reasoning as a personal-level activity. Reasoning must reflect the reasoner’s take on her evidence. The rule-following model seems ill-suited to accommodate this fact. Accordingly, this paper suggests replacing the rule-following model with a different, semantic approach to reasoning.},
  timestamp = {2016-11-02T03:58:19Z},
  author = {Valaris, Markos},
  date = {2016},
  keywords = {reason,reasoning},
  file = {valaris2016_what_reasoning_might_be.pdf:/MasterLib/valaris2016_what_reasoning_might_be.pdf:application/pdf}
}

@article{weinberg2016,
  title = {What Is the a Priori, That Thou Art Mindful of It?: {{A}} Comment on {{Albert Casullo}}, {{Essays}} on a Priori Justification and Knowledge},
  volume = {173},
  timestamp = {2016-10-14T18:44:40Z},
  number = {6},
  journaltitle = {Philos Stud},
  author = {Weinberg, Jonathan M.},
  date = {2016},
  pages = {1695--1703},
  keywords = {a priori,empiricism,epistemology},
  file = {weinberg2016_what_is_the_a_priori,_that_thou_art_mindful_of_it.pdf:/MasterLib/weinberg2016_what_is_the_a_priori,_that_thou_art_mindful_of_it.pdf:application/pdf}
}

tmalsburg commented 7 years ago

Could you please set bibtex-completion-pdf-field to nil and test again?

mclearc commented 7 years ago

Thanks for the suggestion. Looks like that speeds things up. It still takes about 30 seconds the first time I load helm-bibtex, but it is definitely faster than with bibtex-completion-pdf-field set.

tmalsburg commented 7 years ago

Interesting, I expected no effect or that parsing would be around 1 or 2 seconds. Would you mind giving me access to your complete bib file? My email address is: malsburg@uni-potsdam.de

mclearc commented 7 years ago

Thanks. I sent you the bib file. I should also say, in case it matters, that I use use-package to lazy load helm-bibtex. But I wouldn't think that would affect load time to the degree that it has.

tmalsburg commented 7 years ago

Ok, I can reproduce this problem. But I get extremely variable load times ranging from 8s to 110s and I couldn't pin down what's causing this variability. One thing that I noticed is that loading is really fast (1.3s) when I use the following settings:

(setq helm-bibtex-bibliography "/tmp/test.bib")
(setq helm-bibtex-notes-path nil)
(setq helm-bibtex-library-path nil)
(setq helm-bibtex-pdf-field nil)

Could you please show me how you set these variables?

I think the culprit is the code that searches for notes. Do you have one notes file? And if yes, how large is it?

tmalsburg commented 7 years ago

The other issue likely is that you're referencing PDFs via the file field, which is known to be slow. I store all PDFs in one directory and name them <bibtex-key>.pdf, which makes it much easier to find them. I would recommend that you change to that way of linking PDFs, but it appears that you're also using JabRef and that doesn't understand this.
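A minimal sketch of the setup described above, assuming all PDFs live in a single directory and are named after their BibTeX keys. The directory path is hypothetical; the two variables are the bibtex-completion options mentioned elsewhere in this thread.

```elisp
;; Hypothetical directory holding files such as fricker2016.pdf.
(setq bibtex-completion-library-path '("~/bibliography/pdfs/")
      ;; Do not parse the `file' field of each entry to locate PDFs.
      bibtex-completion-pdf-field nil)
```

With this configuration a PDF is looked up by file name in the library directory rather than by parsing the file field of every entry.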

mclearc commented 7 years ago

Here's what I have under :config in my use-package set-up:

    (setq bibtex-completion-bibliography "/Users/roambot/Dropbox/Work/Master.bib" 
          bibtex-completion-library-path "/Users/roambot/Dropbox/Work/MasterLib/"
          bibtex-completion-pdf-field nil
          bibtex-completion-notes-path "/Users/Roambot/projects/notebook/content/org_notes"
          bibtex-completion-additional-search-fields '(keywords)
          bibtex-completion-notes-extension ".org"
          helm-bibtex-full-frame nil) 
    ;; Set insert citekey with markdown citekeys for org-mode
    (setq bibtex-completion-format-citation-functions
          '((org-mode      . bibtex-completion-format-citation-pandoc-citeproc)
            (latex-mode    . bibtex-completion-format-citation-cite)
            (markdown-mode . bibtex-completion-format-citation-pandoc-citeproc)
            (default       . bibtex-completion-format-citation-default)))
    ;; Set default action for helm-bibtex as inserting citation
    (helm-delete-action-from-source "Insert citation" helm-source-bibtex)
    (helm-add-action-to-source "Insert citation" 'helm-bibtex-insert-citation helm-source-bibtex 0)
    (setq bibtex-completion-pdf-symbol "⌘")
    (setq bibtex-completion-notes-symbol "✎")
    )

I don't use a single notes file, but rather one note file per entry, all of which are saved in one directory, "org_notes". I also store all my PDFs in one directory. I don't use <bibtex-key>.pdf but rather <bibtex-key><title>.pdf. But I would be surprised if this generates significant time lag. I have noticed a significant decrease in opening time, however, now that I have set bibtex-completion-pdf-field to nil.

tgrigera commented 7 years ago

Hi, I've just found this open issue, and I'd like to point out that I'm having the same problem, and that I am also using a Zotero/BetterBibTeX-generated .bib. I guess it is something related to the way Zotero is exporting the .bib, but I couldn't pinpoint it. I've found that it sometimes helps to delete the comments Zotero leaves at the end of the .bib.

anghyflawn commented 7 years ago

I'm not sure if it is the same issue, but I've been having a problem that I'm not sure how to approach debugging. After any change in the .bib file, M-x helm-bibtex fires up the interface, but after any keyboard input it doesn't show anything and just echoes Parsing bibliography file <filepath> into the minibuffer forever. However, (bibtex-completion-candidates) is quite fast, and after evaluating that, M-x helm-bibtex works fine until the next change in the .bib file. This behaviour happens even if I just run emacs -q.

jmburgos commented 7 years ago

I had the same problem that anghyflawn described and never could figure out why that happened. At the end I switched to ivy-bibtex which works with no issues.

anghyflawn commented 7 years ago

Thanks @jmburgos, just to confirm that in my setup ivy-bibtex works with no issues, so it seems to be a problem on the helm side of things?

tmalsburg commented 7 years ago

@anghyflawn I think this is a separate issue. Strange though because ivy-bibtex and helm-bibtex are sharing most of their code. I never experienced this problem. A reproducible example would be helpful.

anghyflawn commented 7 years ago

Thanks! I have tried with a minimal setup (emacs -q, (require 'helm-bibtex)) and a short .bib file and it works, so I'm assuming it's something in my actual .bib file. The file has just over 2,800 entries; I've tried bisecting it but not yet found a consistent pattern of when it works and when it doesn't. I'll keep trying. (The file is here)

tmalsburg commented 7 years ago

@anghyflawn, parsing your bib file takes less than a second in my emacs, so it's probably not about the file but your configuration. Perhaps start with a minimal helm-bibtex configuration and then add your customizations step-by-step to see at which point it slows down.

Interesting bibliography, by the way.

anghyflawn commented 7 years ago

OK, so I have been playing around with it and it seems to me there's some intermittent miscommunication between the cache and helm. I have managed to build up from emacs -q to my full helm-bibtex configuration (I used helm defaults, even though I don't in real life) without running into this issue. However, having got there I then made some edits to the .bib file without changing the configuration, and the problem recurred (and ivy-bibtex still works fine, i.e. it does reread the bibliography). It looks to me specifically like calling helm-bibtex triggers a rereading of the bibliography but for some reason that doesn't feed back through to helm.

> Interesting bibliography, by the way.

Heh, I wonder how many other helm-bibtex-using linguists there are :)

tmalsburg commented 7 years ago

Thank you, @anghyflawn, for reporting back. It's certainly useful to know that the problem does not occur with ivy-bibtex but overall the issue seems even more mysterious now. The reason is that in my config reading your bibliography (with helm-bibtex) is really fast even the first time which should be the worst-case scenario. I always suspected that parsing the file field was the issue but the fact that ivy-bibtex is fast speaks against that. Hm ... Could you please try setting bibtex-completion-pdf-field to nil and then rereading the bibliography via C-u M-x helm-bibtex (C-u clears the cache)? If this takes a long time, we can rule the file field out as a potential source of this problem.

> Heh, I wonder how many other helm-bibtex-using linguists there are :)

Quite a few actually. Linguists have quite a strong presence in the Emacs community.

anghyflawn commented 7 years ago

bibtex-completion-pdf-field being nil doesn't seem to help, I'm afraid. However, your idea to call it with a prefix argument has allowed me to isolate what I think is the problem. If I call it with C-u it works as expected, i.e. rebuilds the cache and then launches helm, which works normally. Weirdly, every once in a while, I do get the same behaviour if I call helm-bibtex without the prefix argument but with an invalid cache (i.e. after an edit). The problem appears if the helm interface pops up before the parsing of the bibliography. It looks like there's some sort of weird race condition to me — if the helm interface gets ahead of the parsing, then some kind of blocking occurs, but if the parsing either does get ahead of helm, or you force it to do so via the prefix argument, everything works. Does that make sense? (This being a helm issue would also explain why ivy-bibtex doesn't have the same problem).

For the record this is my emacs version (this is from the Arch repos): GNU Emacs 25.2.1 (x86_64-unknown-linux-gnu, GTK+ Version 3.22.10) of 2017-04-22
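A minimal sketch of the workaround described in this comment, assuming the stall only occurs when the helm interface opens before the bibliography has been parsed. The command name my/helm-bibtex-prewarmed is hypothetical and not part of helm-bibtex; it only combines two calls already mentioned in this thread.

```elisp
(require 'helm-bibtex)

(defun my/helm-bibtex-prewarmed ()
  "Refresh the bibtex-completion cache, then open helm-bibtex.
Hypothetical wrapper, not part of helm-bibtex."
  (interactive)
  ;; Parse the bibliography up front so that helm starts with a
  ;; valid cache instead of triggering the parse itself.
  (bibtex-completion-candidates)
  (helm-bibtex))
```

This avoids the full cache reset that C-u M-x helm-bibtex performs, but whether it sidesteps the suspected race condition in every case is untested.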

tmalsburg commented 7 years ago

Interesting. Does Emacs freeze when the race condition strikes? I experienced this with a couple of helm sources recently but not with helm-bibtex.

anghyflawn commented 7 years ago

No, emacs remains completely responsive.

tgrigera commented 7 years ago

I'm following this issue, since I am experiencing the same problem (freezing of the helm-bibtex interface, not emacs) when the bibliography has to be parsed initially (subsequently it works fine; the problem recurs when I restart emacs or when the .bib file is changed and needs reparsing). Forcing a re-parse with C-u seems to solve this, as @anghyflawn reports (reparsing takes about 7 seconds for 900+ items).

tmalsburg commented 7 years ago

@tgrigera thanks for reporting. 7 seconds is excessive for ~900 items, it should take around 1s on a recent computer. Unfortunately, I still can't reproduce this problem which makes it very hard for me to pin down what's causing this. A minimal reproducible example would be great.

tgrigera commented 7 years ago

@tmalsburg I know. I've tried producing a minimal .bib with the problem but have failed. Several times it happened that deleting a particular entry seemed to solve the problem, but then the entry in its own .bib worked perfectly. I'll report any news. In the meantime, 7s is bearable and allows me to do my work. I was afraid I would lose helm-bibtex due to this issue, but with this temporary solution I can go on, which is great because I find this package so useful.

jmburgos commented 7 years ago

@tgrigera, have you tried switching to ivy-bibtex? The functionality is the same, and I experience no lags.

tgrigera commented 7 years ago

@jmburgos I've never used ivy, and I'm not quite ready to try a new completion package (I haven't even mastered helm yet)

anghyflawn commented 7 years ago

I have similarly been trying and failing to construct a reproducible example from my bibliography, but the recurrence of the problem has been essentially random. One additional generalization that I seem to be able to make is that the reparsing takes more and more time the longer I run emacs (I run it as a daemon and rarely switch my laptop off, so I can have pretty long sessions). I do wonder if it's something we should ask about over at helm, since all the bibliography code seems to work fine with ivy?

tmalsburg commented 7 years ago

@tgrigera, the "minimal" in minimal reproducible example is not referring to a minimal bib-file but to a minimal emacs configuration that exhibits the problem. If it is triggered by a race condition, you likely need a larger bibliography in order to trigger the problem.

Re ivy-bibtex, my vague memory is that it is missing some features. More generally, I really like the helm framework because it is so powerful. Ivy in my view is basically reinventing the wheel. Nothing wrong with that, but I prefer the more mature framework.

anghyflawn commented 7 years ago

Just to report that in the latest versions (currently 20170929.1253) my problem seems to have gone away. Helm version 20170928.2056, Emacs version GNU Emacs 25.3.1 (x86_64-pc-linux-gnu, GTK+ Version 3.22.19) of 2017-09-16 on Arch Linux.

tmalsburg commented 7 years ago

@anghyflawn, that's awesome. Thanks for reporting. We didn't make any relevant changes (or did we?), so I assume that the actual problem was somewhere outside helm-bibtex.

tmalsburg commented 7 years ago

@mclearc can you confirm that the problem is solved?

mclearc commented 7 years ago

@tmalsburg there is still a slight delay (6-10 secs) on the first startup of helm-bibtex. And I can't use bibtex-completion-pdf-field. But it seems to work satisfactorily, so I think I can close the issue if others no longer have any problems.

tmalsburg commented 7 years ago

Hm, 6-10s still seems too slow assuming your bibliography still has about 4200 entries. However, I tried it again with the bibliography that you sent me a while ago and on my system loading it takes about 3 seconds which is reasonable for a bibliography that size.

junwei-wang commented 3 years ago

This bib file is about 25 MB.

What is the reasonable loading time for file of this size?

It took me a minute or more.

tmalsburg commented 3 years ago

I'm on Emacs' native-comp branch and my 2MB bibliography loads in less than a second. Based on that I'd expect 10s or so for 25MB but I haven't tried it. Also note that helm-bibtex uses caching, so the load should only happen once, and subsequent searches should be much faster (until you change the bibliography).

Parsing is done in Elisp by the package parsebib. The biggest room for improvements might be there. On my side, time is primarily spent for finding PDFs and notes. Do you have a lot of PDFs? And if yes, how are they linked? I'm using the naming scheme BibTeX-key.pdf (not the file field) which is computationally lighter.

junwei-wang commented 3 years ago

Sorry, I forgot the link to the bib. There are many crossrefs which I didn't include in my bib search path.

One possible improvement (which probably already exists) is to cache the parsed bibliography in a persistent binary file, which would speed up loading.

I don't have a lot of PDFs (<30) now. I use the BibTeX-key.pdf scheme as well.

tmalsburg commented 3 years ago

Just tested with crypto.bib and it took 15s to parse it. But resolving crossreferences took just 1-2s, so it's the raw parsing that consumes most of the time.

In my experience, native-comp Emacs is approximately 3 times faster than ordinary Emacs. So something around a minute seems plausible.

In the second run (with caching) it takes less than 0.5s to load.

Caching on disk is not implemented yet but might make sense for users of really large bibliographies. PR welcome.

anghyflawn commented 3 years ago

In the meantime persistent caching for helm is possible with e.g. psession

tmalsburg commented 3 years ago

Wow, didn't know about psession. My computer just crashed and I wish I would have been able to restore my session. :)

tmalsburg commented 3 years ago

Thierry is doing such amazing work for the Emacs ecosystem! Last week I decided to make a donation to support him and his work.

jingxuanlim commented 1 year ago

I wanted to add another data point from the slow gang. Like what's been said already, I've noticed that parse times can be quite variable, from 10s to over 100s (sometimes even longer, but I get frustrated and end up restarting wsl2, which I hope helps but probably doesn't actually). Don't know how to tell how many entries my .bib file has, but it does have 34k lines (is that long?). I'm using ivy-bibtex.
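One way to count the entries, sketched under the assumption that bibtex-completion-bibliography already points at the .bib file: bibtex-completion-candidates returns one candidate per parsed entry, so the length of that list gives the number of entries.

```elisp
(require 'bibtex-completion)

;; Number of entries that bibtex-completion parsed from the
;; configured bibliography; evaluate e.g. with M-: or in *scratch*.
(length (bibtex-completion-candidates))
```

Note that the first evaluation triggers a full parse, so it takes as long as the initial load.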

tmalsburg commented 1 year ago

There is no reason why the parse times should be variable. 10s to 100s also seems excessively long. For instance, crypto.bib (35k+ entries) loads in 7s on my system. And after the first read it takes just 0.1s (thanks to our caching mechanism). My own bibliography is about the same size as yours and takes less than 0.5s to parse. I suspect that there is another problem in your setup. If you share a minimal reproducible example (for emacs -Q), I can investigate.

Question: What is wsl2?

tmalsburg commented 1 year ago

Here is code for testing:

(require 'benchmark)
(bibtex-completion-clear-cache)
(benchmark-elapse (bibtex-completion-candidates))

jingxuanlim commented 1 year ago

Hi @tmalsburg. Thanks for agreeing to help!

wsl2 is Windows Subsystem for Linux 2.

I'm an emacs noob, so I'm not sure if I'm doing this correctly. I tried to run your code on the *scratch* buffer and could only find an output in the *Messages* buffer.

For my existing config with doomemacs:

Parsing bibliography file zotLib.bib ...
Resolving cross-references ...
Done (re)loading bibliography.
24.3117471

This is one of the faster runs and probably because I just restarted my PC. However, this number is nothing near your 7 seconds for 35k entries record, suggesting that it could be improved.

For the fresh install, I launched emacs using the command you specified emacs -Q and ran the following commands on the *scratch* buffer

(require 'package)
;; Any add to list for package-archives (to add marmalade or melpa) goes here
(add-to-list 'package-archives 
    '("MELPA" .
      "http://melpa.org/packages/"))
(package-initialize)

I then manually installed the package:

  1. M-x package-refresh-contents
  2. M-x package-install RET ivy-bibtex

After which, I ran the same test.

(setq bibtex-completion-bibliography "zotLib.bib")

(require 'ivy-bibtex)
(require 'benchmark)
(bibtex-completion-clear-cache)
(benchmark-elapse (bibtex-completion-candidates))

For this "fresh install"

Parsing bibliography file zotLib.bib ...
Resolving cross-references ...
Done (re)loading bibliography.
2.1353558

Looks decent to me! So it does look like my current config slows it down. That's interesting, but I also don't know how to improve the timing. Any ideas?

jingxuanlim commented 1 year ago

Also, I wanted to ask a question about the caching. Does it survive across sessions (i.e. after I M-x kill-emacs and run it again)? Using ivy-bibtex subsequent times after the biblio was parsed takes no time at all, but parsing happens every time I restart emacs.

tmalsburg commented 1 year ago

You didn't say how many entries your bibliography has, but 2s doesn't look unexpected. Difficult to tell what's causing the slowdown in your full setup. You'll have to debug it, i.e. incrementally commenting out parts of your config and see when it slows down.
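As a starting point for that kind of bisection, here is a sketch that times a cold load twice inside the full configuration: once as configured, and once with PDF and notes lookup switched off, the settings that made loading fast earlier in this thread. The helper name my/time-bibtex-load is hypothetical; the other calls are the ones from the test code above.

```elisp
(require 'bibtex-completion)
(require 'benchmark)

(defun my/time-bibtex-load ()
  "Return the seconds needed to parse the bibliography from scratch.
Hypothetical helper for benchmarking, not part of bibtex-completion."
  (bibtex-completion-clear-cache)
  (benchmark-elapse (bibtex-completion-candidates)))

;; Baseline with the current configuration.
(message "As configured: %.2fs" (my/time-bibtex-load))

;; Same load with PDF and notes lookup temporarily disabled.
(let ((bibtex-completion-pdf-field nil)
      (bibtex-completion-library-path nil)
      (bibtex-completion-notes-path nil))
  (message "Without PDF/notes lookup: %.2fs" (my/time-bibtex-load)))
```

If the second number is much smaller, the time is going into locating PDFs or notes; if both are similar, the slowdown more likely comes from elsewhere in the configuration.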

Re caching: Caching is in memory and therefore needs to be redone in every new session. Storing the cache on disk is likely not worth the effort given that typical bibliographies should load in 1-3 seconds.

jingxuanlim commented 1 year ago

> You didn't say how many entries your bibliography has, but 2s doesn't look unexpected. Difficult to tell what's causing the slowdown in your full setup. You'll have to debug it, i.e. incrementally commenting out parts of your config and see when it slows down.
>
> Re caching: Caching is in memory and therefore needs to be redone in every new session. Storing the cache on disk is likely not worth the effort given that typical bibliographies should load in 1-3 seconds.

Okay, understood. Thanks!

rdiaz02 commented 1 year ago

In case it helps other people: I was stumbling upon this issue, caused by "The other issue likely is that you're referencing PDFs via the file field which is known to be slow" https://github.com/tmalsburg/helm-bibtex/issues/159#issuecomment-259631194

I have my references in Zotero and export them using Better BibTeX. To try to minimize load time, I now create a modified bib file (which is regenerated whenever the bib from Zotero changes, using inotifywait). This modified bib file does not use the file field for PDFs; instead, I create dummy PDF file names that conform to bibtexkey.pdf (or bibtexkey-1.pdf, bibtexkey-2.pdf, ...), that live in a scratch directory, and these dummy files symlink to the original PDFs. This scratch directory is also the bibtex-completion-library-path. This way, I can avoid having helm-bibtex reference PDFs via the file field (i.e., I can (setq bibtex-completion-pdf-field nil), which is the default). ~This is not perfect, since I lose all the additional name context (e.g., someone2000-suppl-mat.pdf), but~ Now helm-bibtex loads a lot faster (6 seconds vs. the original 20).

The code uses R. I only use Linux and make use of symlinks, and the whole setup works because I set up a watch (with inotifywait) on the original bib file; I have no idea how to modify it to run on Macs or Windows.

Please, make sure to read the comments before trying the code: I use at least one potentially destructive operation.

Link to the code: https://gist.github.com/rdiaz02/21253f2bf00500146c307612d57254c3

Edit 1: the name of the file is now bibtex key + filename, as per suggestion of @tmalsburg (see next comment); I have stricken through the original sentence that no longer applies.

tmalsburg commented 1 year ago

> This is not perfect, since I lose all the additional name context (e.g., someone2000-suppl-mat.pdf), but now helm-bibtex loads a lot faster (6 seconds vs. the original 20).

You can keep the additional name context. Just prepend the bibtex key. helm-bibtex will find all PDF whose name starts with the bibtex key.
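A configuration sketch for this naming scheme. It assumes that matching on a key prefix is controlled by the option bibtex-completion-find-additional-pdfs, as described in the helm-bibtex README, and that this option is off by default; the directory path is hypothetical.

```elisp
;; Hypothetical directory with files such as someone2000.pdf and
;; someone2000-suppl-mat.pdf.
(setq bibtex-completion-library-path '("~/bibliography/pdfs/")
      ;; Assumption: when non-nil, every PDF whose name starts with the
      ;; BibTeX key is offered, not only <key>.pdf.
      bibtex-completion-find-additional-pdfs t)
```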

rdiaz02 commented 1 year ago

Thanks a lot for the suggestion!

My naming of files is very variable and inconsistent ("suppl-mat-someone-2020.pdf", "somethingSupplMat.pdf", etc, etc) but if I can just prepend the bibtex key to the file name, then it will be solved. I'll try to change the code and report here.