plk / biber

Backend processor for BibLaTeX
Artistic License 2.0

Scalability and maintainability for big bib-files #371

Closed: proteusGIT closed this issue 2 years ago

proteusGIT commented 3 years ago

I have a .bib file with 1086 entries, and biber requires 6 seconds to process it, which is a lot relative to the overall compile time of my document.

I propose to add the following features to biber.

plk commented 3 years ago

I am afraid it's more complex than this. Firstly, biber doesn't save state, by design, to make it easier to automate with things like latexmk. Checking the key list isn't nearly enough, as many things besides the keys determine whether the .bbl will be identical: sorting and labelling templates, encoding settings etc. You can already export a .bib file with only the used references; see the --output-format=bibtex option when not used in tool mode.

People tend to see that standard bibtex is much faster than biber and then get rather annoyed about a few extra seconds of processing for a batch program, but biber does far more than bibtex does and isn't written in C ...

moewew commented 3 years ago

@proteusGIT probably knows about it and PLK already mentioned it, but in everyday use a tool like latexmk can be really helpful here. latexmk monitors both the .bcf file (which tells Biber which entries to cite) and the .bib file (where the data for each key could change) and only runs Biber if one of them changes. This does not cut down on Biber's running time per se, but it avoids unnecessary Biber runs.
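For illustration, a minimal invocation might look like this (main.tex is a placeholder; latexmk notices the generated .bcf file and runs Biber itself, so no extra configuration should be needed for a standard biblatex document):

```sh
# Build with pdflatex, rerunning Biber only when the .bcf or the
# .bib data it depends on has changed since the last run.
latexmk -pdf main.tex

# Or keep watching the sources and rebuild automatically on each change.
latexmk -pdf -pvc main.tex
```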

aterenin commented 3 years ago

In my experience, the extra time is only a problem if the same bcf file is used as input to Biber over and over and over again, as part of an automated tool such as LaTeX Workshop in VS Code that repeatedly recompiles the same document.

An easy way to avoid this is simply to compute the MD5 hash of the .bcf file and only invoke Biber when that hash changes. The following one-liner does that: if [ "$(openssl md5 %DOCFILE%.bcf)" != "$(cat %DOCFILE%.bcf.hash)" ]; then biber --nolog %DOCFILE%; openssl md5 %DOCFILE%.bcf > %DOCFILE%.bcf.hash; fi. Latexmk will do this automatically, but the above is helpful for those using Tectonic, which by its nature does not require Latexmk.
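Spelled out as a small script, the same check might look like this. This is only a sketch: DOC stands in for the %DOCFILE% placeholder above, and md5sum would work in place of openssl md5.

```sh
#!/bin/sh
# Rerun Biber only when the document's .bcf file has actually changed.
DOC=main   # document base name (LaTeX Workshop's %DOCFILE% placeholder)

# Compare the current .bcf hash with the one saved on the previous run;
# the stderr redirect keeps the very first run (no .hash file yet) quiet.
if [ "$(openssl md5 "$DOC.bcf")" != "$(cat "$DOC.bcf.hash" 2>/dev/null)" ]; then
    biber --nolog "$DOC"
    openssl md5 "$DOC.bcf" > "$DOC.bcf.hash"
fi
```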

dspinellis commented 2 years ago

To obtain only the .bib entries used in a given document, you can use the bibget tool.

plk commented 2 years ago

biber also has options to output a bibtex-format file containing only the references used in a document, instead of a .bbl.

dspinellis commented 2 years ago

Thank you for pointing it out. I wrote bibget for a different use case (sharing a part of your private bibliography files). However, for the case made here (performance), Biber extraction is a better choice, and make-like dependency checks are an even better one.

thomwiggers commented 2 years ago

I've similarly written a tool like bibget to first filter out of a 63431-entry .bib file the references that are actually present in the .bcf. It seems biber processes and decodes all references in a .bib file before throwing out those that are not relevant. Running biber, which takes over 60 seconds, accounts for the vast majority of my compile time.

My tool just uses a dumb regex to decide where the boundaries of entries are. I'm not sure if I'm missing some correctness argument that justifies the extra processing biber does here, but if not, could this maybe be considered?
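For concreteness, the whole "dumb" filtering idea fits in a few lines of shell. This is a sketch under two assumptions, not anything biber itself supports: cited keys appear in the .bcf as <bcf:citekey ...>key</bcf:citekey> elements, and every entry starts at the beginning of a line as @type{key,. Note what it silently drops, e.g. @string definitions and crossref'd parent entries, which hints at why a semantic parser does more work.

```sh
#!/bin/sh
# Sketch: keep only the .bib entries whose keys are cited in the .bcf.
# main.bcf and big.bib are placeholder file names.

# 1. Pull the cited keys out of the .bcf (an XML file).
sed -n 's/.*<bcf:citekey[^>]*>\([^<]*\)<\/bcf:citekey>.*/\1/p' main.bcf |
    sort -u > cited.keys

# 2. Copy an entry through only when its key is in the cited set.
awk 'NR == FNR { keys[$1]; next }        # first file: load cited keys
     /^@/     { split($0, f, /[{(,]/)    # "@article{key," -> f[2] is the key
                p = (f[2] in keys) }
     p' cited.keys big.bib > filtered.bib
```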

dspinellis commented 2 years ago

@thomwiggers The "dumb regex" is fine and will work for your files, but it is likely to fail on the formatting conventions or special cases found in other users' files.

thomwiggers commented 2 years ago

@dspinellis sure, parsing a non-regular language with a regex is asking for trouble. But that doesn't mean a simpler parser couldn't filter out irrelevant entries before biber applies all of its processing power.

plk commented 2 years ago

biber is really a semantic tool by design and I don't really want to add a syntactic filtering layer. This use case is fairly straightforward, I would have thought: just run biber once in normal mode with the --output-format=bibtex option to obtain a .bib containing only the entries referenced in the document. That's a one-time operation, and if you change the document, you can then just manually add any new references to (and optionally delete removed ones from) the new .bib, just like you would with any normal .bib file.
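Concretely, that one-time extraction might look like the command below. The --output-format=bibtex option is the one discussed above; where the generated .bib file lands can vary by biber version, so check biber's console output or log for the file it wrote.

```sh
# Normal (non-tool-mode) run, but emit a bibtex-format file containing
# only the cited entries instead of the usual .bbl.
biber --output-format=bibtex mydoc
```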

dspinellis commented 2 years ago

@plk Running with --output-format=bibtex seems like a good approach. I'll keep it in mind.

thomwiggers commented 2 years ago

That does require changing the LaTeX file to switch between .bib source files, which is slightly annoying. It also kind of defeats the point of projects like cryptobib. But yeah, it does work.

Regarding not wanting to implement it: fair enough; after a bit of digging it definitely doesn't seem like a trivial change.

e-dervieux commented 1 year ago

I am having the same problem. I write a lot of documents, and I like keeping my bibliography unified between them, so I made a git repo containing my main .bib file; I can simply bring it into a new LaTeX project by cloning or push-pulling it when need be. The thing is, this file grew over the years, and I now have nearly a thousand entries in it, spanning over 11k lines!

While BibTeX takes less than half a second to process it, biber takes several seconds, eating up a significant part of the overall compilation time. Now, I understand that biber does a lot more than BibTeX and that it is not written in C. However, it would be nice to have a preliminary parsing step that keeps only the relevant entries before actually processing them, to speed up compilation.

For instance, a first parsing pass could extract the relevant lines and put them into a second, temporary .bib file before actually processing the latter. Could this perhaps be done with a wrapper around bibtool's bibget?


In all cases, it is quite annoying, and I do not see an alternative that achieves both of my goals: keeping a single, unified .bib file shared across all my documents, and keeping compilation times reasonable.

If you have any idea, please let me know.


FYI, to put things into perspective: it is quite common for me to write relatively small documents (20-50 pages max) with well over one hundred citations.