Closed · proteusGIT closed this 2 years ago
I am afraid it's more complex than that. Firstly, biber doesn't save state, by design, to make it easier to automate with tools like latexmk. Secondly, checking the key list isn't nearly enough, as many things determine whether the `.bbl` will be identical, such as sorting and labelling templates, encoding settings etc. You can already export a `.bib` file with only the used references: see the `--output-format=bibtex` option when biber is not run in tool mode.
People tend to see that standard `bibtex` is much faster than `biber` and then get rather annoyed about a few extra seconds of processing for a batch program, but `biber` does far more than `bibtex` does and isn't written in C ...
@proteusGIT probably knows about it, and PLK already mentioned it, but in everyday use a tool like `latexmk` can be really helpful here. `latexmk` monitors both the `.bcf` file (which tells Biber which entries to cite) and the `.bib` file (where the data for each key could change), and only runs Biber if one of them changes. This does not cut down on Biber's running time per se, but it avoids unnecessary Biber runs.
In my experience, the extra time is only a problem if the same `.bcf` file is used as input to Biber over and over again, as part of an automated tool such as LaTeX Workshop in VS Code that repeatedly recompiles the same document. An easy way to avoid this is to compute the MD5 hash of the `.bcf` file and only invoke Biber if it changes. The following one-liner does that:

`if [ "$(openssl md5 %DOCFILE%.bcf)" != "$(cat %DOCFILE%.bcf.hash)" ]; then biber --nolog %DOCFILE%; openssl md5 %DOCFILE%.bcf > %DOCFILE%.bcf.hash; fi`

Latexmk will do this automatically, but the above is helpful for those using Tectonic, which by its nature does not require Latexmk.
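For those rolling their own build scripts, the same check can be written as a small reusable function. This is only a sketch, not part of biber or latexmk: it assumes `md5sum` from GNU coreutils instead of `openssl md5`, `run_biber_if_needed` is a hypothetical helper name, and the document base name (`main.tex` gives `main`) is passed as the argument.

```shell
#!/bin/sh
# Re-run Biber only when the .bcf has changed since the last successful run.
run_biber_if_needed() {
    doc=$1
    new_hash=$(md5sum "$doc.bcf")
    old_hash=$(cat "$doc.bcf.hash" 2>/dev/null)  # empty on the first run
    if [ "$new_hash" != "$old_hash" ]; then
        biber --nolog "$doc" || return 1
        # Record the hash only after Biber succeeds, so a failed run
        # is retried next time.
        printf '%s\n' "$new_hash" > "$doc.bcf.hash"
    fi
}
```

Call it as `run_biber_if_needed main` between the LaTeX passes; everything except the `biber` invocation itself is plain POSIX shell.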
To obtain only the bib entries used in a given document you can use the `bibget` tool.
`biber` also has options to output bibtex format of only the references used in a document, instead of a `.bbl`.
Thank you for pointing it out. I wrote bibget for a different use case (sharing a part of your private bibliography files). However, for the case made here (performance), Biber extraction is a better choice, and make-like dependency checks are an even better one.
I've similarly written a tool like `bibget` to first filter, out of a 63431-entry `.bib` file, the references that are actually present in the `.bcf`. It seems `biber` processes and decodes all references in a `.bib` file before throwing out those that are not relevant. Running biber (which takes more than 60 seconds) accounts for the vast majority of my compile time.

My tool just uses a dumb regex to decide what the boundaries of entries are. I'm not sure if I'm missing some correctness argument that justifies the extra processing that biber does here, but otherwise could this maybe be considered?
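For what it's worth, that kind of dumb-regex pre-filter can be sketched in plain shell: pull the cite keys out of the `.bcf` (they appear there as `<bcf:citekey>` elements) and keep only the matching entries from the `.bib`, treating every line that starts with `@` as an entry boundary. `filter_bib` is a hypothetical name, and the caveats raised elsewhere in this thread apply in full: it assumes entries start in column 0 and it ignores `@string` indirection, crossrefs, and everything else biber resolves semantically.

```shell
#!/bin/sh
# filter_bib DOC.bcf BIG.bib > small.bib
# Keep only the entries whose keys are cited in the .bcf.
filter_bib() {
    bcf=$1 bib=$2
    # Cite keys appear in the .bcf as <bcf:citekey ...>key</bcf:citekey>.
    keys=$(sed -n 's/.*<bcf:citekey[^>]*>\([^<]*\)<\/bcf:citekey>.*/\1/p' "$bcf" | sort -u)
    awk -v keys="$keys" '
        BEGIN {
            n = split(keys, k, "\n")
            for (i = 1; i <= n; i++) want[k[i]] = 1
        }
        /^@/ {                                   # entry boundary
            key = $0
            sub(/^@[^{(]*[{(][ \t]*/, "", key)   # strip "@type{"
            sub(/[ \t]*,.*$/, "", key)           # strip trailing ","
            printing = (key in want)
        }
        printing { print }                       # emit lines of wanted entries
    ' "$bib"
}
```

Note that anything between a wanted entry's closing brace and the next `@` (comments, blank lines) is also copied through, which is harmless for biber's purposes.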
@thomwiggers The "dumb regex" is fine and will work for your files, but is likely to fail on file formatting conventions or some special cases of other users.
@dspinellis sure, parsing a non-regular language with a regex is asking for trouble. But that doesn't mean a simpler parser couldn't filter out irrelevant entries before biber applies all of its processing power.
`biber` is really a semantic tool by design, and I don't really want to add a syntactic filtering layer. This use case is fairly straightforward, I would have thought: just run `biber` once in normal mode with the `--output-format=bibtex` option to obtain a `.bib` containing only the entries referenced in the document. That's a one-time operation, and if you change the document, you can then just manually add any new references to (and optionally delete removed ones from) the new `.bib`, just like you would with any normal `.bib` file.
@plk Running with `--output-format=bibtex` seems like a good approach. I'll keep it in mind.

That does require changing the LaTeX file to switch between `.bib` source files, which is slightly annoying. It also somewhat defeats the point of projects like cryptobib. But yeah, it does work.

Regarding not wanting to implement it: fair enough; after a bit of digging it definitely doesn't seem like a trivial change.
I am having the same problem. I write a lot of documents, and I like keeping my bibliography unified between them. So I made a git repo containing my main `.bib` file, and I can simply bring it into a new LaTeX project by cloning or pulling it when need be. The thing is, this file grew over the years, and I now have nearly a thousand entries in it, spanning over 11k lines!

While BibTeX takes less than half a second to process it, biber takes several seconds, eating up a significant part of the overall compilation time. Now, I understand that biber does a lot more than BibTeX and that it is not written in C. However, it would be nice to have a preliminary parsing pass that keeps only the relevant entries before actually processing them, to speed up compilation.

For instance, a first pass could extract the relevant lines and put them into a second, temporary `.bib` file before actually processing the latter. This might be done using a wrapper around bibtool's `bibget`?
In all cases, it is quite annoying, and I do not see alternatives to achieve the following goals:

If you have any idea, please let me know.

FYI, to put things into perspective, it is quite common for me to write relatively small documents (i.e. 20-50 pages max) with well over one hundred citations.
I have a bib file with 1086 entries, and biber requires 6 seconds to process it, which is a lot relative to the overall compile time of my document.
I propose to add the following features to biber:

1. biber should (probably based on some parameter) first check whether the list of used keys has changed; if not, biber should abort immediately.
2. biber should (probably based on some parameter) export a bib file containing the used references only. Then, if references have only been removed, biber should (probably based on some parameter) reuse that small bib file to generate the bbl file.
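The first of these checks can be approximated today from the outside, without changes to biber. This is a sketch assuming `md5sum` and the `<bcf:citekey>` layout of current `.bcf` files (`keys_changed` is a hypothetical helper name); as noted at the top of the thread, an unchanged key list does not guarantee an identical `.bbl` (sorting and labelling templates, encodings etc. also matter), so this is an approximation only.

```shell
#!/bin/sh
# keys_changed DOC: exit 0 if the cited-key list in DOC.bcf differs from
# the one recorded at the previous call, 1 if it is unchanged.
keys_changed() {
    doc=$1
    # Hash the sorted key list, not the whole .bcf, so unrelated .bcf
    # changes (e.g. option reordering) do not trigger a rerun.
    hash=$(sed -n 's/.*<bcf:citekey[^>]*>\([^<]*\)<\/bcf:citekey>.*/\1/p' \
               "$doc.bcf" | sort -u | md5sum)
    old=$(cat "$doc.keys.hash" 2>/dev/null)
    printf '%s\n' "$hash" > "$doc.keys.hash"
    [ "$hash" != "$old" ]
}
```

Used as `keys_changed main && biber main`, it skips Biber whenever the set of cited keys is unchanged, accepting the caveat above.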