jgm / pandoc-citeproc

Library and executable for using citeproc with pandoc
BSD 3-Clause "New" or "Revised" License

performance of citation processing #190

Closed aw-bib closed 9 years ago

aw-bib commented 9 years ago

I recently tried to convert a book typeset in LaTeX to docx using pandoc. Everything worked nicely except the references from BibTeX. I was able to strip the original bibtex input from some 1500 references down to the ones actually used, but even with ~400 references pandoc-citeproc never seems to finish processing. Since the number of bibtex entries cannot be reduced any further, it would be nice if there were some other way to handle such bibliographies.

I could have gotten by with converting chapter by chapter, but with an input of ~400 entries it didn't even finish for a chapter with only 25 references. (I stopped it after some 15 min at 100% CPU.)

Besides theses and other scientific books, pandoc would also come in handy for producing bibliographies in a number of formats, e.g. something along the lines of \nocite{*} with a given bibtex input. However, for annual reporting schemes one easily hits several hundred publications. #71 does not seem to gain enough here.

I tried pandoc 1.15.1 on linux.

njbart commented 9 years ago

Well, this is what it looks like on my Mid-2011 MacBook Air, with a 1683-item biblatex file:

$ time -p pandoc -s -F pandoc-citeproc -o test.html << EOT
---
bibliography: test.bib
nocite: '@*'
...
EOT

real 58.08
user 56.97
sys 0.86
$ 

So this doesn’t look quite as bad as your report suggests.

Can you process your bib(la)tex files with latex/pdflatex/xelatex/… and bibtex/biber?

Any error messages with pandoc-citeproc -y yourfile.bib?

Any error messages with biber --tool -V yourfile.bib?

aw-bib commented 9 years ago

This sounds interesting indeed.

> Can you process your bib(la)tex files with latex/pdflatex/xelatex/… and bibtex/biber?

Yes, in LaTeX everything compiles nicely and I get a bibliography as well.

Are there any known cases where pandoc-citeproc is a bit pickier than e.g. bibtex?

I'll check the suggested tools tonight.

aw-bib commented 9 years ago

> Any error messages with biber --tool -V yourfile.bib?

Indeed, this turned up an error with an invalid key, which I fixed.

As for pandoc-citeproc -y yourfile.bib, I see no error message as such. However, Debian's version of pandoc (1.12) throws a

Stack space overflow: current size 8388608 bytes.
Use `+RTS -Ksize -RTS' to increase it.

It does not allow the RTS options, so I tried the latest and greatest deb from pandoc (1.15.1). This one starts running, eats some 9 GB of RAM and sits there. Any ideas what may be eating up the RAM? To me it sounds a bit like a parsing error, but since I don't know what pandoc-citeproc is trying to accomplish, I have no idea what to look for.

jgm commented 9 years ago

+++ Alexander Wagner [Nov 09 15 10:25 ]:

> Any error messages with biber --tool -V yourfile.bib?
>
> Indeed, this turned up an error with an invalid key, which I fixed.
>
> As for pandoc-citeproc -y yourfile.bib, I see no error message as such. However, Debian's version of pandoc (1.12) throws a Stack space overflow: current size 8388608 bytes. Use `+RTS -Ksize -RTS' to increase it.
>
> It does not allow the RTS options, so I tried the latest and greatest deb from pandoc (1.15.1). This one starts running, eats some 9 GB of RAM and sits there. Any ideas what may be eating up the RAM? To me it sounds a bit like a parsing error, but since I don't know what pandoc-citeproc is trying to accomplish, I have no idea what to look for.

Can you upload your bibtex file somewhere so we can test?

aw-bib commented 9 years ago

> Can you upload your bibtex file somewhere so we can test?

Sure. Feel free to fetch it from http://www.desy.de/~arwagner/pandoc-citeproc.bib

njbart commented 9 years ago

Delete CROSSREF = {Walden-2008}, from

@BOOK{Walden-2008,
  CROSSREF  = {Walden-2008},
  EDITION   = {1. publ.},
  EDITOR    = {Scott Walden},
  ISBN      = {9781405139243},
  LOCATION  = {Malden, MA},
  PAGETOTAL = {XII, 325},
  PPN_gvk   = {566382393},
  PUBLISHER = {Blackwell},
  SERIES    = {New directions in aesthetics},
  SUBTITLE  = {essays on the pencil of nature},
  TITLE     = {{P}hotography and philosophy},
  YEAR      = {2008},
}

… and try again.

Quite clearly something you should not have in your data. Not sure whether it’s possible (or worth trying) for pandoc-citeproc to catch this.

aw-bib commented 9 years ago

Ah! A loop, indeed. And of course you're right. How did you find it? I deal with other people's bibliographies quite a bit, so knowing how to detect such errors would come in handy.

aw-bib commented 9 years ago

@nickbart1980 you made my day. :)

300 pages later I can report a working conversion including all bibliographic references. There is no performance issue after all; it was just the looping crossref.

Maybe you can comment here on how to find such errors or how you did it.

njbart commented 9 years ago

No special tools, I’m afraid, just vgrep :-)
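For files too large to eyeball, a short script could do the same check. The following is only a rough sketch, not an existing tool: the function names `crossref_map` and `find_crossref_cycles` are made up for illustration, and the regular expressions are a heuristic that assumes one `crossref` field per entry and conventionally formatted entries. It is not a real BibTeX parser.

```python
import re

# Heuristic: an entry header such as "@BOOK{Walden-2008," (captures the key).
ENTRY_RE = re.compile(r'@\s*(\w+)\s*\{\s*([^,\s]+)\s*,')
# Heuristic: a field such as "CROSSREF = {Walden-2008}" (braces or quotes).
CROSSREF_RE = re.compile(r'\bcrossref\s*=\s*[{"]\s*([^}"]+?)\s*[}"]',
                         re.IGNORECASE)


def crossref_map(bibtext):
    """Map each entry key to the key named in its crossref field, if any."""
    refs = {}
    headers = list(ENTRY_RE.finditer(bibtext))
    for i, header in enumerate(headers):
        # The entry body runs up to the next entry header (or end of file).
        end = headers[i + 1].start() if i + 1 < len(headers) else len(bibtext)
        found = CROSSREF_RE.search(bibtext[header.end():end])
        if found:
            refs[header.group(2)] = found.group(1)
    return refs


def find_crossref_cycles(bibtext):
    """Return a list of crossref loops, each as a list of entry keys."""
    refs = crossref_map(bibtext)
    cycles = []
    reported = set()
    for start in refs:
        path = []
        key = start
        # Follow the chain until it ends or revisits a key on this path.
        while key in refs and key not in path:
            path.append(key)
            key = refs[key]
        if key in path:
            cycle = path[path.index(key):]
            if not reported.intersection(cycle):
                cycles.append(cycle)
                reported.update(cycle)
    return cycles


# Demo on the problematic entry from this thread (abbreviated).
demo = """
@BOOK{Walden-2008,
  CROSSREF  = {Walden-2008},
  TITLE     = {{P}hotography and philosophy},
}
"""
for cycle in find_crossref_cycles(demo):
    print("crossref loop:", " -> ".join(cycle + cycle[:1]))
```

To check a real file, read it into a string first, for example with `open("yourfile.bib", encoding="utf-8").read()`, and pass that to `find_crossref_cycles`.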

jgm commented 9 years ago

We should probably fix pandoc-citeproc so it doesn't go into an infinite loop even with a loopy bibtex file. So I'll reopen this as a reminder to do that.
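One way such a guard could work, sketched in Python rather than in pandoc-citeproc's own Haskell: remember which keys have already been visited while following a crossref chain, and stop as soon as one repeats. This is not the actual pandoc-citeproc code. The function name `resolve_crossrefs` and the dict-based entry representation are assumptions for illustration, and real bibtex/biblatex inheritance rules are more involved than the plain "fill in missing fields" used here.

```python
def resolve_crossrefs(entries):
    """Return a copy of entries with fields inherited along crossref chains.

    entries maps an entry key to a dict of its fields. Each chain is followed
    only until a key repeats, so a self-reference or a longer loop terminates
    instead of recursing forever.
    """
    resolved = {}
    for key, fields in entries.items():
        merged = dict(fields)
        seen = {key}  # keys already visited on this chain
        parent = fields.get("crossref")
        while parent in entries and parent not in seen:
            seen.add(parent)
            # Inherit only fields the child does not define itself.
            for name, value in entries[parent].items():
                merged.setdefault(name, value)
            parent = entries[parent].get("crossref")
        resolved[key] = merged
    return resolved


# Hypothetical data: a self-referencing book plus a chapter pointing at it.
entries = {
    "Walden-2008": {"crossref": "Walden-2008", "publisher": "Blackwell"},
    "some-chapter": {"crossref": "Walden-2008", "author": "A. Author"},
}
result = resolve_crossrefs(entries)
print(result["some-chapter"]["publisher"])
```

With the looping entry from this thread, the chain for `Walden-2008` stops immediately because the key is already in `seen`, while the chapter still inherits the publisher.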