jgm / pandoc

Universal markup converter
https://pandoc.org
Other
34.02k stars 3.35k forks

pandoc uses over 50 GB of RAM to convert a 6-page latex file to plain text #8684

Closed Franck-Dernoncourt closed 4 months ago

Franck-Dernoncourt commented 1 year ago

Pandoc 3.1.1 on Windows 10 uses over 50 GB of RAM to convert a 6-page LaTeX file to plain text. It also doesn't seem to terminate (I killed it after 30 minutes). Command used:

pandoc --to=plain --wrap=none eacl2017.tex > out.txt

LaTeX file: https://arxiv.org/e-print/1612.05251 (corresponding PDF: https://arxiv.org/pdf/1612.05251.pdf)
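
For anyone reproducing a report like this, it may help to run pandoc under a heap cap and a timeout so that a runaway conversion fails fast instead of exhausting RAM. The following is a minimal sketch in Python, not something used in this thread: the helper names `bounded_pandoc_command` and `run_pandoc_bounded` are hypothetical, and the 2 GB / 60 s limits are arbitrary assumptions. `+RTS -M<size> -RTS` is the GHC runtime option the pandoc manual suggests for limiting heap size.

```python
import subprocess


def bounded_pandoc_command(args, max_heap="2G"):
    """Build a pandoc command line with a GHC heap limit appended.

    Hypothetical helper for illustration; max_heap is an assumed default.
    """
    return ["pandoc", *args, "+RTS", f"-M{max_heap}", "-RTS"]


def run_pandoc_bounded(args, max_heap="2G", timeout_s=60):
    """Run pandoc with a heap cap and a wall-clock timeout.

    subprocess.run kills the child and raises subprocess.TimeoutExpired
    if the timeout (an assumed 60 s by default) is exceeded.
    """
    cmd = bounded_pandoc_command(args, max_heap)
    return subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
```

With pandoc on the PATH, the command from this report could be tried as `run_pandoc_bounded(["--to=plain", "--wrap=none", "eacl2017.tex", "-o", "out.txt"])`.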

jgm commented 1 year ago

When I download that URL I get a binary file, not a LaTeX file. That is probably the cause of the problem: pandoc is trying to parse a binary file as LaTeX.

jgm commented 1 year ago

Ah, I see that it's a gzipped tex file.
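
This kind of mix-up can be caught before pandoc ever sees the file. A minimal sketch in Python, assuming the download was saved locally (the helper name `read_maybe_gzipped` is hypothetical and not part of pandoc): gzip streams begin with the two magic bytes `0x1f 0x8b`, so checking them distinguishes a compressed download from plain LaTeX source.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream (RFC 1952)


def read_maybe_gzipped(path):
    """Return the file's bytes, decompressing first if it is gzipped.

    Hypothetical helper for illustration; not part of pandoc.
    """
    with open(path, "rb") as f:
        data = f.read()
    if data[:2] == GZIP_MAGIC:
        return gzip.decompress(data)
    return data
```

If the decompressed bytes turn out to be an archive rather than a single `.tex` file, further unpacking would be needed; this sketch does not handle that case.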

jgm commented 1 year ago

After unzipping, I get a tex file, but it is quite unusual. It has content before the \documentclass command, and pdflatex immediately raises an error when I try to process it. Pandoc fails with this error:

Error at "8684.tex" (line 300, column 3):
unexpected #0
{ #0 'before.all :=
  ^

Can you directly upload your input eacl2017.tex? I have a feeling this is not it.

tarleb commented 4 months ago

This appears stale, thus closing.