google-research / arxiv-latex-cleaner

arXiv LaTeX Cleaner: Easily clean the LaTeX code of your paper to submit to arXiv
Apache License 2.0

--commands_to_delete hangs forever #63

Closed odedp closed 2 years ago

odedp commented 2 years ago

Here's a minimal example. If my source file includes:

\todo1{
\begin{figure}
\caption{\todo2{\emph{problem}}}
\end{figure}
}

When running:

python3.8 -m arxiv_latex_cleaner --commands_to_delete todo1 todo2 --verbose sources

It seems to hang forever on the above file. In my attempts, removing any of the todo1, todo2, figure, or emph seems to make the problem go away...

jponttuset commented 2 years ago

Thanks for reporting the problem, @odedp. I'd really appreciate it if someone could find the time to debug this issue; otherwise I'll try to get to it at some point.

bradknox commented 2 years ago

I'm seeing this issue too with my own paper. Has anyone found a workaround?

bradknox commented 2 years ago

I did a pseudo-binary search to find the problematic command and manually removed it.

I played around with just that command, trying to find a minimal example. I found that this version allows arxiv_latex_cleaner to complete but slows it down by 10x or so:

\test{Alternative: $\Psi \{\psi_{\pi_{SF}}\}$}

And this version does appear to make it hang indefinitely:

\test{Alternative: $\Psi \cup \{\psi_{\pi_{SF}}\}$}

I'm running python3 -m arxiv_latex_cleaner <folder_name> --commands_to_delete test with Python 3.8.2.

dpuleri commented 2 years ago

I'm able to reproduce the issue with the --commands_only_to_delete option when I have a custom command around an equation environment.

Something like this, where test is the custom command.

\test{
\begin{equation}\label{eq:conversion}
    \begin{split}
         a &= \frac{b}{\mathrm{c_{t^2}}}\\
         c &= d
    \end{split}
\end{equation}.
}

If I make the equation simpler by reducing the levels of nested curly braces, then the code doesn't cause a hang. So, to me it seems like it may have something to do with deep levels of nesting within the custom command?
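One way to check the nesting hypothesis is to measure how deeply the curly braces nest inside a snippet. The helper below is a hypothetical sketch, not part of arxiv_latex_cleaner; the name max_brace_depth and the two sample strings are made up for illustration, and the function counts every brace character, escaped or not, which is an assumption of this sketch.

```python
def max_brace_depth(snippet: str) -> int:
    """Return the deepest level of curly-brace nesting in `snippet`."""
    depth = 0
    deepest = 0
    for char in snippet:
        if char == "{":
            depth += 1
            deepest = max(deepest, depth)
        elif char == "}":
            # Clamp at zero so a stray closing brace cannot go negative.
            depth = max(depth - 1, 0)
    return deepest


# Hypothetical one-line variants of the equation above.
simple = r"\test{a &= \frac{b}{c}}"
nested = r"\test{a &= \frac{b}{\mathrm{c_{t^2}}}}"
print(max_brace_depth(simple), max_brace_depth(nested))
```

Running it over the arguments of the commands being deleted would show which ones nest more deeply than the others, which may help narrow down the command that triggers the hang.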

joellindegger commented 2 years ago

Hi all

I ran into this problem as well. After some debugging I found the following:

  1. The code hangs on Line 114 of arxiv_latex_cleaner.py when using the regex pattern built on Line 106.
  2. The regex pattern built on Line 106 [1] only matches nested LaTeX commands up to a depth of 3 and fails to match commands nested to depth 4 or higher. That is, even if one waits for the routine to finish, it fails to match the more deeply nested commands. To double-check, I tried the pattern on regex101.
  3. Correct behavior can be achieved with a recursive regex pattern [2]; see the demo on regex101, based on a Stack Overflow answer.
  4. Unfortunately, Python's re module does not support recursive patterns. However, the third-party regex module is a drop-in replacement for re and implements the recursive subroutine used in this pattern.
  5. I have implemented a bugfix based on this pattern with the regex module, which seems to work correctly and does not hang.

If @jponttuset can confirm that the additional external dependency on the regex module is acceptable, I am happy to create a pull request for this (very simple) bugfix.

[1] Line 106 of arxiv_latex_cleaner.py: base_pattern = r'\\' + command + r'{(?:[^}{]+|{(?:[^}{]+|{[^}{]*})*})*}' fails (demo on regex101).

[2] Correct pattern: base_pattern = r'\\' + command + r'\{((?:[^{}]+|\{(?1)\})*)\}' works (demo on regex101, based on a Stack Overflow answer).
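The depth limit of the pattern in [1] can be checked on short inputs with Python's built-in re module. The sketch below is illustrative only: the command name todo and the helper strip_command are hypothetical, and the brace-counting function is a standard-library stand-in for the recursive pattern in [2], not the fix proposed for arxiv_latex_cleaner.

```python
import re

command = "todo"  # hypothetical command name for this sketch

# Fixed-depth pattern quoted in [1]: three hard-coded levels of braces.
fixed_depth = re.compile(
    r"\\" + command + r"{(?:[^}{]+|{(?:[^}{]+|{[^}{]*})*})*}"
)

depth_3 = r"\todo{a{b{c}}}"
depth_4 = r"\todo{a{b{c{d}}}}"

print(bool(fixed_depth.search(depth_3)))  # True
print(bool(fixed_depth.search(depth_4)))  # False


def strip_command(text: str, command: str) -> str:
    """Delete each use of `command` with its braced argument, at any depth.

    Assumption of this sketch: every brace counts, escaped or not.
    """
    opener = "\\" + command + "{"
    result = []
    pos = 0
    while True:
        start = text.find(opener, pos)
        if start == -1:
            result.append(text[pos:])
            return "".join(result)
        result.append(text[pos:start])
        depth = 0
        end = start + len(opener) - 1  # index of the opening brace
        while end < len(text):
            if text[end] == "{":
                depth += 1
            elif text[end] == "}":
                depth -= 1
                if depth == 0:
                    break
            end += 1
        if depth != 0:
            # Unbalanced braces: keep the remainder untouched.
            result.append(text[start:])
            return "".join(result)
        pos = end + 1


print(strip_command(r"x \todo{a{b{c{d}}}} y", "todo"))
```

The brace counter walks the text once, so it cannot backtrack the way nested quantifiers can. It is a different technique from the recursive regex and would need extra care for escaped braces such as \{ before being used on real LaTeX.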

jponttuset commented 2 years ago

Thanks so much for the investigation, @joellindegger! Adding regex is perfectly fine; it'd be great if you could send a PR.