google-research / arxiv-latex-cleaner

arXiv LaTeX Cleaner: Easily clean the LaTeX code of your paper to submit to arXiv
Apache License 2.0
5.04k stars 318 forks

Script slow with many files #70

mseitzer closed this issue 1 year ago

mseitzer commented 1 year ago

Hi,

When using the cleaner on a folder with many files (~1000), the script appeared to hang. It was in fact still processing, just very slowly.

I was able to track down the slowness to this line:

https://github.com/google-research/arxiv-latex-cleaner/blob/ea9d6db134963c2396d00ca0f5112bdd39cb763c/arxiv_latex_cleaner/arxiv_latex_cleaner.py#L76

When _list_all_files passes the list of files to _remove_pattern, _keep_pattern is called once per file, but each time with the full list of files as the haystack argument. Every call therefore scans all n files, so _remove_pattern has quadratic complexity when it should be linear.

Changing the line in question to

      if item not in _keep_pattern([item], patterns_to_remove)

fixed the problem.
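
To illustrate the difference, here is a minimal sketch of the two variants. It is a simplified stand-in rather than the project's actual code: the function bodies are reconstructed from the description above, the standard `re` module replaces `regex` so the snippet is self-contained, and the file names and patterns are made up.

```python
import re


def keep_pattern(haystack, patterns_to_keep):
    """Return the items of haystack that match at least one pattern."""
    return [
        item for item in haystack
        if any(re.findall(pattern, item) for pattern in patterns_to_keep)
    ]


def remove_pattern_quadratic(haystack, patterns_to_remove):
    # keep_pattern rescans the whole haystack for every item: O(n^2) matches.
    return [
        item for item in haystack
        if item not in keep_pattern(haystack, patterns_to_remove)
    ]


def remove_pattern_linear(haystack, patterns_to_remove):
    # keep_pattern only sees the current item: O(n) matches overall.
    return [
        item for item in haystack
        if item not in keep_pattern([item], patterns_to_remove)
    ]


# Hypothetical input: 50 figures plus two auxiliary files to drop.
files = [f"figures/plot_{i}.pdf" for i in range(50)] + ["main.aux", "main.log"]
patterns = [r"\.aux$", r"\.log$"]

cleaned = remove_pattern_linear(files, patterns)
```

Both variants return the same list; only the number of regex evaluations differs.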

However, even with quadratic complexity, this operation should not be so slow with just 1000 files. I suspect the call regex.findall(rem, item) in _keep_pattern is an additional cause of the slowness, because it has to compile the search pattern on each invocation (a comparatively slow step in regex handling). It might be worthwhile to compile each pattern into a regex object only once, and to change _keep_pattern and _remove_pattern to accept regex objects directly instead of string patterns.
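
A rough sketch of that idea follows, assuming hypothetical helper names (`compile_patterns`, `keep_compiled`, `remove_compiled`) and using the standard `re` module instead of `regex`. Whether compilation is really the bottleneck is unverified: `re` caches recently compiled patterns, so the gain should be measured rather than assumed.

```python
import re


def compile_patterns(patterns):
    """Compile each string pattern once, up front (hypothetical helper)."""
    return [re.compile(pattern) for pattern in patterns]


def keep_compiled(haystack, compiled_patterns):
    """Keep the items that match at least one precompiled pattern."""
    return [
        item for item in haystack
        if any(p.search(item) for p in compiled_patterns)
    ]


def remove_compiled(haystack, compiled_patterns):
    """Drop the items that match at least one precompiled pattern."""
    # Each item is tested directly, so the pass stays linear in len(haystack).
    return [
        item for item in haystack
        if not any(p.search(item) for p in compiled_patterns)
    ]


# Hypothetical usage: compile once, then reuse for every file list.
compiled = compile_patterns([r"\.aux$", r"\.log$"])
remaining = remove_compiled(["main.tex", "main.aux", "refs.bib", "main.log"], compiled)
```

Using `search` instead of `findall` only asks whether a match exists, which is all the filter needs.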

jponttuset commented 1 year ago

Thanks @mseitzer for the research! I'd appreciate it if you could find the time to send a PR.