Hi,
using the cleaner on a folder with many files (~1000), the script appeared to hang. It was actually still processing, just very slowly.
I was able to track down the slowness to this line:
https://github.com/google-research/arxiv-latex-cleaner/blob/ea9d6db134963c2396d00ca0f5112bdd39cb763c/arxiv_latex_cleaner/arxiv_latex_cleaner.py#L76
When passing the list of files to _remove_pattern from _list_all_files, _keep_pattern is called once per file, but with the full list of files as the haystack argument. Thus, _remove_pattern has quadratic complexity when it should have linear complexity.
Changing the line in question to
if item not in _keep_pattern([item], patterns_to_remove)
fixed the problem.
However, even with quadratic complexity, this operation should not be so slow with just 1000 files. I suspect the regex call regex.findall(rem, item) in _keep_pattern is an additional cause of the slowness, because it has to compile the search pattern on each invocation (a slow operation in regex parsing). It might be worthwhile to compile each pattern into a regex object only once, and to change _keep_pattern and _remove_pattern to accept regex objects directly instead of string patterns.
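One possible shape for the precompilation idea is sketched below. The names compile_patterns, keep_pattern and remove_pattern are hypothetical and not taken from the project, and the sketch uses the standard re module as a stand-in for regex. It also uses search instead of findall, which should give the same yes/no answer without building a list of matches.

```python
import re


def compile_patterns(patterns):
    """Compile each string pattern once, up front (hypothetical helper)."""
    return [re.compile(pattern) for pattern in patterns]


def keep_pattern(haystack, compiled_patterns):
    """Keep the items that match at least one precompiled pattern."""
    return [
        item for item in haystack
        if any(rx.search(item) for rx in compiled_patterns)
    ]


def remove_pattern(haystack, compiled_patterns):
    """Drop the items that match at least one precompiled pattern."""
    return [
        item for item in haystack
        if not any(rx.search(item) for rx in compiled_patterns)
    ]


# Hypothetical inputs for illustration only.
compiled = compile_patterns([r"\.aux$", r"\.log$"])
files = ["main.tex", "main.aux", "fig/plot.pdf", "notes.log"]
kept = remove_pattern(files, compiled)
```

Note that Python's re module keeps an internal cache of recently compiled patterns, so the gain from precompiling may be smaller than expected. Timing both versions on a real folder would show whether it matters here.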