python / cpython

The Python programming language
https://www.python.org

Improve performances of `fnmatch.translate` #122288

Open picnixz opened 1 month ago

picnixz commented 1 month ago

Feature or enhancement

Proposal:

I implemented fnmatch.translate and fnmatch.filter in C in #121446, but the performance gains for fnmatch.filter are less pronounced than those for fnmatch.translate. While I believe that the 2x improvement brought by the C implementation of fnmatch.translate is worthwhile, it comes at a high maintenance cost. Instead, I translated (no pun intended) the implementation back into Python and obtained the following timings (PGO build):

+------------------------------------------+-----------------------+-----------------------+
| Benchmark                                | fnmatch-translate-ref | fnmatch-translate-py  |
+==========================================+=======================+=======================+
| abc/[!]b-ac-z9-1]/def/\*?/*/**c/?*[][!]  | 6.09 us               | 3.99 us: 1.53x faster |
+------------------------------------------+-----------------------+-----------------------+
| !abc/[!]b-ac-z9-1]/def/\*?/*/**c/?*[][!] | 6.39 us               | 4.07 us: 1.57x faster |
+------------------------------------------+-----------------------+-----------------------+
| a**?**cd**?**??k***                      | 2.24 us               | 1.51 us: 1.49x faster |
+------------------------------------------+-----------------------+-----------------------+
| a/**/b/**/c                              | 1.97 us               | 1.12 us: 1.76x faster |
+------------------------------------------+-----------------------+-----------------------+
| man/man1/bash.1                          | 3.00 us               | 1.21 us: 2.48x faster |
+------------------------------------------+-----------------------+-----------------------+
| a*b*c*d*e*f*g*h*i*j*k*l*m*n*o**          | 5.40 us               | 3.33 us: 1.62x faster |
+------------------------------------------+-----------------------+-----------------------+
| Geometric mean                           | (ref)                 | 1.71x faster          |
+------------------------------------------+-----------------------+-----------------------+
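The timings above come from a pyperf-style comparison under a PGO build. As a rough, hypothetical way to reproduce per-pattern numbers with only the stdlib, one could time `fnmatch.translate` with `timeit` (the pattern list and iteration count below are illustrative, not the exact benchmark harness used here):

```python
import timeit
from fnmatch import translate

# A few of the patterns from the benchmark table above.
patterns = [
    "a**?**cd**?**??k***",
    "a/**/b/**/c",
    "man/man1/bash.1",
    "a*b*c*d*e*f*g*h*i*j*k*l*m*n*o**",
]

for pat in patterns:
    n = 10_000
    total = timeit.timeit(lambda: translate(pat), number=n)
    # Average time per call, in microseconds.
    print(f"{pat:35s} {total / n * 1e6:6.2f} us")
```

Absolute numbers will differ from the table depending on interpreter build and machine; only the relative ordering is meaningful.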

It's not as good as the benchmarks reported in https://github.com/python/cpython/issues/121445, but I think this could still be considered. Note that the improvements come from a new (but equivalent) algorithm in the translation phase. Currently, the translation algorithm does:

Our (Barney's and my) implementation instead does the following:

Has this already been discussed elsewhere?

No response given

Links to previous discussion of this feature:

No response

Linked PRs

barneygale commented 1 month ago
  • Instead of having many chunks with 1 or 2 characters, I try to have as many chunks with strings that are already formatted, e.g., ["abc", "*"] instead of ["a", "b", "c", SENTINEL].

The current implementation does this to minimise string concatenation, i.e. it defers join()-ing until it's unavoidable. This should be faster, so I'm surprised you're getting different results! It's quite possible I've misunderstood your patch, though.
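The chunking difference quoted above can be sketched with two hypothetical helpers (simplified to literals and `*` only; the real translator also handles `?` and bracket expressions):

```python
import re

def chunks_per_char(pat):
    # Reference-style: one list entry per literal character.
    res = []
    for c in pat:
        if c == '*':
            res.append('*')  # stands in for the STAR sentinel
        else:
            res.append(re.escape(c))
    return res

def chunks_batched(pat):
    # New-style: batch consecutive literals and escape them in one call.
    res, literal = [], []
    for c in pat:
        if c == '*':
            if literal:
                res.append(re.escape(''.join(literal)))
                literal.clear()
            res.append('*')
        else:
            literal.append(c)
    if literal:
        res.append(re.escape(''.join(literal)))
    return res

print(chunks_per_char('abc*'))  # ['a', 'b', 'c', '*']
print(chunks_batched('abc*'))   # ['abc', '*']
```

Both variants produce the same regex text once joined; they differ only in how many list entries and `re.escape` calls are made.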

picnixz commented 1 month ago

Thanks for noticing this. I think there are two things to consider:

Calling re.escape(c) at every loop iteration (most of the time we don't have *?[ but rather "normal" characters to escape) is probably more costly: each call involves a call to isinstance and a call to str.translate. Looking at the implementation of str.translate, each call allocates a writer on the heap, which slows down the algorithm at every iteration. In particular, if I need to call res.append(re.escape(c)) many times in succession (which is what happens in practice), then it's probably faster to do res.append(re.escape(''.join(x))) or res.extend(re.escape(c) for c in x).

My benchmarks show that res.append(re.escape(''.join(x))) is faster than res.extend(re.escape(c) for c in x). I suspect this is because of the per-call overhead of re.escape.
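A minimal micro-benchmark of the two variants (the sample text and iteration count are illustrative assumptions, not the benchmark actually run):

```python
import re
import timeit

x = list('some_literal_text.with-punct!')

def batched():
    # One re.escape call over the joined pending characters.
    res = []
    res.append(re.escape(''.join(x)))
    return res

def per_char():
    # One re.escape call per pending character.
    res = []
    res.extend(re.escape(c) for c in x)
    return res

# Both variants yield the same regex text once joined.
assert ''.join(batched()) == ''.join(per_char())

for f in (batched, per_char):
    n = 100_000
    print(f.__name__, timeit.timeit(f, number=n) / n * 1e6, 'us')
```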

Now, there is one case where the new algorithm might be slower: when there is only a single character in the pending list of characters to escape, namely x = [c]. You can see that the improvements are less pronounced for the case a**?**cd**?**??k***, but it is still faster. I've added another benchmark to illustrate an extreme case:

+---------------------------------+-----------------------+-----------------------+
| Benchmark                       | fnmatch-translate-ref | fnmatch-translate-py  |
+=================================+=======================+=======================+
| a*b*c*d*e*f*g*h*i*j*k*l*m*n*o** | 5.09 us               | 4.95 us: 1.03x faster |
+---------------------------------+-----------------------+-----------------------+

Here the performances are almost identical, which is what I would have expected. I've also added a comment:

# Handling '?' one at a time seems to be more efficient
# even if there are consecutive '?' that could have
# been written directly.

The reason it's more efficient is that the fast path is actually rarer than the slow path, so looping over consecutive '?' is heuristically less efficient. The same analysis could apply to consecutive *, so I compared two implementations: the one in the PR, and one where I keep the treatment of * by not advancing the index in the if c == '*' branch but guarding the add(STAR) with another if. Benchmarks suggest that using a while to skip over successive * is slightly faster in the other cases (the case a*b*c*d*e*f*g*h*i*j*k*l*m*n*o** is obviously slower with a while loop, but not by much):

With an inner while-loop to skip over consecutive '*' (PR implementation)

+------------------------------------------+-----------------------+-----------------------+
| Benchmark                                | fnmatch-translate-ref | fnmatch-translate-py  |
+==========================================+=======================+=======================+
| abc/[!]b-ac-z9-1]/def/\*?/*/**c/?*[][!]  | 6.06 us               | 4.92 us: 1.23x faster |
+------------------------------------------+-----------------------+-----------------------+
| !abc/[!]b-ac-z9-1]/def/\*?/*/**c/?*[][!] | 6.37 us               | 5.06 us: 1.26x faster |
+------------------------------------------+-----------------------+-----------------------+
| a**?**cd**?**??k***                      | 2.21 us               | 2.00 us: 1.10x faster |
+------------------------------------------+-----------------------+-----------------------+
| a/**/b/**/c                              | 1.99 us               | 1.56 us: 1.28x faster |
+------------------------------------------+-----------------------+-----------------------+
| man/man1/bash.1                          | 3.04 us               | 1.50 us: 2.03x faster |
+------------------------------------------+-----------------------+-----------------------+
| a*b*c*d*e*f*g*h*i*j*k*l*m*n*o**          | 5.40 us               | 5.25 us: 1.03x faster |
+------------------------------------------+-----------------------+-----------------------+
| Geometric mean                           | (ref)                 | 1.29x faster          |
+------------------------------------------+-----------------------+-----------------------+

With a single if to skip over consecutive '*' (same as the ref implementation)

+------------------------------------------+-----------------------+-----------------------+
| Benchmark                                | fnmatch-translate-ref | fnmatch-translate-py  |
+==========================================+=======================+=======================+
| abc/[!]b-ac-z9-1]/def/\*?/*/**c/?*[][!]  | 6.09 us               | 5.13 us: 1.19x faster |
+------------------------------------------+-----------------------+-----------------------+
| !abc/[!]b-ac-z9-1]/def/\*?/*/**c/?*[][!] | 6.36 us               | 5.23 us: 1.22x faster |
+------------------------------------------+-----------------------+-----------------------+
| a/**/b/**/c                              | 2.01 us               | 1.64 us: 1.22x faster |
+------------------------------------------+-----------------------+-----------------------+
| man/man1/bash.1                          | 3.10 us               | 1.48 us: 2.09x faster |
+------------------------------------------+-----------------------+-----------------------+
| a*b*c*d*e*f*g*h*i*j*k*l*m*n*o**          | 5.44 us               | 5.14 us: 1.06x faster |
+------------------------------------------+-----------------------+-----------------------+
| Geometric mean                           | (ref)                 | 1.26x faster          |
+------------------------------------------+-----------------------+-----------------------+

Benchmark hidden because not significant (1): a**?**cd**?**??k***

One reason is that an inner while saves a check on whether we need to push the pending characters, and lets us jump faster to the next character that is not *.
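The star-skipping step described above can be sketched as a hypothetical helper (the real PR inlines this in the main translation loop):

```python
def skip_stars(pat, i):
    # After emitting a single STAR for pat[i] == '*', advance past any
    # consecutive '*' so the main loop resumes at the next non-star
    # character without re-checking the pending-literals list.
    assert pat[i] == '*'
    i += 1
    while i < len(pat) and pat[i] == '*':
        i += 1
    return i

print(skip_stars('a***b', 1))  # → 4 (the index of 'b')
```

A run of `*` collapses to a single STAR token either way; the while loop just reaches the end of the run with fewer per-character branches.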

barneygale commented 1 month ago

Thank you very much for the explanation!

Q: could the cost of re.escape() be ameliorated with an lru_cache() instead? e.g.:

diff --git a/Lib/fnmatch.py b/Lib/fnmatch.py
index 73acb1fe8d4..dc9bf122e39 100644
--- a/Lib/fnmatch.py
+++ b/Lib/fnmatch.py
@@ -149,11 +149,14 @@ def _translate(pat, STAR, QUESTION_MARK):
                         stuff = '\\' + stuff
                     add(f'[{stuff}]')
         else:
-            add(re.escape(c))
+            add(_escape_char(c))
     assert i == n
     return res

+_escape_char = functools.lru_cache(maxsize=32768)(re.escape)
+
+
 def _join_translated_parts(inp, STAR):
     # Deal with STARs.
     res = []
picnixz commented 4 weeks ago

It could be; I didn't check this alternative. We could also make it faster with a simple dictionary lookup, because the implementation is private and only strings are supposed to be passed. Alternatively, we could inline the escaping directly. I'll try your suggestion today.

picnixz commented 4 weeks ago

Huh, so your approach is actually the best (the cache is cleared before each benchmark, but since there are many iterations for each row, the first call, which caches the result, is probably negligible).

+------------------------------------------+-----------------------+-----------------------+
| Benchmark                                | fnmatch-translate-ref | fnmatch-translate-py  |
+==========================================+=======================+=======================+
| abc/[!]b-ac-z9-1]/def/\*?/*/**c/?*[][!]  | 6.09 us               | 3.99 us: 1.53x faster |
+------------------------------------------+-----------------------+-----------------------+
| !abc/[!]b-ac-z9-1]/def/\*?/*/**c/?*[][!] | 6.39 us               | 4.07 us: 1.57x faster |
+------------------------------------------+-----------------------+-----------------------+
| a**?**cd**?**??k***                      | 2.24 us               | 1.51 us: 1.49x faster |
+------------------------------------------+-----------------------+-----------------------+
| a/**/b/**/c                              | 1.97 us               | 1.12 us: 1.76x faster |
+------------------------------------------+-----------------------+-----------------------+
| man/man1/bash.1                          | 3.00 us               | 1.21 us: 2.48x faster |
+------------------------------------------+-----------------------+-----------------------+
| a*b*c*d*e*f*g*h*i*j*k*l*m*n*o**          | 5.40 us               | 3.33 us: 1.62x faster |
+------------------------------------------+-----------------------+-----------------------+
| Geometric mean                           | (ref)                 | 1.71x faster          |
+------------------------------------------+-----------------------+-----------------------+

There are significant performance gains for each run. I totally forgot about the possibility of caching re.escape. ~I'll try to reduce the memory cost even further by simply inlining re.escape.~ (No need for that much micro-optimization; it would be hard to maintain, so let's keep your solution.)

picnixz commented 4 weeks ago

I'll also update the C implementation. I discovered two other optimizations that I could apply on the C side, but I think the C implementation is not worth it unless you want a 3x improvement (here we are at roughly 1.75x, but I think we can go up to 2.8x-3.5x in C).

EDIT: We eventually managed to gain a factor-6 improvement, but since the maintenance cost is high (I would probably be the one responsible for it, and I confess I'd prefer maintaining the Python implementation), I decided to give up on the C implementation.