Open 179a554f-eaff-458f-ab05-8166251cacbd opened 9 years ago
In Modules/sre_lib.h on line 882 [1], a block of memory is allocated. If SRE(match) function later terminates abruptly, either because of a signal or because subsequent memory allocation fails, this block is never released.
[1] https://hg.python.org/cpython/file/c89f7c34e356/Modules/sre_lib.h#l882
There is maybe a bug. Can you show an example of regex and a text where the memory leak occurs? You can use the tracemalloc module to check if there is a memory leak. Or use sys.getcounts() if you compiled Python in debug mode.
sre_lib.h is very complex, it uses the C instruction "goto" with regex bytecodes.
"DO_JUMP(JUMP_REPEAT, jump_repeat, ctx->pattern+ctx->pattern[0]);" calls "goto entrace" to execute following bytecodes, but later it comes back after DO_JUMP() with the "jump_repeat:" label:
https://hg.python.org/cpython/file/c89f7c34e356/Modules/sre_lib.h#l1180
Memory leak only happens if match operation terminates abruptly, e.g. because of SIGINT. In this case, DO_JUMP doesn't come back.
Tracemalloc code:
import re
import signal
import tracemalloc
class AlarmError(Exception):
pass
def handle_alarm(signal, frame):
raise AlarmError
signal.signal(signal.SIGALRM, handle_alarm)
s1 = tracemalloc.take_snapshot()
for _ in range(20):
try:
signal.alarm(1)
re.match('(?:a|a|(?=b)){1000}', 'a'*999)
raise RuntimeError
except AlarmError:
pass
s2 = tracemalloc.take_snapshot()
res = s2.compare_to(s1, 'lineno')
for e in res[:10]:
print(e)
For me, it shows almost 3 MiB allocated in re.py.
May be this patch helps.
Oh cool, you wrote a script to reproduce the issue! And Serhiy wrote a patch, great! Great job guys.
sre_clean_repeat_data.patch looks good to me.
@Serhiy: Can you try the example to ensure that it fixes the issue? If yes, go ahead!
This patch doesn't fix the issue. The problem is that the list starting with state->repeat doesn't necessarily contains all repeat contexts that are allocated. Indeed, here [1] and here [2] repeat contexts are temporarily removed from the list. If the match procedure terminates abruptly, they are not added back.
[1] https://hg.python.org/cpython/file/c89f7c34e356/Modules/sre_lib.h#l963 [2] https://hg.python.org/cpython/file/c89f7c34e356/Modules/sre_lib.h#l1002
Try to allocate SRE_REPEAT on state's stack, the performance has not changed significantly.
It passes the other tests, except this one (test_stack_overflow): https://github.com/python/cpython/blob/v3.8.0a1/Lib/test/test_re.py#L1225-L1230
I'll try to fix bpo-35859, bpo-9134 first.
PR11926 (closed) tried to allocate SRE_REPEAT on state's stack. It's feasible, but messes up the code in sre_lib.h, and reduces performance a little (roughly 6% slower), so I gave up this solution.
PR12160 uses a memory pool, this solution doesn't mess up the code.
🔸For infrequent alloc/free scenes, it adds a small overhead:
s = 'a'
p = re.compile(r'(a)?')
p.match(s) # <- measure this statement
before patch: 316 ns +- 19 ns after patch: 324 ns +- 11 ns, 2.5% slower. (by perf module)
🔸For very frequent alloc/free scenes, it brings a speedup:
s = 200_000_000 * 'a'
p = re.compile(r'.*?(?:bb)+')
p.match(s) # <- measure this statement
before patch: 7.16 sec after patch: 5.82 sec, 18.7% faster. (best of 10 tests)
🔸I tested in a real case that use 17 patterns to process 100MB data:
before patch: 27.09 sec after patch: 26.78 sec, 1.1% faster. (best of 4 tests)
My PR methods are suboptimal, so I closed them.
The number of REPEAT can be counted when compiling a pattern, and allocate a SRE_REPEAT
array in SRE_STATE
(with that number items).
It seem at any time, a REPEAT will only have one in active, so a SRE_REPEAT
array is fine.
regex module does like this:
https://github.com/mrabarnett/mrab-regex/blob/hg/regex_3/_regex.c#L18287-L18288
Can the number of REPEAT be placed in SRE_OP_INFO
?
And add a field to SRE_OP_REPEAT
to indicate the index of this REPEAT.
This looks promising. Please, go ahead! You are free to add any fields to any opcodes. It may break some third-party code which generates compiled patterns from a sequence of opcodes, it the stability of this interface was not promised. And they will be broken in any case due to reorganizing of internal code (bpo-47152).
New changeset 6e3eee5c11b539e9aab39cff783acf57838c355a by Ma Lin in branch 'main': bpo-23689: re module, fix memory leak when a match is terminated by a signal or memory allocation failure (GH-32283) https://github.com/python/cpython/commit/6e3eee5c11b539e9aab39cff783acf57838c355a
Thank you Ma Lin for all your work.
The fix changes interfaces of some internal functions which can be used in third-party code, and the bug occurs only in special circumstances, so it is not practical to backport it.
PR https://github.com/python/cpython/pull/93882 reverts the commit that resolved this as it causes a major performance and usability regression: https://github.com/python/cpython/issues/91404.
Both PRs also add private method _fail_after()
in the re.Pattern
class which allows to simulate user interruption. After calling _fail_after(n, E)
the next matching operation with this pattern will raise exception E after processing n opcodes. This allows stable reproduction of memory leaks.
Example from https://github.com/python/cpython/issues/91404#issuecomment-1147876699 which could not finish in a hour with previous attempt (#32283), now return the result instantly.
re.findall(r'(?:(?:/[^\n]*)?\n)*/', '../utils/common":3}],15:[function(t,e,a){"use strict";e.exports=function(){this.input=null,this.next_in=0,this.avail_in=0,this.total_in=0,this.output=null,this.next_out=0,this.avail_out=0,this.total_out=0,this.msg="",this.state=null,this.data_type=2,this.adler=0}},{}],"/":[function(t,e,a){"use strict";var i={};(0,t("./lib/utils/common").assign)(i,t("./lib/deflate"),t("./lib/inflate"),t("./lib/zlib/constants")),e.exports=i},{"./lib/deflate":1,"./lib/inflate":2,"./lib/utils/common":3,"./lib/zlib/constants":6}]},{},[])("/")});\n')
Example from https://github.com/python/cpython/issues/91404#issuecomment-1148113847 also returns an expected result.
m = re.match(r'((a)?)*', 'aaaaa')
assert m.span(0) == (0, 5)
assert m.span(1) == (5, 5)
@wjssz, could you please take a look at these PRs?
#126843 is not feasible. If a SRE_REPEAT
struct was not registered in state->repstack
, and a signal aborts the matching, then it can't be released.
Maybe #126843 works. But it requires a deeper understanding of the hiberarchy of SRE_REPEAT
and bringing this complexity into the engine code. This may confuse readers of the code.
add_repeat()/remove_repeat()
functions always malloc/free the struct, this may be inefficient.Below is my patch, it uses a doubly-linked list for used pool, a singly-linked list for unused pool. It puts released structs to unused pool, so that these structs can be reused. It's faster when malloc/free structs very frequently.
My patch compare to #126840 (On Windows, MSVC2022, non-PGO release build):
### regex_dna ###
Mean +- std dev: 139 ms +- 1 ms -> 139 ms +- 1 ms: 1.00x slower
Not significant
### regex_effbot ###
Mean +- std dev: 2.45 ms +- 0.04 ms -> 2.20 ms +- 0.01 ms: 1.11x faster
Significant (t=47.54)
### regex_v8 ###
Mean +- std dev: 21.6 ms +- 0.2 ms -> 22.0 ms +- 0.2 ms: 1.02x slower
Not significant
My benchmark (it uses 16 patterns to process 100MB real data):
baseline: 12.104 sec
#126840: 12.767 sec
My patch: 12.597 sec
Both PRs also add private method _fail_after() in the re.Pattern class which allows to simulate user interruption. After calling _fail_after(n, E) the next matching operation with this pattern will raise exception E after processing n opcodes. This allows stable reproduction of memory leaks.
IMO _fail_after()
method is not necessary. It makes the code complex, for situations where it's almost never used again. The code above triggers the memory leak with signal, maybe it can be included in unit-test? But I don't think it's necessary either.
This GitHub account can't upload file, you can save the patch, and apply it to main
branch:
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields: ```python assignee = 'https://github.com/serhiy-storchaka' closed_at =
created_at =
labels = ['expert-regex', '3.11', 'performance']
title = 'Memory leak in Modules/sre_lib.h'
updated_at =
user = 'https://bugs.python.org/abacabadabacaba'
```
bugs.python.org fields:
```python
activity =
actor = 'serhiy.storchaka'
assignee = 'serhiy.storchaka'
closed = True
closed_date =
closer = 'serhiy.storchaka'
components = ['Regular Expressions']
creation =
creator = 'abacabadabacaba'
dependencies = []
files = ['38529']
hgrepos = []
issue_num = 23689
keywords = ['patch']
message_count = 13.0
messages = ['238313', '238320', '238324', '238326', '238328', '238343', '238388', '335889', '337113', '416266', '416269', '416630', '416631']
nosy_count = 7.0
nosy_names = ['vstinner', 'ezio.melotti', 'mrabarnett', 'abacabadabacaba', 'serhiy.storchaka', 'malin', 'alexei.romanov']
pr_nums = ['11926', '12160', '32188', '32223', '32283']
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'resource usage'
url = 'https://bugs.python.org/issue23689'
versions = ['Python 3.11']
```
Linked PRs