The findAll algorithm calls find repeatedly to find the next match, however this may take quadratic time, ex: findAll("aaaa", re"\w+\d|\w") this will call find 4 times and consume the entire string every time. I believe only regex with alternations where a term consumes the whole string while a second term matches part of it are affected.
This PR implements a findAll that always takes linear time. The caveat is it may take linear space, ex: the above example needs to store all the matches for the second term of the alternation until it finally fails to match the first term.
The boundaries are two machine words, so it may take 16x the size of the string. I think this is pretty bad, but it's the only way to guarantee linear time AFAICT.
This could support streams (#14), except it may consume unbounded memory, so no. On a second thought, I think it cannot possibly consume unbounded memory, memory is bounded by the regex len, so it's fixed memory. Captures could take unbounded space in their current form (full parse tree), but they would be bounded once only last capture is returned for every capture group (to be implemented).
The
findAll
algorithm callsfind
repeatedly to find the next match, however this may take quadratic time, ex:findAll("aaaa", re"\w+\d|\w")
this will callfind
4 times and consume the entire string every time. I believe only regex with alternations where a term consumes the whole string while a second term matches part of it are affected.This PR implements a
findAll
that always takes linear time. The caveat is it may take linear space, ex: the above example needs to store all the matches for the second term of the alternation until it finally fails to match the first term.The boundaries are two machine words, so it may take 16x the size of the string. I think this is pretty bad, but it's the only way to guarantee linear time AFAICT.
This could support streams (#14),
except it may consume unbounded memory, so no. On a second thought, I think it cannot possibly consume unbounded memory, memory is bounded by the regex len, so it's fixed memory. Captures could take unbounded space in their current form (full parse tree), but they would be bounded once only last capture is returned for every capture group (to be implemented).