pfalcon / re1.5

re1, the unbloated regexp engine by Russ Cox, elaborated to be useful for real-world applications
BSD 3-Clause "New" or "Revised" License
42 stars 4 forks source link

NUL-char clean #22

Open ampli opened 9 years ago

ampli commented 9 years ago

Form the README:

However, this feature cannot actually be used, because compilecode() is not NUL-char clean!

I propose to change it to be NUL-char clean. To that end, I propose to change its first argument from char *re to Subject *re. This will also allow it to return negative int on errors (as I proposed in issue #18).

If this sounds fine, I will send a pull request.

pfalcon commented 9 years ago

Matchers are NUL-char clean and take size of the input string as a param.

That means that subject string NUL-clean, not that regexp pattern NUL-clean.

ampli commented 9 years ago

The Python3 manual says: Regular expression pattern strings may not contain null bytes, but can specify the null byte using the \number notation, e.g., '\x00'

However, in both Python3 and uPy, \x00 in a string is translated to NUL-char, as expected:

$ python3 -c 's="a\x00b"; print(s[1], s[2])' | cat -v
^@ b
./micropython -c 's="a\x00b"; print(s[1], s[2])' | cat -v
^@ b

So it looks there is a real NUL-char there, and the strings are NUL-char clean, because also b is printed.

But Python3 supports Null-char in regex'es, while uPy doesn't, because compilecode() is not NUL-char clean:

$ python3 -q
>>> import re
>>> re.match("a\x00JUNK$", "ab")
>>> re.match("a\x00JUNK$", "a\x00")
>>> re.match("a\x00JUNK$", "a\x00JUNK")
<_sre.SRE_Match object; span=(0, 6), match='a\x00JUNK'>
>>> 
$

Only the last one matches. Note that Python3 matches the NUL-char in the pattern to the NUl-char in the subject string.

But:

$ ./micropython
Micro Python v1.4.4-96-g115afdb-dirty on 2015-07-15; linux version
>>> import ure as re
>>> re.match("a\x00JUNK$", "ab")
<match num=1>
>>> 
$

So we see that compilecode()'s stopping on NUL-char cause incompatibility with Python3, and is not able at all to search for NUL-char in the subject string - i.e. the feature of NUL-clean Subject string cannot be directly used (can only be matched to a dot).

Since making compilecode() NUl-char clean is easy, and will make uPy regex match compatible to Python3 in this respect, and is also a useful feature (to search binary data), so why not to do it?

pfalcon commented 9 years ago

The README says:

Matchers are NUL-char clean and take size of the input string as a param.

And indeed they're:

$ micropython 
Micro Python v1.4.4-108-g106c4b9 on 2015-07-19; linux version
>>> ure.search("a(.*)c", "a\0\0\0\0c").group(1)
'\x00\x00\x00\x00'

I also thought re1.5 supports quoted hex syntax, but it doesn't. But then adding this feature is certainly more important than making re1.5 nul-in-regex-clean. Note that even PCRE doesn't support NULs in regex, necessitating such patches: https://github.com/micropython/micropython-lib/commit/0373045505defae1d18dc3694e5a7ee8c1a56b33 (PCRE via FFI is alternative regex engine supported by uPy, and one which is used for real-world cases like stdlib). I'm not against making it "better than PCRE", but what about devising way to do generalized assertions for finite-automaton matchers? ;-)

Since making compilecode() NUl-char clean is easy, and will make uPy regex match compatible to Python3 in this respect, and is also a useful feature (to search binary data), so why not to do it?

I'm not against it, but making re1.5 being able to have \0 in regex, at the price of less convenient API, won't make it "better". Adding missing features would. If this lies on your critical path to these new features, let's fix it, but otherwise, I'd downprioritize it until later.

ampli commented 9 years ago

And indeed they're:

$ micropython 
Micro Python v1.4.4-108-g106c4b9 on 2015-07-19; linux version
>> ure.search("a(.*)c", "a\0\0\0\0c").group(1)
'\x00\x00\x00\x00'

the feature of NUL-clean Subject string cannot be directly used (can only be matched to a dot).

Yes, in yPy you can currently only match a dot to NUL-char in the subject string, as opposed to Python3, in which you can match to it a NUL-char in the subject string, as I demonstrated.

$ ./micropython
Micro Python v1.4.4-96-g115afdb-dirty on 2015-07-15; linux version
>>> import ure
>>> ure.match("a\x00b$", "a")
<match num=1>

I the above example, the \x00 is directly translated by uPy (and also Python3) to a NUL-char in the regex, which actually causes copmpilecode() to terminate the pattern processing when seeing it, ignoring the b$. On the other hand, in Python3 this `b$" is not ignored (despite what seem to be implied from the manual), i.e. Python3 doesn't terminate the processing of the pattern when encountering a NUL-char in it (so NUL-char in the pattern can be specified also as "\x00" and not only as "\x00").

but making re1.5 being able to have \0 in regex, at the price of less convenient API, won't make it "better".

It will make it "better" in the aspect of being compatible to what Python3 does in the exact same situation.

micropython/micropython-lib@0373045 html.parser: PCRE cannot handle literal NULs, requires quoted hex repr.

But Python3 actually can - according to the examples I brought that compares the same search in Python3 and uPy (unless I missed something - please indicate). Is not the goal of uPy is to be compatible with Python3?

also thought re1.5 supports quoted hex syntax, but it doesn't. But then adding this feature is certainly more important than making re1.5 nul-in-regex-clean.

I have no problem to add this feature.

but what about devising way to do generalized assertions for finite-automaton matchers? ;-)

Do you mean look-ahead/behind?

BTW, currently re1.5 doesn't implement $ correctly. It implements it like \Z. From the Python3 manual:

'$'
Matches the end of the string or just before the newline at the end of the string

I can fix that and add \Z. I can also add \b and \B.

pfalcon commented 9 years ago

Is not the goal of uPy is to be compatible with Python3?

Yes, but the goal of re1.5 is to be easily reusable, small regex library ;-). It won't be able to support all Python3 features anyway, and between spending next +50 bytes on X or Y features, X and Y should be prioritized. That was the hint - from my outside look, you seem to work on elaborating corner cases, instead of adding something exciting (which was in your original list). But well, if you keen to work on this, please do - all this would need to be done eventually of course, and if you feel like doing t now, well, maybe it's only better ;-).

BTW, currently re1.5 doesn't implement $ correctly. It implements it like \Z.

Yes, that's know TODO from README too:

  • Support for matching flags like case-insensitive, dot matches all, multiline, etc.
  • Support for more assertions like \A, \Z.

So, if you feel like implementing it, please do.

ampli commented 9 years ago

Support for matching flags like case-insensitive, dot matches all, multiline, etc.

To support that, I would like to add a flags argument to the matching functions, but I don't want to add just another argument to the recursive functions.

To that end I need your input on my proposal of a parameter block ( in #18), that is needed also in order to support adding recursive-limit counter and regex loop-prevention mechanism without adding more function arguments (both already implemented in my development code).

ampli commented 9 years ago

More on:

Support for matching flags like case-insensitive, dot matches all, multiline, etc.

A flag parameter to support some of these features may need to be added to compilecode(). But the matching functions needs a flags parameter (starting from their API) for RE_ANCHORED, RE_DOTALL, and maybe more. A parameter block will make it less painful to pass flags to the recursive functions.