oracc / pyoracc

Python tools for working with ORACC
GNU General Public License v3.0
12 stars 10 forks source link

Sanitise regular expressions in lexer #78

Closed ageorgou closed 5 years ago

ageorgou commented 5 years ago

Some of the lexer rules specify their regular expressions using strings which will become invalid in a future Python version (a DeprecationWarning is raised from version 3.6 onwards).

These are:

The simplest way would be to double escape (\\.) the special characters there. Converting to raw strings might not be an option, because these strings also contain unicode escape sequences (\u2019), which may or may not work within raw strings (I haven't tested it). In fact, some of the characters in them don't need to be escaped at all, but we will need to check what the intention for each regex was (and double check that it actually matches the intended thing correctly now...).

rillian commented 5 years ago

Aha, I think I was getting confused among the different Python versions.

It seems the \u escapes wouldn't actually be a problem in python 3, where the re module will expand them, so including them in a raw string is fine. Unfortunately this won't work for python 2. One possibility is to use the external regex package on python 2:

>>> import re, regex

>>> re.search('\u0041', 'ABC')
<no match>

>>> regex.search('\u0041', 'ABC')
<regex.Match object; span=(0, 1), match='A'>

I did think there were a suspicious number of backslash escapes in some of the patterns. Glad it's not just me!

ageorgou commented 5 years ago

Unfortunately, the ply library which pyoracc uses works on top of the standard library re module, and I'm not sure whether we can subsitute regex for that. It's worth keeping in mind though!

rillian commented 5 years ago

I believe it's possible to hook module loading to substitute a compatible dependency. That's a bit fancy though, but might me acceptable since it could be specific to python2, which is near the end of its support window.

However, it turns out Python will concatenate r'' and u'' strings without complaint, so I think a better approach to the invalid escape sequence warnings is to just split the unicode portions of the pattern strings into a separate literal and use raw strings for the re sequences. I have a patch for that.

rillian commented 5 years ago

I also went through some of the other escape sequences. Marking the strings as raw silenced the warnings, but they're still unnecessary. In particular, only ^$.*+? and brackets need escaping in normal patterns; other punctuation is fine. Inside the [] only ] and - need escaping.