Closed ageorgou closed 5 years ago
Aha, I think I was getting confused among the different Python versions.
It seems the \u
escapes wouldn't actually be a problem in python 3, where the re
module will expand them, so including them in a raw string is fine. Unfortunately this won't work for python 2. One possibility is to use the external regex package on python 2:
>>> import re, regex
>>> re.search('\u0041', 'ABC')
<no match>
>>> regex.search('\u0041', 'ABC')
<regex.Match object; span=(0, 1), match='A'>
I did think there were a suspicious number of backslash escapes in some of the patterns. Glad it's not just me!
Unfortunately, the ply
library which pyoracc uses works on top of the standard library re
module, and I'm not sure whether we can subsitute regex
for that. It's worth keeping in mind though!
I believe it's possible to hook module loading to substitute a compatible dependency. That's a bit fancy though, but might me acceptable since it could be specific to python2, which is near the end of its support window.
However, it turns out Python will concatenate r''
and u''
strings without complaint, so I think a better approach to the invalid escape sequence warnings is to just split the unicode portions of the pattern strings into a separate literal and use raw strings for the re
sequences. I have a patch for that.
I also went through some of the other escape sequences. Marking the strings as raw silenced the warnings, but they're still unnecessary. In particular, only ^$.*+?
and brackets need escaping in normal patterns; other punctuation is fine. Inside the []
only ]
and -
need escaping.
Some of the lexer rules specify their regular expressions using strings which will become invalid in a future Python version (a
DeprecationWarning
is raised from version 3.6 onwards).These are:
t_ID
t_transctrl_ID
terminates_para
The simplest way would be to double escape (
\\.
) the special characters there. Converting to raw strings might not be an option, because these strings also contain unicode escape sequences (\u2019
), which may or may not work within raw strings (I haven't tested it). In fact, some of the characters in them don't need to be escaped at all, but we will need to check what the intention for each regex was (and double check that it actually matches the intended thing correctly now...).