unicode regex issue

pickettj commented 4 years ago

@iamlemec

I have a dictionary of texts consisting of a list of paragraphs. Those "paragraphs" all begin with a manually entered line number, e.g.:

5.23.8aud agar ēg abāz dāštan nē šāyist pad abdom abāz dāštan čiyōn šāyēd

I am attempting to separate the number and the line into separate list elements. I have the first part of that working already:

pahlavi_corpus_lines = {}
for work in pahlavi_corpus:
    segment = {}
    for para in pahlavi_corpus[work]:
        num_pattern = re.compile(r"^[0-9]{0,3}\.[0-9]{0,3}\.?[0-9]{0,3}")
        num_match = re.match(num_pattern, pahlavi_corpus[work][para])
        if num_match:
            num = num_match.group(0)
        else:
            num = "--"
        line_pattern = re.compile(r"[^(0-9|\.)]*")
        line_match = re.match(line_pattern, pahlavi_corpus[work][para])
        if line_match:
            line = line_match.group(0)
        else:
            line = "-----"
        segment[para] = [num, line]
    pahlavi_corpus_lines[work] = segment

However, extracting the lines themselves is not working (error message: "AttributeError: 'NoneType' object has no attribute 'group'"), even though I'm pretty sure my regex is fine. I believe the issue is my irregular characters (e.g. ud čē rāy pēš nē āmad), which require some kind of special unicode instructions. But the solutions I'm finding do not seem to work. E.g. adding a unicode flag does not seem to work (re.compile(r"hanger", re.UNICODE)), and I think the u prefix may only be for Python 2 (?) (re.compile(ur"hanger")).

Help?

iamlemec commented 4 years ago

It seems impossible that you would end up trying to access group if line_match or num_match is None, but just in case, I think you should make those if statements if num_match is not None and if line_match is not None.

Also for the second regex, I think you want ^[(0-9|\.)]* instead.

pickettj commented 4 years ago

@iamlemec

That doesn't seem to quite do the trick, but the following might help isolate the problem:

test_string = "5.23.5ud čē rāy pēš nē āmad"
num_pattern = re.compile(r"ud")
result = re.match(num_pattern, test_string)
print(result.group(0))

That code results in the following error message:

AttributeError                            Traceback (most recent call last)
in ()
      2 num_pattern = re.compile(r"ud")
      3 result = re.match(num_pattern, test_string)
----> 4 print(result.group(0))

AttributeError: 'NoneType' object has no attribute 'group'

However, the regex code I used for the number works just fine:

test_string = "5.23.5ud čē rāy pēš nē āmad"
num_pattern = re.compile(r"^[0-9]{0,3}\.[0-9]{0,3}\.?[0-9]{0,3}")
result = re.match(num_pattern, test_string)
print(result.group(0))

Returns the number 5.23.5, as intended. But the weird part is that ud isn't even a special character, it's just a simple string match, so I don't get it.

Doesn't shifting the caret the way you suggested (^[(0-9|\.)]*) turn that into a front anchor? I want to use it to exclude. When I tested it, [^(0-9|\.)]* resulted in a match for everything but the initial line number, e.g. the "ud čē rāy pēš nē āmad" part of 5.23.5ud čē rāy pēš nē āmad.
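
For reference, a minimal sketch of how that negated class behaves in Python, using the test string from this thread. The names original, empty, and text are illustrative only. Inside square brackets, (, | and ) are literal characters, so the class just excludes digits, dots, parentheses, and pipes. Because of the *, re.match is satisfied with zero characters at position 0, so it returns an empty match rather than the text:

```python
import re

test_string = "5.23.5ud čē rāy pēš nē āmad"

# Inside [...] the characters (, | and ) are literal, so this class
# excludes digits, dots, parentheses and pipes.
original = re.compile(r"[^(0-9|\.)]*")

# re.match starts at position 0, where "5" is excluded, so `*` matches
# zero characters: the result is an empty string, not None.
empty = original.match(test_string).group(0)

# Requiring at least one character (+) and scanning with re.search
# picks up the first run of non-digit, non-dot characters.
text = re.search(r"[^0-9.]+", test_string).group(0)

print(repr(empty))
print(repr(text))
```

A regex tester that highlights every match in the string would show the long run of text, which may be why the pattern looked correct there.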

iamlemec commented 4 years ago

For the first part, keep in mind that match only matches things from the start of the string. You need to use search if you want it to look anywhere in the string. So that would explain the behavior with "ud".
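
A short sketch of the difference, using the same test string (the variable names are illustrative). It also shows that in Python 3 a str pattern is Unicode-aware by default, so no re.UNICODE flag or u prefix should be needed for characters like č or ē:

```python
import re

test_string = "5.23.5ud čē rāy pēš nē āmad"

# re.match only tries position 0; the string starts with "5", so no match.
at_start = re.match(r"ud", test_string)

# re.search scans the whole string and finds "ud" after the line number.
anywhere = re.search(r"ud", test_string)

# Non-ASCII characters in a str pattern work without any extra flag.
non_ascii = re.search(r"čē", test_string)

print(at_start)          # None
print(anywhere.span())   # (6, 8)
print(non_ascii is not None)
```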

Yeah, I was confused about the ^. You're right there. If you want to get the rest of the string though, just use the match object returned from the first regex. You can call its end() method (e.g. num_match.end()) to get the end position of the numeric match and index the string from that point onward.
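
Something along these lines, as a rough sketch. split_line is a hypothetical helper name, and the "--" placeholder is borrowed from the code in the original post:

```python
import re

# Same number pattern as in the original post, compiled once.
NUM_PATTERN = re.compile(r"^[0-9]{0,3}\.[0-9]{0,3}\.?[0-9]{0,3}")

def split_line(para):
    """Return (number, rest of line) for one paragraph string."""
    num_match = NUM_PATTERN.match(para)
    if num_match is None:
        # No leading line number: keep the whole paragraph as the line.
        return "--", para
    # end() is the index just past the number, so slicing from there
    # yields everything after it, whatever characters it contains.
    return num_match.group(0), para[num_match.end():]

num, line = split_line("5.23.5ud čē rāy pēš nē āmad")
print(num)
print(line)
```

This avoids the second regex entirely, so the contents of the line never have to be described by a pattern.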

pickettj / pahlavi_digital_projects

unicode regex issue #2