Where is the split located between two zones when there is more than one possibility?

mcoulont commented 1 year ago

Hello.

In the following code, the string "AB=2," can be split in 3 different ways:

from parse import parse, with_pattern

regex1 = "[0-9A-Z]+"
regex2 = "[A-Z]+=[0-9]+"
regex = "(" + regex1 + ")|(" + regex2 + ")"

# Custom type whose pattern is the alternation of the two regexes above.
@with_pattern(regex)
def parse_AlphanumericMaybeInEquality(text):
    return text

print(parse(
    "{beginning:AlphanumericMaybeInEquality}{end}",
    "AB=2,",
    dict(AlphanumericMaybeInEquality=parse_AlphanumericMaybeInEquality)
))

If the shortest first term is chosen, beginning='A' and end='B=2,'.

If the longest first term is chosen, beginning='AB=2' and end=',':

import re
print(re.match('^' + regex2, "AB=2,"))

returns

<re.Match object; span=(0, 4), match='AB=2'>

Yet we get neither:

<Result () {'beginning': 'AB', 'end': '=2,'}>

What's supposed to happen?

Thanks for your work

jenisys commented 1 year ago

Basically, you, as the provider of the pattern/type-definition/type-converter, are aware of the ambiguities (in your pattern and in the data that you are using) and can provide the correct/best definition of how to handle them.

Therefore, if you provide:

regex = "(" + regex2 + ")|(" + regex1 + ")"
# OR BETTER: regex = f"({regex2})|({regex1})"

you get the expected result:

<Result () {'beginning': 'AB=2', 'end': ','}>
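Both results can be reproduced with the re module alone. The sketch below is only an approximation: it assumes that parse anchors the whole pattern to the full input and that an untyped field such as {end} becomes a lazy catch-all (.+?). The helper name split_like_parse is hypothetical and not part of parse.

```python
import re

regex1 = "[0-9A-Z]+"
regex2 = "[A-Z]+=[0-9]+"


def split_like_parse(type_pattern, text):
    """Approximate "{beginning:Type}{end}" as one anchored regex.

    Assumption: this imitates, but is not, the regex that parse builds.
    """
    full = r"\A(?P<beginning>" + type_pattern + r")(?P<end>.+?)\Z"
    m = re.match(full, text)
    return m.groupdict() if m else None


# Alternative order from the question: regex1 is tried first.
short_first = split_like_parse(f"({regex1})|({regex2})", "AB=2,")
# Alternative order from the answer: regex2 is tried first.
long_first = split_like_parse(f"({regex2})|({regex1})", "AB=2,")

print(short_first)  # {'beginning': 'AB', 'end': '=2,'}
print(long_first)   # {'beginning': 'AB=2', 'end': ','}
```

With regex1 first, the greedy [0-9A-Z]+ takes 'AB' and the catch-all absorbs the rest, so the engine never needs to try regex2. 'A' alone is never chosen because + is greedy and the longer 'AB' already lets the whole pattern succeed.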

NOTES:

The (...|...) alternation in Python regular expressions is evaluated sequentially: the alternatives are tried from left to right, and the first one that matches is taken.
Therefore, put the longest possible match at the beginning of your list of alternatives.
You can easily check this with: re.match(regex, "AB=2,")
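The check suggested above can be extended slightly to show which alternative actually won. This is a minimal sketch using only the standard re module; the helper name winning_branch is hypothetical and introduced for illustration.

```python
import re

regex1 = "[0-9A-Z]+"
regex2 = "[A-Z]+=[0-9]+"


def winning_branch(regex, text):
    """Return (index of the capture group that matched, matched text)."""
    m = re.match(regex, text)
    if m is None:
        return None
    return m.lastindex, m.group(0)


short_first = winning_branch(f"({regex1})|({regex2})", "AB=2,")
long_first = winning_branch(f"({regex2})|({regex1})", "AB=2,")
# The second alternative is still used when the first cannot match.
fallback = winning_branch(f"({regex2})|({regex1})", "12")

print(short_first)  # (1, 'AB')
print(long_first)   # (1, 'AB=2')
print(fallback)     # (2, '12')
```

Placing the more specific pattern first does not disable the more general one: it only gives the specific pattern the first chance to match.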

OTHERWISE:

You need to use more advanced mechanisms, like: parse_type.TypeBuilder.make_variant()

mcoulont commented 1 year ago

OK, thanks, I've learnt something today.

Sorry for the disturbance.

r1chardj0n3s / parse

Where is the split located between two zones when there is more than one possibility? #161