Incremental Parsing? - Githubissues

mentalisttraceur commented 2 years ago

So one thing that comes up in real-world parsing, is parsing part of a string, up to the end of the next parsed thing.

In other words you might want something like this:

remaining_text = '01-23-45-67-89-AF whatever ...'

mac, remaining_text = parse_next(remaining_text, MAC)

assert mac == MAC('01-23-45-67-89-AF')
assert remaining text == 'whatever ...'

And of course you want some sort of sensible error behavior.

Notably, right now macaddress doesn't help you do this. It assumes you already have strings which are the right size to pass in to the class constructor or parse.

Of course, neither does ipaddress... is this a deficit on the part of ipaddress? Or should this kind of functionality be provided somehow by higher-level wrapper libraries?

Notably:

If an incremental parsing library can't share parsing logic, then it has to reimplement a lot of the same parsing logic.
If an incremental parsing library has to reuse ipaddress or macaddress as it is right now, then they must necessarily take on an inefficiency:

Imagine you have a string that you want to parse for the next "token" being either an EUI48 or OUI literal. (Maybe you're writing a mini language for working with networking settings, like for scripting a router or something.)

(Quick aside: when parsing stuff from the front of something longer, parses might want to do either a greedy match or conservative match: if you see the string 01-02-03-04-05-06, do you conservatively match just OUI('01-02-03') and leave -04... unconsumed, or do you greedily match the whole EUI48('01-02-03-04-05-06')? In this use-case I think it's obvious that you'd want to do the greedy match, but in some cases you might want the conservative one.)

So the point is, from an efficiency perspective, ideally the parsing consumes 01-02-03, and then decides based on the next character whether or not it is done (and has parsed an OUI instance) or needs to continue to try to parse a full EUI48 instance.

(There's another permutation: if the next character is -, then sometimes you want to consider the OUI possibility as eliminated, so that if the rest doesn't parse into a full EUI48, it's now an error, and other times you want to consider the OUI possibility still "live", so that if the greedy match fails, you treat it as if you just parsed an OUI instance and stopped there, so that the next step of the parser can read from right after the OUI.)

In all of these cases though, it kinda makes sense to me to parse all the possibilities incrementally - check each character against all parse possibilities - the way I do in macaddress._parse. And while the logic of macaddress._parse assumes a string containing one complete address, and only tries the formats which match the size, it's probably pretty close to what an incremental parser would want in that case.

So it seems like a shame that I've written some careful logic, which can parse canonical and custom formats into all these hardware address types represented as distinctly typed objects in memory, and yet if someone wanted to write parser which parses hardware addresses from the next characters in a string, they'd basically have to reimplement a lot of what I did, or use what I did at the cost of a lot of boilerplate and inefficiency.

mentalisttraceur commented 2 years ago

Tentative verdict is to not bother supporting this, but to give it a big think.

Basically, I'd like to think up what a good implementation of this from outside this library looks like, especially one that still constructs instances of objects that this library provides.

mentalisttraceur commented 2 years ago

So I think these are the functons (untested, and obviously the majority is repetitive bulk that could probably be factored out):

def parse_longest(characters, *classes):
    candidates = {}
    for cls in classes:
        for format_ in cls.formats:
            candidates.setdefault(format_, cls)
    candidates = sorted(candidates.items())
    address = 0
    start = 0
    end = len(candidates)
    last_match = None
    last_match_length = 0
    for index, character in enumerate(characters):
        if character in _HEX_DIGITS:
            address <<= 4
            address += int(character, 16)
            character = 'x'
        elif character == 'x':
            character = ''
        while start < end and candidates[start][0][index] < character:
            start += 1
        while start < end and candidates[end - 1][0][index] > character:
            end -= 1
        if start >= end:
            break
        if len(candidates[start][0]) == index + 1:
            _, cls = candidates[start]
            offset = (4 - cls.size) & 3
            last_match = cls(address >> offset)
            last_match_length = index + 1
            start += 1
    if last_match is None:
        raise _value_error(string, 'cannot be parsed as', *classes)
    return last_match, last_match_length

def parse_longest_no_backtrack(characters, *classes):
    candidates = {}
    for cls in classes:
        for format_ in cls.formats:
            candidates.setdefault(format_, cls)
    candidates = sorted(candidates.items())
    address = 0
    start = 0
    end = len(candidates)
    for index, character in enumerate(characters):
        if character in _HEX_DIGITS:
            address <<= 4
            address += int(character, 16)
            character = 'x'
        elif character == 'x':
            character = ''
        while start < end and candidates[start][0][index] < character:
            start += 1
        while start < end and candidates[end - 1][0][index] > character:
            end -= 1
        if start >= end:
            raise _value_error(string, 'cannot be parsed as', *classes)
        if len(candidates[start][0]) == index + 1:
            if start == end - 1:
                break
            start += 1
    _, cls = candidates[start]
    offset = (4 - cls.size) & 3
    address >>= offset
    return cls(address), (index + 1)

def parse_shortest(characters, *classes):
    candidates = {}
    for cls in classes:
        for format_ in cls.formats:
            candidates.setdefault(format_, cls)
    candidates = sorted(candidates.items())
    address = 0
    start = 0
    end = len(candidates)
    matches = []
    for index, character in enumerate(characters):
        if character in _HEX_DIGITS:
            address <<= 4
            address += int(character, 16)
            character = 'x'
        elif character == 'x':
            character = ''
        while start < end and candidates[start][0][index] < character:
            start += 1
        while start < end and candidates[end - 1][0][index] > character:
            end -= 1
        if start >= end:
            raise _value_error(string, 'cannot be parsed as', *classes)
        if len(candidates[start][0]) == index + 1:
            break
    _, cls = candidates[start]
    offset = (4 - cls.size) & 3
    address >>= offset
    return cls(address), (index + 1)

mentalisttraceur commented 2 years ago

For comparison, here's the current macaddress._parse:

def _parse(string, *classes):
    length = len(string)
    if length < 1:
        raise ValueError('hardware address cannot be an empty string')
    candidates = {}
    for cls in classes:
        for format_ in cls.formats:
            if len(format_) == length:
                candidates.setdefault(format_, cls)
    candidates = sorted(candidates.items())
    address = 0
    start = 0
    end = len(candidates)
    for index in range(length):
        character = string[index]
        if character in _HEX_DIGITS:
            address <<= 4
            address += int(character, 16)
            character = 'x'
        elif character == 'x':
            character = ''
        while start < end and candidates[start][0][index] < character:
            start += 1
        while start < end and candidates[end - 1][0][index] > character:
            end -= 1
        if start >= end:
            raise _value_error(string, 'cannot be parsed as', *classes)
    _, cls = candidates[start]
    offset = (4 - cls.size) & 3
    address >>= offset
    return address, cls

mentalisttraceur commented 2 years ago

Notably, assuming characters is an iterator, we could replace:

    candidates = sorted(candidates.items())
    address = 0
    start = 0
    end = len(candidates)
    for index, character in enumerate(characters):

with

    length = max(map(len, candidates.keys()))
    candidates = sorted(candidates.items())
    address = 0
    start = 0
    end = len(candidates)
    for index in range(length):
        character = next(characters)

which is obviously more boilerplate but brings the code of the three partial/incremental parse functions closer in shape to the existing _parse.

mentalisttraceur commented 2 years ago

Notably, once we make them the same like that, the preambles of each function basically produce two things; length, candidates.

In the case of fixed-length input, length is the length of the input and candidates is just formats with that length.

In the case of unknown-length input, length is the length of the longest format and candidates is all candidates.

These two things could be pulled into their own functions:

def _parse_entire_string_preamble(string, classes):
    length = len(string)
    candidates = {}
    for cls in classes:
        for format_ in cls.formats:
            if len(format_) == length:
                candidates.setdefault(format_, cls)
    return length, sorted(candidates.items())

def _parse_from_front_preamble(classes):
    candidates = {}
    for cls in classes:
        for format_ in cls.formats:
            candidates.setdefault(format_, cls)
    length = max(map(len, candidates.keys()))
    return length = sorted(candidates.items())

I think we can remove the if length < 1: raise ValueError(...) case from the pre-amble, because the from-front-of-string code needs to handle "we ran out of bytes" cases too, and so we could handle them both by catching StopIteration.

mentalisttraceur commented 2 years ago

Could probably refactor all the parse-functions into a single generator which takes candidates, characters, and length, and then use little wrappers:

_parse would use the entire-string preamble, and feed string in as characters: there can only be one result or zero results, so we just use return next(iter(...)) on the generator.
parse_longest would loop getting each result and assigning it as the last result, catch the parsing error at the end, re-raise if there is no last result, or return the last result.
parse_longest_no_backtrack would be like parse_longest but without the try-catch.
parse_shortest would do return next(iter(...)) just like _parse.

The parse error would need to be tweaked (perhaps another reason against the over-helpful errors I've been doing, but in this case I should really give individual attention to what the best error messages in each case are).

mentalisttraceur commented 2 years ago

Ultimately though, if I was writing a parser, I would actually want this same functionality in an even more general way....

Because what these parse_* functions are doing is basically taking iterables of pieces (it just so happens in this case that the pieces are characters) and multiple alternatives of what sequences of pieces parse to what (which just so happens in this case to be specified with more characters where x has special meaning), and then using that.

So basically this is starting to approximate a parser combinator of sorts.

And ultimately when you're parsing iteratively you want different kinds of errors: if you expected characters and suddenly characters ended, you want to convey that explicitly; if you expected characters and got wrong characters, you want to convey that differently; and so on.

It does seem like an ideal solution maybe looks more like a parser and/or parser combinator library, and some sort of glue which can take macaddress classes and turn them and their formats definitions into equivalent stuff that can be passed into said parser (combinator) library.

mentalisttraceur commented 2 years ago

Verdict: dropping this idea until I one day have the time to either find a great parser library for Python which already exists, or to write my own (if so, one or more of these parse functions could probably be reused as a base for a function provided in that library).

Then this will be revisited to see how easy it is to glue that stuff together with macaddress classes.

mentalisttraceur / python-macaddress

Incremental Parsing? #19