Closed mentalisttraceur closed 2 years ago
Tentative verdict is to not bother supporting this, but to give it a big think.
Basically, I'd like to think up what a good implementation of this from outside this library looks like, especially one that still constructs instances of objects that this library provides.
So I think these are the functons (untested, and obviously the majority is repetitive bulk that could probably be factored out):
def parse_longest(characters, *classes):
candidates = {}
for cls in classes:
for format_ in cls.formats:
candidates.setdefault(format_, cls)
candidates = sorted(candidates.items())
address = 0
start = 0
end = len(candidates)
last_match = None
last_match_length = 0
for index, character in enumerate(characters):
if character in _HEX_DIGITS:
address <<= 4
address += int(character, 16)
character = 'x'
elif character == 'x':
character = ''
while start < end and candidates[start][0][index] < character:
start += 1
while start < end and candidates[end - 1][0][index] > character:
end -= 1
if start >= end:
break
if len(candidates[start][0]) == index + 1:
_, cls = candidates[start]
offset = (4 - cls.size) & 3
last_match = cls(address >> offset)
last_match_length = index + 1
start += 1
if last_match is None:
raise _value_error(string, 'cannot be parsed as', *classes)
return last_match, last_match_length
def parse_longest_no_backtrack(characters, *classes):
candidates = {}
for cls in classes:
for format_ in cls.formats:
candidates.setdefault(format_, cls)
candidates = sorted(candidates.items())
address = 0
start = 0
end = len(candidates)
for index, character in enumerate(characters):
if character in _HEX_DIGITS:
address <<= 4
address += int(character, 16)
character = 'x'
elif character == 'x':
character = ''
while start < end and candidates[start][0][index] < character:
start += 1
while start < end and candidates[end - 1][0][index] > character:
end -= 1
if start >= end:
raise _value_error(string, 'cannot be parsed as', *classes)
if len(candidates[start][0]) == index + 1:
if start == end - 1:
break
start += 1
_, cls = candidates[start]
offset = (4 - cls.size) & 3
address >>= offset
return cls(address), (index + 1)
def parse_shortest(characters, *classes):
candidates = {}
for cls in classes:
for format_ in cls.formats:
candidates.setdefault(format_, cls)
candidates = sorted(candidates.items())
address = 0
start = 0
end = len(candidates)
matches = []
for index, character in enumerate(characters):
if character in _HEX_DIGITS:
address <<= 4
address += int(character, 16)
character = 'x'
elif character == 'x':
character = ''
while start < end and candidates[start][0][index] < character:
start += 1
while start < end and candidates[end - 1][0][index] > character:
end -= 1
if start >= end:
raise _value_error(string, 'cannot be parsed as', *classes)
if len(candidates[start][0]) == index + 1:
break
_, cls = candidates[start]
offset = (4 - cls.size) & 3
address >>= offset
return cls(address), (index + 1)
For comparison, here's the current macaddress._parse
:
def _parse(string, *classes):
length = len(string)
if length < 1:
raise ValueError('hardware address cannot be an empty string')
candidates = {}
for cls in classes:
for format_ in cls.formats:
if len(format_) == length:
candidates.setdefault(format_, cls)
candidates = sorted(candidates.items())
address = 0
start = 0
end = len(candidates)
for index in range(length):
character = string[index]
if character in _HEX_DIGITS:
address <<= 4
address += int(character, 16)
character = 'x'
elif character == 'x':
character = ''
while start < end and candidates[start][0][index] < character:
start += 1
while start < end and candidates[end - 1][0][index] > character:
end -= 1
if start >= end:
raise _value_error(string, 'cannot be parsed as', *classes)
_, cls = candidates[start]
offset = (4 - cls.size) & 3
address >>= offset
return address, cls
Notably, assuming characters
is an iterator, we could replace:
candidates = sorted(candidates.items())
address = 0
start = 0
end = len(candidates)
for index, character in enumerate(characters):
with
length = max(map(len, candidates.keys()))
candidates = sorted(candidates.items())
address = 0
start = 0
end = len(candidates)
for index in range(length):
character = next(characters)
which is obviously more boilerplate but brings the code of the three partial/incremental parse functions closer in shape to the existing _parse
.
Notably, once we make them the same like that, the preambles of each function basically produce two things; length, candidates
.
In the case of fixed-length input, length
is the length of the input and candidates
is just formats with that length.
In the case of unknown-length input, length
is the length of the longest format and candidates
is all candidates.
These two things could be pulled into their own functions:
def _parse_entire_string_preamble(string, classes):
length = len(string)
candidates = {}
for cls in classes:
for format_ in cls.formats:
if len(format_) == length:
candidates.setdefault(format_, cls)
return length, sorted(candidates.items())
def _parse_from_front_preamble(classes):
candidates = {}
for cls in classes:
for format_ in cls.formats:
candidates.setdefault(format_, cls)
length = max(map(len, candidates.keys()))
return length = sorted(candidates.items())
I think we can remove the if length < 1: raise ValueError(...)
case from the pre-amble, because the from-front-of-string code needs to handle "we ran out of bytes" cases too, and so we could handle them both by catching StopIteration
.
Could probably refactor all the parse-functions into a single generator which takes candidates, characters, and length, and then use little wrappers:
_parse
would use the entire-string preamble, and feed string
in as characters
: there can only be one result or zero results, so we just use return next(iter(...))
on the generator.parse_longest
would loop getting each result and assigning it as the last result, catch the parsing error at the end, re-raise if there is no last result, or return the last result.parse_longest_no_backtrack
would be like parse_longest
but without the try-catch.parse_shortest
would do return next(iter(...))
just like _parse
.The parse error would need to be tweaked (perhaps another reason against the over-helpful errors I've been doing, but in this case I should really give individual attention to what the best error messages in each case are).
Ultimately though, if I was writing a parser, I would actually want this same functionality in an even more general way....
Because what these parse_*
functions are doing is basically taking iterables of pieces (it just so happens in this case that the pieces are characters) and multiple alternatives of what sequences of pieces parse to what (which just so happens in this case to be specified with more characters where x
has special meaning), and then using that.
So basically this is starting to approximate a parser combinator of sorts.
And ultimately when you're parsing iteratively you want different kinds of errors: if you expected characters and suddenly characters ended, you want to convey that explicitly; if you expected characters and got wrong characters, you want to convey that differently; and so on.
It does seem like an ideal solution maybe looks more like a parser and/or parser combinator library, and some sort of glue which can take macaddress
classes and turn them and their formats
definitions into equivalent stuff that can be passed into said parser (combinator) library.
Verdict: dropping this idea until I one day have the time to either find a great parser library for Python which already exists, or to write my own (if so, one or more of these parse
functions could probably be reused as a base for a function provided in that library).
Then this will be revisited to see how easy it is to glue that stuff together with macaddress
classes.
So one thing that comes up in real-world parsing, is parsing part of a string, up to the end of the next parsed thing.
In other words you might want something like this:
And of course you want some sort of sensible error behavior.
Notably, right now
macaddress
doesn't help you do this. It assumes you already have strings which are the right size to pass in to the class constructor orparse
.Of course, neither does
ipaddress
... is this a deficit on the part ofipaddress
? Or should this kind of functionality be provided somehow by higher-level wrapper libraries?Notably:
If an incremental parsing library has to reuse
ipaddress
ormacaddress
as it is right now, then they must necessarily take on an inefficiency:Imagine you have a string that you want to parse for the next "token" being either an EUI48 or OUI literal. (Maybe you're writing a mini language for working with networking settings, like for scripting a router or something.)
(Quick aside: when parsing stuff from the front of something longer, parses might want to do either a greedy match or conservative match: if you see the string
01-02-03-04-05-06
, do you conservatively match justOUI('01-02-03')
and leave-04...
unconsumed, or do you greedily match the wholeEUI48('01-02-03-04-05-06')
? In this use-case I think it's obvious that you'd want to do the greedy match, but in some cases you might want the conservative one.)So the point is, from an efficiency perspective, ideally the parsing consumes
01-02-03
, and then decides based on the next character whether or not it is done (and has parsed anOUI
instance) or needs to continue to try to parse a fullEUI48
instance.(There's another permutation: if the next character is
-
, then sometimes you want to consider theOUI
possibility as eliminated, so that if the rest doesn't parse into a fullEUI48
, it's now an error, and other times you want to consider theOUI
possibility still "live", so that if the greedy match fails, you treat it as if you just parsed anOUI
instance and stopped there, so that the next step of the parser can read from right after the OUI.)In all of these cases though, it kinda makes sense to me to parse all the possibilities incrementally - check each character against all parse possibilities - the way I do in
macaddress._parse
. And while the logic ofmacaddress._parse
assumes a string containing one complete address, and only tries the formats which match the size, it's probably pretty close to what an incremental parser would want in that case.So it seems like a shame that I've written some careful logic, which can parse canonical and custom formats into all these hardware address types represented as distinctly typed objects in memory, and yet if someone wanted to write parser which parses hardware addresses from the next characters in a string, they'd basically have to reimplement a lot of what I did, or use what I did at the cost of a lot of boilerplate and inefficiency.