Closed HaukurPall closed 4 years ago
One of the additions in Tokenizer 2.0 is support for multiple meanings for abbreviations. This means that a TOK.WORD token corresponding to an abbreviation can now have multiple items in its val list. In your example:
>>> import tokenizer as t
>>> s = t.tokenize("nr., gr., 1sti fyrsti, 1., 2ja, o.s.frv.")
>>> for tt in s: print(tt)
Tok(kind=11001, txt=None, val=(0, None))
Tok(kind=6, txt='nr.', val=[('númer', 0, 'hk', 'skst', 'nr.', '-')])
Tok(kind=1, txt=',', val=(3, ','))
Tok(kind=6, txt='gr.', val=[('grein', 0, 'kvk', 'skst', 'gr.', '-'), ('greinir', 0, 'kk', 'skst', 'gr.', '-'), ('greiðsla', 0, 'kvk', 'skst', 'gr.', '-'), ('grískur', 0, 'lo', 'skst', 'gr.', '-'), ('gramm', 0, 'hk', 'skst', 'gr.', '-')])
Tok(kind=1, txt=',', val=(3, ','))
Tok(kind=6, txt='1sti', val=None)
Tok(kind=6, txt='fyrsti', val=None)
Tok(kind=1, txt=',', val=(3, ','))
Tok(kind=10, txt='1.', val=1)
Tok(kind=1, txt=',', val=(3, ','))
Tok(kind=6, txt='2ja', val=None)
Tok(kind=1, txt=',', val=(3, ','))
Tok(kind=6, txt='o.s.frv', val=[('og svo framvegis', 0, 'ao', 'frasi', 'o.s.frv.', '-')])
Tok(kind=1, txt='.', val=(3, '.'))
Tok(kind=11002, txt=None, val=None)
>>>
As is apparent, "gr." now comes with multiple (five) meanings, so simply selecting val[0] is now an arbitrary choice. We do not have frequency data or other criteria to choose between meanings, so there is no obvious way to sort them within the list. We thus, taking our cue from the nondeterminism of Python dicts and sets, explicitly leave the priority sorting and decision to the user :-)
If you have suggestions on an alternate way to handle this, they are welcome.
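For anyone who wants a stable choice on the user side, here is a minimal sketch of picking one meaning without relying on the order of val. The meaning tuples are copied from the "gr." output above. The preference list and the pick_meaning helper are hypothetical and not part of Tokenizer, and treating the third tuple field as a word category tag is an assumption based on that output.

```python
# Meanings for "gr." copied from the output above; in real use this would be tok.val.
GR_MEANINGS = [
    ("grein", 0, "kvk", "skst", "gr.", "-"),
    ("greinir", 0, "kk", "skst", "gr.", "-"),
    ("greiðsla", 0, "kvk", "skst", "gr.", "-"),
    ("grískur", 0, "lo", "skst", "gr.", "-"),
    ("gramm", 0, "hk", "skst", "gr.", "-"),
]

# Hypothetical, user-supplied preference over the third tuple field
# (assumed here to be a word category tag).
PREFERRED_CATEGORIES = ["hk", "kvk", "kk", "lo"]


def pick_meaning(meanings, preferred=PREFERRED_CATEGORIES):
    """Return one meaning, chosen independently of the order of `meanings`."""

    def sort_key(meaning):
        category = meaning[2]
        # Categories missing from the preference list rank last.
        rank = preferred.index(category) if category in preferred else len(preferred)
        # Ties are broken by comparing the expanded form as a string.
        return (rank, meaning[0])

    return min(meanings, key=sort_key)


chosen = pick_meaning(GR_MEANINGS)
print(chosen[0])  # gramm
```

Because the key depends only on the contents of each tuple, the result is the same no matter how the list happens to be ordered in a given run.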
Thank you for your reply and explanation.
An alternate way to handle this is to use the insertion order and thus use a list instead. It is not unheard of to maintain the insertion order:
"Changed in version 3.7: Dictionary order is guaranteed to be insertion order. This behavior was an implementation detail of CPython from 3.6." (Official Python documentation)
But this choice would be rather arbitrary and I'm not suggesting that it should be done. I thought I would raise this issue in case this was something you wanted to do.
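To illustrate the insertion-order idea, here is a small sketch that removes duplicates with a dict instead of a set, so the first-seen order is kept. The dedupe_keep_order helper and the shortened two-field tuples are made up for illustration; this is not how Tokenizer stores its abbreviations.

```python
def dedupe_keep_order(items):
    """Drop duplicates while keeping first-seen order.

    Relies on dicts preserving insertion order (guaranteed since Python 3.7).
    """
    return list(dict.fromkeys(items))


# Illustrative, shortened meaning tuples with one duplicate entry.
meanings = [
    ("grein", "kvk"),
    ("greinir", "kk"),
    ("grein", "kvk"),
    ("gramm", "hk"),
]

ordered = dedupe_keep_order(meanings)
print(ordered)  # [('grein', 'kvk'), ('greinir', 'kk'), ('gramm', 'hk')]
```

A set would remove the duplicate just as well, but its iteration order for strings can differ between interpreter runs, which matches the symptom described in this issue.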
A case for having arbitrary ordering (my opinion of argument strength):
val is a list, which implies some order from the user perspective. (strong)
A case against arbitrary ordering:
This seems to indicate that val should instead be a set of tuples. But, again, I am not suggesting this should be changed. Rather, add a few lines to the documentation explaining that the ordering of val might change. It could be added in a few different sections.
TOK.WORD explanation in the "The val field" section.
We have reconsidered this issue, and the order of abbreviation meanings should now be deterministic, as of version 2.0.1 of Tokenizer. Thanks for the input!
I noticed different handling of abbreviations between version 1.4.0 and 2.0.0 in a test case of mine.
test = "nr., gr., 1sti fyrsti, 1., 2ja, o.s.frv."
In particular, the handling of "gr." can differ between runs and I've seen it return one of:
I know that the test case is out of context, in the sense that there is no correct answer among these options. Regardless, I find the inconsistency of outputs troubling.
I briefly looked at the code and saw that "set()" is used to hold abbreviations, which is probably the culprit.