mideind / Tokenizer

A tokenizer for Icelandic text

Inconsistent application of abbreviation expansion #11

Closed HaukurPall closed 4 years ago

HaukurPall commented 4 years ago

I noticed different handling of abbreviations between versions 1.4.0 and 2.0.0 in a test case of mine: test = "nr., gr., 1sti fyrsti, 1., 2ja, o.s.frv.". In particular, the handling of "gr." can differ between runs, and I've seen it return any one of several different expansions.

I know that the test case is out of context, in the sense that there is no correct answer out of these options. Regardless, I find the inconsistency of outputs troubling.

I briefly looked at the code and saw that a "set()" is used to hold abbreviations, which is probably the culprit.
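For illustration, a minimal sketch of why a set produces run-to-run differences, assuming the expansions are held as plain strings (hash randomization is on by default since Python 3.3):

# Run this script twice: the iteration order of a set of strings can differ
# between interpreter runs because string hashing is randomized by default.
meanings = {"grein", "greinir", "greiðsla", "grískur", "gramm"}
print(next(iter(meanings)))  # the "first" meaning is not stable across runs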

vthorsteinsson commented 4 years ago

One of the additions to Tokenizer 2.0 is to support multiple meanings for abbreviations. This means that a TOK.WORD token corresponding to an abbreviation can now have multiple items in its val list. In your example:

>>> import tokenizer as t
>>> s = t.tokenize("nr., gr., 1sti fyrsti, 1., 2ja, o.s.frv.")
>>> for tt in s: print(tt)
Tok(kind=11001, txt=None, val=(0, None))
Tok(kind=6, txt='nr.', val=[('númer', 0, 'hk', 'skst', 'nr.', '-')])
Tok(kind=1, txt=',', val=(3, ','))
Tok(kind=6, txt='gr.', val=[('grein', 0, 'kvk', 'skst', 'gr.', '-'), ('greinir', 0, 'kk', 'skst', 'gr.', '-'), ('greiðsla', 0, 'kvk', 'skst', 'gr.', '-'), ('grískur', 0, 'lo', 'skst', 'gr.', '-'), ('gramm', 0, 'hk', 'skst', 'gr.', '-')])
Tok(kind=1, txt=',', val=(3, ','))
Tok(kind=6, txt='1sti', val=None)
Tok(kind=6, txt='fyrsti', val=None)
Tok(kind=1, txt=',', val=(3, ','))
Tok(kind=10, txt='1.', val=1)
Tok(kind=1, txt=',', val=(3, ','))
Tok(kind=6, txt='2ja', val=None)
Tok(kind=1, txt=',', val=(3, ','))
Tok(kind=6, txt='o.s.frv', val=[('og svo framvegis', 0, 'ao', 'frasi', 'o.s.frv.', '-')])
Tok(kind=1, txt='.', val=(3, '.'))
Tok(kind=11002, txt=None, val=None)
>>>

As is apparent, "gr." now comes with multiple (five) meanings, so simply selecting val[0] would be an arbitrary choice. We do not have frequency data or other criteria to choose between meanings, so there is no obvious way to sort them within the list. Taking our cue from the nondeterminism of Python dicts and sets, we thus explicitly leave the priority sorting and the final decision to the user :-)
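For example (just a sketch, not part of the library), a caller that wants a single meaning per token can apply its own priority over the meaning tuples; the CATEGORY_PRIORITY mapping and pick_meaning helper below are purely illustrative, using the token fields (kind, txt, val) shown above:

import tokenizer

# Hypothetical priority over word categories (index 2 of each meaning tuple);
# lower value wins. Adjust to whatever makes sense for your application.
CATEGORY_PRIORITY = {"kvk": 0, "kk": 1, "hk": 2, "lo": 3, "ao": 4}

def pick_meaning(tok):
    """Pick a single meaning tuple from a TOK.WORD token, deterministically."""
    if not tok.val:
        return None
    # Tie-break on the expansion text so the result does not depend on list order
    return min(tok.val, key=lambda m: (CATEGORY_PRIORITY.get(m[2], 99), m[0]))

for tok in tokenizer.tokenize("nr., gr., o.s.frv."):
    if tok.kind == tokenizer.TOK.WORD and tok.val:
        print(tok.txt, "->", pick_meaning(tok)[0])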

If you have suggestions on an alternate way to handle this, they are welcome.

HaukurPall commented 4 years ago

Thank you for your reply and explanation.

An alternative way to handle this is to rely on insertion order and use a list instead of a set. Maintaining insertion order is not unheard of:

"Changed in version 3.7: Dictionary order is guaranteed to be insertion order. This behavior was an implementation detail of CPython from 3.6." (official Python documentation)

But this choice would be rather arbitrary and I'm not suggesting that it should be done. I thought I would raise this issue in case this was something you wanted to do.
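Concretely, the idea would look something like this (a sketch of the data-structure choice only, not the actual Tokenizer internals):

# Deduplicate meanings while keeping insertion order, instead of using a set.
# dict keys preserve insertion order as of Python 3.7 (and in CPython 3.6).
meanings = [("grein", "kvk"), ("greinir", "kk"), ("grein", "kvk"), ("gramm", "hk")]

as_set = set(meanings)                   # order not guaranteed to be stable across runs
ordered = list(dict.fromkeys(meanings))  # duplicates removed, insertion order kept

print(ordered)   # [('grein', 'kvk'), ('greinir', 'kk'), ('gramm', 'hk')]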

A case for having arbitrary ordering (my opinion of argument strength):

A case against arbitrary ordering:

This rather seems to indicate that val should be a set of tuples. But, again, I am not suggesting that this be changed. Rather, a few lines could be added to the documentation explaining that the ordering of val may change; they could go in a few different sections.
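If the ordering stays unspecified, the documentation could also suggest that consumers treat val as unordered, e.g. by comparing it as a set of tuples in tests. A sketch, using the "gr." expansions shown in the output above:

import tokenizer

toks = tokenizer.tokenize("nr., gr., 1sti fyrsti, 1., 2ja, o.s.frv.")
gr = next(t for t in toks if t.txt == "gr.")
expansions = {m[0] for m in gr.val}   # compare as a set, so ordering does not matter
assert "grein" in expansions and len(expansions) == 5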

vthorsteinsson commented 4 years ago

We have reconsidered this issue, and the order of abbreviation meanings should now be deterministic, as of version 2.0.1 of Tokenizer. Thanks for the input!