mideind / Tokenizer

A tokenizer for Icelandic text

Inconsistent application of abbreviation expansion #11

Closed HaukurPall closed 4 years ago

HaukurPall commented 4 years ago

I noticed different handling of abbreviations between versions 1.4.0 and 2.0.0 in a test case of mine: test = "nr., gr., 1sti fyrsti, 1., 2ja, o.s.frv.". In particular, the handling of "gr." can differ between runs, and I've seen it return any one of several different expansions.

I know that the test case is out of context, in the sense that there is no correct answer out of these options. Regardless, I find the inconsistency of outputs troubling.

I briefly looked at the code and saw that a "set()" is used to hold abbreviations, which is probably the culprit.
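For illustration, a minimal sketch of why a set produces run-to-run differences, assuming the expansions are held as plain strings (hash randomization is on by default since Python 3.3):

# Run this script twice: the iteration order of a set of strings can differ
# between interpreter runs because string hashing is randomized by default.
meanings = {"grein", "greinir", "greiðsla", "grískur", "gramm"}
print(next(iter(meanings)))  # the "first" meaning is not stable across runs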

vthorsteinsson commented 4 years ago

One of the additions to Tokenizer 2.0 is to support multiple meanings for abbreviations. This means that a TOK.WORD token corresponding to an abbreviation can now have multiple items in its val list. In your example:

>>> import tokenizer as t
>>> s = t.tokenize("nr., gr., 1sti fyrsti, 1., 2ja, o.s.frv.")
>>> for tt in s: print(tt)
Tok(kind=11001, txt=None, val=(0, None))
Tok(kind=6, txt='nr.', val=[('númer', 0, 'hk', 'skst', 'nr.', '-')])
Tok(kind=1, txt=',', val=(3, ','))
Tok(kind=6, txt='gr.', val=[('grein', 0, 'kvk', 'skst', 'gr.', '-'), ('greinir', 0, 'kk', 'skst', 'gr.', '-'), ('greiðsla', 0, 'kvk', 'skst', 'gr.', '-'), ('grískur', 0, 'lo', 'skst', 'gr.', '-'), ('gramm', 0, 'hk', 'skst', 'gr.', '-')])
Tok(kind=1, txt=',', val=(3, ','))
Tok(kind=6, txt='1sti', val=None)
Tok(kind=6, txt='fyrsti', val=None)
Tok(kind=1, txt=',', val=(3, ','))
Tok(kind=10, txt='1.', val=1)
Tok(kind=1, txt=',', val=(3, ','))
Tok(kind=6, txt='2ja', val=None)
Tok(kind=1, txt=',', val=(3, ','))
Tok(kind=6, txt='o.s.frv', val=[('og svo framvegis', 0, 'ao', 'frasi', 'o.s.frv.', '-')])
Tok(kind=1, txt='.', val=(3, '.'))
Tok(kind=11002, txt=None, val=None)
>>>

As is apparent, "gr." now comes with multiple (five) meanings, so simply selecting val[0] would be an arbitrary choice. We do not have frequency data or other criteria to choose between meanings, so there is no obvious way to sort them within the list. Taking our cue from the nondeterminism of Python dicts and sets, we thus explicitly leave the priority sorting and the final decision to the user :-)
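For example (just a sketch, not part of the library), a caller that wants a single meaning per token can apply its own priority over the meaning tuples; the CATEGORY_PRIORITY mapping and pick_meaning helper below are purely illustrative, using the token fields (kind, txt, val) shown above:

import tokenizer

# Hypothetical priority over word categories (index 2 of each meaning tuple);
# lower value wins. Adjust to whatever makes sense for your application.
CATEGORY_PRIORITY = {"kvk": 0, "kk": 1, "hk": 2, "lo": 3, "ao": 4}

def pick_meaning(tok):
    """Pick a single meaning tuple from a TOK.WORD token, deterministically."""
    if not tok.val:
        return None
    # Tie-break on the expansion text so the result does not depend on list order
    return min(tok.val, key=lambda m: (CATEGORY_PRIORITY.get(m[2], 99), m[0]))

for tok in tokenizer.tokenize("nr., gr., o.s.frv."):
    if tok.kind == tokenizer.TOK.WORD and tok.val:
        print(tok.txt, "->", pick_meaning(tok)[0])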

If you have suggestions on an alternate way to handle this, they are welcome.

HaukurPall commented 4 years ago

Thank you for your reply and explanation.

An alternative way to handle this is to rely on insertion order and use a list instead of a set. Maintaining insertion order is not unheard of:

"Changed in version 3.7: Dictionary order is guaranteed to be insertion order. This behavior was an implementation detail of CPython from 3.6." (official Python documentation)

But this choice would be rather arbitrary and I'm not suggesting that it should be done. I thought I would raise this issue in case this was something you wanted to do.
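Concretely, the idea would look something like this (a sketch of the data-structure choice only, not the actual Tokenizer internals):

# Deduplicate meanings while keeping insertion order, instead of using a set.
# dict keys preserve insertion order as of Python 3.7 (and in CPython 3.6).
meanings = [("grein", "kvk"), ("greinir", "kk"), ("grein", "kvk"), ("gramm", "hk")]

as_set = set(meanings)                   # order not guaranteed to be stable across runs
ordered = list(dict.fromkeys(meanings))  # duplicates removed, insertion order kept

print(ordered)   # [('grein', 'kvk'), ('greinir', 'kk'), ('gramm', 'hk')]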

A case for having arbitrary ordering (my opinion of argument strength):

A case against arbitrary ordering:

This rather seems to indicate that val should be a set of tuples. But, again, I am not suggesting that this be changed. Rather, a few lines could be added to the documentation explaining that the ordering of val may change; they could go in a few different sections.
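If the ordering stays unspecified, the documentation could also suggest that consumers treat val as unordered, e.g. by comparing it as a set of tuples in tests. A sketch, using the "gr." expansions shown in the output above:

import tokenizer

toks = tokenizer.tokenize("nr., gr., 1sti fyrsti, 1., 2ja, o.s.frv.")
gr = next(t for t in toks if t.txt == "gr.")
expansions = {m[0] for m in gr.val}   # compare as a set, so ordering does not matter
assert "grein" in expansions and len(expansions) == 5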

vthorsteinsson commented 4 years ago

We have reconsidered this issue, and the order of abbreviation meanings should now be deterministic, as of version 2.0.1 of Tokenizer. Thanks for the input!