vidraj / derinet

The main repository for the DeriNet project and all its dependencies and related tools.
https://ufal.mff.cuni.cz/derinet

Conceptual problems with Lexeme identity and sorting #24

Open vidraj opened 3 months ago

vidraj commented 3 months ago

When saving the database, the lexemes have to be printed in sorted order to make the process deterministic. To achieve this, we define __lt__ on the Lexeme class, which is documented to be the only rich comparison required by the built-in sorting routines. The sorting takes into account the lemma, POS, lemid and techlemma, which makes the sorting fully deterministic for the Czech data, as no two lexemes have an identical (techlemma, POS) pair.
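The comparison described above can be sketched as follows. This is a minimal, simplified `Lexeme`, not the real DeriNet class; the constructor signature and default values are assumptions made for illustration.

```python
class Lexeme:
    """Simplified sketch of a lexeme carrying the four fields used for sorting."""

    def __init__(self, lemma, pos, lemid="", techlemma=""):
        self.lemma = lemma
        self.pos = pos
        self.lemid = lemid
        self.techlemma = techlemma

    def __lt__(self, other):
        # Compare the four fields lexicographically; sorted() and list.sort()
        # only require __lt__, so this alone makes serialization deterministic
        # as long as no two lexemes share all four values.
        return (self.lemma, self.pos, self.lemid, self.techlemma) \
             < (other.lemma, other.pos, other.lemid, other.techlemma)
```

With Czech data, the techlemma disambiguates homonyms, so the sort order is stable across runs.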

However, it still leaves us with two issues:

  1. Some languages might not have techlemmas, but might have homonyms with identical POS-tags. This would result in nondeterministic database serialization.
  2. The Python style guide strongly recommends defining all six rich comparison operators, even when you only need one.

Resolving these seems impossible, though.


We currently have three rich comparison operators defined – __lt__ as above, and __eq__ and __ne__ using the default object identity semantics (equivalent to is). We could also define __gt__ as the inverse of __lt__. But the __lt__ we have doesn't form a total order with this definition of __eq__: for two homonymous lexemes x and y, it can happen that (not x < y) and (not y < x), yet x != y. So we would need to change one of the comparisons.
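The violation can be demonstrated directly. This sketch uses a stripped-down `Lexeme` with only (lemma, POS) in the sort key, i.e. a language without techlemmas; leaving `__eq__` undefined gives Python's default identity comparison, as in the issue.

```python
class Lexeme:
    """Stripped-down lexeme: __lt__ on (lemma, pos) only, default identity __eq__."""

    def __init__(self, lemma, pos):
        self.lemma = lemma
        self.pos = pos

    def __lt__(self, other):
        return (self.lemma, self.pos) < (other.lemma, other.pos)


x = Lexeme("unlockable", "ADJ")
y = Lexeme("unlockable", "ADJ")

# Neither precedes the other, yet they are distinct objects and thus unequal.
# A total order would require exactly one of x < y, y < x, x == y to hold.
assert not (x < y)
assert not (y < x)
assert x != y
```

For a total order, trichotomy must hold: exactly one of `x < y`, `y < x`, `x == y` is true for every pair. Homonyms break it.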

There is no possible definition of __lt__ that would form a total order with this __eq__, because if you create Lexeme("unlockable", "ADJ") and Lexeme("unlockable", "ADJ"), how do you distinguish between them? They are not identical objects (is returns False) and therefore also not __eq__, but there is no rule that would make one naturally __lt__ the other while surviving database reloads.

And the definition of __eq__ must not be changed, because we want to be able to put Lexemes into sets – that's how we e.g. keep track of visited lexemes when iterating through the DB. If __eq__ compared by content, homonyms would coalesce when inserted into a set. In other words, the lexemes above must be unequal – they seem the same now, but one could later be given different derivational ancestry from the other without changing the __eq__ result (an __eq__ that varies with mutable state would make the type effectively unhashable).
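The set behavior we rely on falls out of identity-based equality for free, since Python's default `__hash__` is also identity-based. A minimal sketch (the class and lemma are illustrative, not the real DeriNet API):

```python
class Lexeme:
    """No __eq__/__hash__ defined: Python defaults to object identity,
    so two homonymous lexemes stay distinct in a set."""

    def __init__(self, lemma, pos):
        self.lemma = lemma
        self.pos = pos


visited = set()
x = Lexeme("unlockable", "ADJ")  # e.g. "able to be unlocked"
y = Lexeme("unlockable", "ADJ")  # e.g. "not able to be locked"

visited.add(x)
visited.add(y)

assert len(visited) == 2          # homonyms do not coalesce
assert x in visited and y in visited
```

Note that defining a content-based `__eq__` without `__hash__` would set `__hash__` to None and make the class unhashable, which is exactly the trap mentioned above.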


So it seems we can't define a total order, and therefore should avoid defining __le__ and __ge__? I'll leave this issue here in case someone figures something out.