globalwordnet / english-wordnet

The Open English WordNet
https://en-word.net/

Missing sense keys (dc.identifier) #157

Closed: ekaf closed this issue 4 years ago

ekaf commented 5 years ago

In english-wordnet-2019.xml as well as in the lexicographic src directory, 1077 lemmas do not have a sense key (dc.identifier). Mostly, these correspond to the additions made in this WordNet version, and are very probably the reason for 681 missing sense keys in the index.sense file, and may also explain the 28 overloaded lex_ids in the Princeton-compatible data files.

jmccrae commented 5 years ago

I might turn this question around slightly: do we need sense keys? While they make sense for backwards compatibility with Princeton WordNet, we already have a scheme for identifying senses with XML IDs, and I am not sure I want to start coining new sense keys when the mechanism we have is not really related to Princeton WordNet's method. Moreover, Christiane Fellbaum insists that sense keys are a stable cross-version mapping method for WordNets, so if we start coining sense keys, that will create some compatibility issues, as we may coin different sense keys than PWN would.

ekaf commented 5 years ago

It depends on how much Princeton-compatibility is needed by the intended users.

Full compatibility could be nice, but it is difficult to maintain because of the many subtle pitfalls. Sense keys are especially desirable because of their stability across versions. On the other hand, it is possible (though unproven) that ILI identifiers might provide an equivalent level of stability. If the ILI provides a stable synset identifier, appending it to the lemma creates a stable sense key, since each lemma occurs only once in its synonym set.
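To illustrate the ILI idea, here is a minimal sketch. The key format (lemma, `%`, ILI identifier), the function names, and the identifier `i00000` are all hypothetical placeholders; nothing in this thread fixes an actual format.

```python
def ili_sense_key(lemma: str, ili_id: str) -> str:
    """Join a lemma and an ILI identifier into one key (hypothetical format)."""
    # Lowercase and use underscores for spaces, as PWN sense keys do.
    normalized = lemma.strip().lower().replace(" ", "_")
    return f"{normalized}%{ili_id}"


def keys_for_synset(ili_id: str, lemmas: list[str]) -> dict[str, str]:
    """Build one key per lemma, refusing duplicate lemmas within the synset."""
    keys: dict[str, str] = {}
    for lemma in lemmas:
        key = ili_sense_key(lemma, ili_id)
        if key in keys.values():
            # The scheme relies on each lemma occurring once per synset.
            raise ValueError(f"duplicate lemma in synset {ili_id}: {lemma!r}")
        keys[lemma] = key
    return keys


# "i00000" is a placeholder, not a real ILI identifier.
example_keys = keys_for_synset("i00000", ["car", "automobile"])
print(example_keys)
```

The duplicate check makes the uniqueness assumption explicit: the key is only unambiguous as long as a lemma appears once per synset.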

Fixing the 2019 release is not too hard, because there are no new synsets. Since the added lemmas are members of existing WN 3.1 synsets, and thus have a known lexical file number, providing Princeton-style sense keys for them only requires assigning appropriate lex_ids. The lex_id is 0 when the new lemma was not already in the lexical file. But when the lemma was already present, assigning a new lex_id requires checking all previous PWN releases, to avoid re-using a lex_id that was removed in the past, which would break the referential integrity of the resulting sense key.
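The lex_id rule can be sketched roughly as follows. This assumes the historical sense keys from all earlier PWN releases are already available as plain strings; the function names and the toy history are made up for illustration, and adjective-satellite keys (which also need a head word) are left out of the formatter.

```python
import re
from typing import Iterable

# PWN sense key layout: lemma%ss_type:lex_filenum:lex_id:head_word:head_id
SENSE_KEY = re.compile(
    r"^(?P<lemma>[^%]+)%(?P<ss_type>[1-5]):(?P<lex_filenum>\d{2}):"
    r"(?P<lex_id>\d{2}):(?P<head_word>[^:]*):(?P<head_id>\d{2})?$"
)


def next_lex_id(lemma: str, ss_type: int, lex_filenum: int,
                historical_keys: Iterable[str]) -> int:
    """Return the lowest lex_id never used for this lemma in this lexical file."""
    used = set()
    for key in historical_keys:
        m = SENSE_KEY.match(key)
        if m is None:
            continue  # skip anything that is not a well-formed sense key
        same_slot = (m["lemma"] == lemma
                     and int(m["ss_type"]) == ss_type
                     and int(m["lex_filenum"]) == lex_filenum)
        if same_slot:
            used.add(int(m["lex_id"]))
    lex_id = 0
    while lex_id in used:  # dropped ids stay in `used`, so they are never re-used
        lex_id += 1
    return lex_id


def format_sense_key(lemma: str, ss_type: int, lex_filenum: int, lex_id: int) -> str:
    """Format a key for a non-satellite sense (empty head_word and head_id)."""
    return f"{lemma}%{ss_type}:{lex_filenum:02d}:{lex_id:02d}::"


# Toy history, not real PWN data.
history = ["example%1:04:00::", "example%1:04:02::", "example%1:04:05::",
           "example%2:35:01::", "other%1:04:01::"]
new_id = next_lex_id("example", 1, 4, history)
print(format_sense_key("example", 1, 4, new_id))  # prints example%1:04:01::
```

Because the history covers every release, a lex_id that was dropped long ago still counts as used, which is the referential-integrity point above.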

Things could become more difficult when adding new synsets, because these will need new identifiers. Using ILI identifiers as the basis for a new sense-key format seems a reasonable and easier alternative.

ekaf commented 5 years ago

The Sense Key Index contains a handy db with all the sense keys for all PWN versions between 1.5 and 3.1.1, so I used https://github.com/ekaf/ski/blob/master/ski-pwn-sets.txt.gz to look up the 28 overloaded keys from the English WordNet 2019, and found that 6 were not assigned within the same lexical file in the past, so it would be possible to assign lex_id "0" for one sense, and lex_id "1" for the other:

arthur_m._schlesinger%1:18:00:: 11323254,11323420
battle_of_lake_trasimene%1:04:00:: 01286792,01302087
black_september_organization%1:14:00:: 08039618,08041284
guanyin%1:18:00:: 09566253,09567513
pantalone%1:18:00:: 09634967,09635105
turkistan_islamic_party%1:14:00:: 08040929,08047048
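For these six cases, the split into lex_id "0" and "1" could be scripted along these lines. This is only a sketch: the function name is made up, and giving the lower synset offset lex_id 0 is an arbitrary choice, not something decided in this thread.

```python
def split_overloaded(line: str) -> dict[str, str]:
    """Map each synset offset on a 'key offset,offset' line to its own sense key."""
    key, offsets = line.split()
    lemma, rest = key.split("%", 1)
    # Fields after '%': ss_type, lex_filenum, lex_id, head_word, head_id
    fields = rest.split(":")
    result = {}
    for new_id, offset in enumerate(sorted(offsets.split(","))):
        fields[2] = f"{new_id:02d}"  # overwrite the lex_id field
        result[offset] = lemma + "%" + ":".join(fields)
    return result


assignment = split_overloaded("guanyin%1:18:00:: 09566253,09567513")
for offset, sense_key in assignment.items():
    print(offset, sense_key)
# prints:
# 09566253 guanyin%1:18:00::
# 09567513 guanyin%1:18:01::
```

This shortcut is only safe where the lemma never appeared in that lexical file before; otherwise the new lex_id has to be checked against past releases first.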

The remaining 22 overloaded keys already existed in the specified lexical file, so it is not possible to use lex_id 0 for them without a manual lexicographic check. The majority were still valid in PWN 3.1.1, so assigning the lowest available lex_id (often "1") would be appropriate for the new sense.

Only 4 new word-lexfile pairs had lex_ids that were dropped in past updates. For example, the most complex case is the noun "shot", with 9 previous senses in the lexical file noun.act: 7 of these still persisted in PWN 3.1.1, with the lex_ids 00, 03, 04, 06, 07, 08 and 09, so these cannot be used. On the other hand, lex_ids 02 and 05 were dropped in the PWN 1.6 update, and it is possible (though not very likely) that these could correspond to the new sense.

In total, this boils down to asking whether the following sense keys from past PWN versions might correspond to a new sense. Often, this is probably not the case, so in principle it will not be possible to re-use these lex_ids either:

labyrinthine%5:00:01:complex:00 1.5:s:01669441, 1.5sc:s:01960729, 1.6:s:02092326, 1.7:s:02112018, 1.7.1:s:02113463, 2.0:s:02102832
shot%1:04:02:: 1.5:n:00062603, 1.5sc:n:00067858
shot%1:04:05:: 1.5:n:00283778, 1.5sc:n:00303901
silent%5:00:02:inarticulate:00 1.5:s:00123702
strike%2:35:04:: 1.5:v:00707926, 1.5sc:v:00759159, 1.6:v:00846512

Thus, it seems possible that full Princeton-compatibility for this release could be achieved without too much difficulty.