own-pt / mill

crunch textual wordnet data
Apache License 2.0

sense key backwards compatibility #40

Closed odanoburu closed 4 years ago

odanoburu commented 4 years ago

in mill we decided to have a uniform syntax for synset description (no different syntax for adjective cluster synsets), with each wordsense having a unique ID in the same wordnet, given by its lexicographer file, lexical form and lexical ID. in the PWN this ID scheme does not work; there are several adjective satellite wordsenses which share the same mill ID, as this query shows.

we 'solved' this problem by splitting the adjective satellites into their own lexicographer file (adjs.all; this name is not hardcoded in mill but set in a configuration file) and updating their lexical IDs, which broke the backward compatibility of sense keys as defined in the PWN documentation.

odanoburu commented 4 years ago

note: even if we lose exact backwards compatibility, we can always determine which wordsense a sense key points to using only the information present in the key: we can safely discard the lexical ID of the satellite synset, since no head synset has two satellite adjectives with wordsenses sharing the same lexical form. the following query, which should return no results, checks this:

select distinct ?s
where {
  ?s rdf:type ns1:AdjectiveSynset .
  ?s ns1:similarTo ?ss1 .
  ?s ns1:similarTo ?ss2 .
  ?ss1 ns1:word ?w .
  ?ss2 ns1:word ?w .
  filter (?ss1 != ?ss2)
}

naturally, we'll need to run this query as a test to ensure that this remains the case (i.e. that it still returns no results) after edits.
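
for illustration, the same invariant can be checked outside the triple store. this is only a sketch under assumptions, not how mill implements it: the function name, the two dictionaries (head synset to satellites, satellite to lexical forms) and the toy synset identifiers are all hypothetical.

```python
def heads_with_ambiguous_satellites(similar_to, words):
    """Return the head synsets that have two distinct satellites
    sharing a lexical form (the case the query above must never find)."""
    offenders = set()
    for head, satellites in similar_to.items():
        seen = {}  # lexical form -> first satellite in which it was seen
        for sat in satellites:
            for form in words.get(sat, ()):
                if form in seen and seen[form] != sat:
                    offenders.add(head)
                seen.setdefault(form, sat)
    return offenders


# hypothetical toy data, not taken from the PWN
similar_to = {
    "head-a": ["sat-1", "sat-2"],
    "head-b": ["sat-3", "sat-4"],
}
words = {
    "sat-1": ["quick"],
    "sat-2": ["fast"],
    "sat-3": ["dim"],
    "sat-4": ["dim", "faint"],  # shares "dim" with sat-3 under head-b
}

offenders = heads_with_ambiguous_satellites(similar_to, words)
```

an empty result means the lexical ID of a satellite can still be discarded safely; a non-empty one lists the head synsets where a sense key would become ambiguous.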

odanoburu commented 4 years ago

we have since decided not to split satellite adjectives into their own lexicographer file, and to make the similarTo relation one-way instead. adjective synsets still break backwards compatibility: with or without the separation into two files, they don't adhere to the rule of only one (lexical form, lexical ID) pair per lexicographer file. so we create a mapping between PWN sense keys and mill sense keys when bootstrapping the data. mill sense keys should be more stable than PWN sense keys, since they are semantically the same except that the head word and head ID are excluded in the case of satellite adjectives (see wiki for more info).
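
a rough sketch of what such a bootstrap mapping could look like follows. the PWN field layout (lemma%ss_type:lex_filenum:lex_id:head_word:head_id) is the one given in the PWN documentation; everything else is an assumption made for illustration: the function names, the way fresh lexical IDs are assigned, the rendering of the mill-style key with empty head fields, and the two example keys, which are made up and not checked against the PWN.

```python
from typing import NamedTuple


class SenseKey(NamedTuple):
    lemma: str
    ss_type: int       # 5 marks an adjective satellite in the PWN
    lex_filenum: int
    lex_id: int
    head_word: str     # only filled in for adjective satellites
    head_id: str


def parse_pwn_sense_key(key):
    """Split a PWN sense key into its documented fields."""
    lemma, rest = key.rsplit("%", 1)
    ss_type, lex_filenum, lex_id, head_word, head_id = rest.split(":")
    return SenseKey(lemma, int(ss_type), int(lex_filenum), int(lex_id),
                    head_word, head_id)


def build_mapping(pwn_keys):
    """Map each PWN sense key to a key without head word and head ID.

    Assumption: lexical IDs are handed out in sorted order so that
    (lexicographer file, lexical form, lexical ID) is unique again.
    """
    next_id = {}  # (lemma, lex_filenum) -> next free lexical ID
    mapping = {}
    for key in sorted(pwn_keys):
        k = parse_pwn_sense_key(key)
        slot = (k.lemma, k.lex_filenum)
        new_id = next_id.get(slot, 0)
        next_id[slot] = new_id + 1
        # hypothetical rendering: head fields left empty
        mapping[key] = f"{k.lemma}%{k.ss_type}:{k.lex_filenum:02d}:{new_id:02d}::"
    return mapping


# made-up keys that collide on (lexicographer file, lexical form, lexical ID)
mapping = build_mapping(["dim%5:00:00:indistinct:00", "dim%5:00:00:dark:01"])
```

in this sketch the two colliding satellite senses of "dim" end up with distinct keys, and the mapping dictionary is what preserves the link back to the original PWN keys.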