own-pt / mill

crunch textual wordnet data
Apache License 2.0

proposal: get rid of lexical ids #42

Closed odanoburu closed 4 years ago

odanoburu commented 4 years ago

#40 discusses the identification of satellite adjectives, which in PWN follow a different identification scheme from the one used by other kinds of synsets. in mill we tried to make identification more regular by having satellite adjectives use the same scheme as every other kind of synset (which ultimately led to the issue discussed in #40).

a more radical idea is to stop trying to have satellite adjectives behave like other synsets, and have the other synsets behave like satellite adjectives. what follows is a proposal for a new wordsense ID scheme.

lexical ids have no meaning whatsoever; they are solely an ad hoc way of preventing ID clashes, because the combination (lexical form, lexicographer file) is not enough to uniquely determine a wordsense. we could get rid of lexical ids by generalizing a version of the ID scheme formerly used by adjective satellites, which can be uniquely identified by (lexical form, head synset).

nouns and verbs could be identified by (lexfile, lexical form, hypernym) (or hyponym?). pertainyms (adjectives or adverbs) could be identified by the wordsense they pertain to (plus lexfile and lexical form).

all in all, we define a 'core' relation for each 'kind' of wordsense/synset, and use the relation's target plus the lexical form of the source to identify the wordsense/synset. mill would of course have to be able to verify the uniqueness of this naming scheme. we also wouldn't have to identify the core target beyond its lexical form unless that is not sufficient to satisfy the uniqueness constraint.
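the uniqueness check described above could look roughly like the following. this is only a minimal sketch in Python, not mill's actual code: the `Sense` record, its field names and the `find_clashes` helper are all hypothetical, and the sample entries are invented toy data rather than real PWN senses. it groups senses by the proposed key (lexfile, lexical form, lexical form of the core target) and reports every key shared by more than one sense.

```python
from collections import defaultdict
from typing import NamedTuple


class Sense(NamedTuple):
    """Hypothetical record for one wordsense (not mill's data model)."""
    lexfile: str      # lexicographer file
    form: str         # lexical form of the sense itself
    core_target: str  # lexical form of the core relation's target
    gloss: str        # definition; not part of the key


def find_clashes(senses):
    """Return every proposed key that is shared by more than one sense."""
    by_key = defaultdict(list)
    for sense in senses:
        key = (sense.lexfile, sense.form, sense.core_target)
        by_key[key].append(sense)
    return {key: group for key, group in by_key.items() if len(group) > 1}


# Invented toy data: the first two entries share the same key on purpose.
toy_senses = [
    Sense("noun.toy", "foo", "thing", "first made-up sense"),
    Sense("noun.toy", "foo", "thing", "second made-up sense"),
    Sense("noun.toy", "foo", "gadget", "third made-up sense"),
    Sense("noun.toy", "bar", "thing", "fourth made-up sense"),
]

clashes = find_clashes(toy_senses)
for key, group in clashes.items():
    print(key, "is ambiguous between", len(group), "senses")
```

any key reported here would need a further discriminator before the scheme could be used as an ID.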

but this is all very radical, so I don't know if it should be implemented.

odanoburu commented 4 years ago

if this is to be implemented, we need to take care of these cases:

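# ns1 is the prefix of the wordnet vocabulary in the queried dataset;
# it must be declared as well before the query can run
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# pairs of noun wordsenses that share lexicographer file, hypernym and
# label, but belong to different synsets
select distinct ?ws1 ?ws2
where {
  ?s1 rdf:type ns1:NounSynset .
  ?s2 rdf:type ns1:NounSynset .
  # same lexicographer file
  ?s1 ns1:lexicographerFile ?lf .
  ?s2 ns1:lexicographerFile ?lf .
  ?s1 ns1:containsWordSense ?ws1 .
  ?s2 ns1:containsWordSense ?ws2 .
  # same hypernym
  ?s1 ns1:hyponymOf ?s .
  ?s2 ns1:hyponymOf ?s .
  # same lexical form
  ?ws1 rdfs:label ?l .
  ?ws2 rdfs:label ?l .
  filter (?s1 != ?s2)
} limit 10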

that is, we can't really assume a 'core' relation; any relation will have to do.

odanoburu commented 4 years ago

we have analysed a few cases and it seems that most ambiguous references are arguably errors in PWN. in order to be able to differentiate them before correction, we can add a placeholder/marker relation (aptly named lexicalId) pointing to the synset of an ordinal.

if any case arises where we can't differentiate because our scheme is too restrictive (we demand at least one positive discriminator), we can decide to add negative discriminators (an idea that emerged when talking to @hmuniz). in the case below,

w: volume drf adjs.all:voluminous(sim big)
d: the amount of 3-dimensional space occupied by an object
e: the gas expanded to twice its original volume
hyper: content
hypo: noun.Tops:measure
mp: cubic_measure

w: volume
d: a relative amount
e: mix one volume of the solution with ten volumes of water
hypo: noun.Tops:measure

the former volume can be distinguished by the presence of a hyper relation to content. but the latter has no positive discriminator; if we allow negative discriminators, we could refer to it by saying 'the sense of volume which has no hyper relation to content'.
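to make the positive/negative distinction concrete, here is a small Python sketch under stated assumptions. the `discriminators` function and the representation of a sense as a set of (relation, target) pairs are hypothetical and not part of mill; only the synset-level relations of the two volume entries above are used. a positive discriminator is a pair the sense has and no competing sense has; a negative discriminator is a pair that every competing sense has and the sense itself lacks.

```python
def discriminators(target, others):
    """Classify how `target` can be told apart from the competing senses.

    `target` is a set of (relation, target_form) pairs and `others` is a
    list of such sets, one per competing sense with the same lexical form.
    Returns a (kind, pairs) tuple.
    """
    if not others:
        return ("none_needed", [])
    seen_elsewhere = set().union(*others)
    positive = target - seen_elsewhere
    if positive:
        return ("positive", sorted(positive))
    # Fall back to pairs shared by all competitors but absent from target.
    negative = set.intersection(*others) - target
    if negative:
        return ("negative", sorted(negative))
    return ("ambiguous", [])


# Relations copied from the two 'volume' entries quoted above.
volume_space = {
    ("hyper", "content"),
    ("hypo", "noun.Tops:measure"),
    ("mp", "cubic_measure"),
}
volume_amount = {
    ("hypo", "noun.Tops:measure"),
}

print(discriminators(volume_space, [volume_amount]))
print(discriminators(volume_amount, [volume_space]))
```

with these inputs the first sense gets positive discriminators (the hyper relation to content among them), while the second only gets negative ones, which matches the description above. an "ambiguous" result would mean that even negative discriminators are not enough.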

odanoburu commented 4 years ago

we have given up on this for now; the instability of sense keys is a major downside, and it is much more complex to implement (needs more support for editing too)