globalwordnet / cili

The Global WordNet Association Collaborative Inter-Lingual Index

about the mappings PWN30 and PWN31 #16

Open arademaker opened 1 year ago

arademaker commented 1 year ago

Where is the construction of the mappings from CILI to PWN30 and PWN31 documented?

arademaker commented 1 year ago
% rg ^03021531 WordNet-3.0/dict/data.*
WordNet-3.0/dict/data.noun
16359:03021531 06 n 02 chlorambucil 0 Leukeran 0 002 @ 02697438 n 0000 ;u 06845599 n 0201 | an alkalating agent (trade name Leukeran) used to treat some kinds of cancer

% rg ^03025214 WordNet-3.1-dict/data.*
WordNet-3.1-dict/data.noun
16369:03025214 06 n 02 chlorambucil 0 Leukeran 0 002 @ 02700297 n 0000 ;u 06858649 n 0201 | an alkylating agent (trade name Leukeran) used to treat some kinds of cancer

% rg 03021531 WordNet-3.0/dict/index.sense
33222:chlorambucil%1:06:00:: 03021531 1 0
106538:leukeran%1:06:00:: 03021531 1 0

% rg 03025214 WordNet-3.1-dict/index.sense
33231:chlorambucil%1:06:00:: 03025214 1 0
106656:leukeran%1:06:00:: 03025214 1 0

% rg "03021531-n|03025214-n" cili/ili-map-p*
cili/ili-map-pwn30.tab
51874:i51874    03021531-n

So PWN 3.0 03021531-n should be mapped to PWN 3.1 03025214-n, right? But i51874 maps only to the PWN 3.0 synset; ili-map-pwn31.tab has no entry for it.

arademaker commented 1 year ago

In another case, the gloss changed, but it seems to be the same concept:

% rg ^04231905 WordNet-3.0/dict/data.*
WordNet-3.0/dict/data.noun
23507:04231905 06 n 01 Skivvies 0 002 @ 04508949 n 0000 ;u 06851742 n 0000 | men's underwear consisting of cotton T-shirt and shorts

% rg ^04238967 WordNet-3.1-dict/data.*
WordNet-3.1-dict/data.noun
23532:04238967 06 n 01 skivvies 0 003 @ 04516244 n 0000 ;u 06864792 n 0000 ;u 06306016 n 0000 | (used in the plural) men's underwear consisting of cotton undershirt and underpants

% rg 04231905 WordNet-3.0/dict/index.sense
168480:skivvies%1:06:00:: 04231905 1 0

% rg 04238967 WordNet-3.1-dict/index.sense
168716:skivvies%1:06:00:: 04238967 1 0

% rg "04231905-n|04238967-n" cili/ili-map-p*
cili/ili-map-pwn30.tab
59022:i59022    04231905-n
arademaker commented 1 year ago
% rg ^02440996 WordNet-3.0/dict/data.*
WordNet-3.0/dict/data.adj
13550:02440996 00 s 01 inferior 0 002 & 02440691 a 0000 ;c 06057539 n 0000 | lower than a given reference point; "inferior alveolar artery"

% rg ^02450200 WordNet-3.1-dict/data.*
WordNet-3.1-dict/data.adj
13568:02450200 00 s 01 inferior 0 002 & 02449895 a 0000 ;c 06067070 n 0000 | lower than a given reference point; "inferior alveolar artery"

% rg 02440996 WordNet-3.0/dict/index.sense
95599:inferior%5:00:00:bottom:00 02440996 5 0

% rg 02450200 WordNet-3.1-dict/index.sense
95695:inferior%5:00:00:bottom:00 02450200 5 0

% rg "02440996-s|02450200-s" cili/ili-map-p*
cili/ili-map-pwn30.tab
13521:i13521    02440996-s
arademaker commented 1 year ago

The last one

% rg ^01827261 WordNet-3.0/dict/data.*
WordNet-3.0/dict/data.adj
10064:01827261 00 s 01 regent(ip) 0 004 & 01825671 a 0000 ;u 06307152 n 0000 + 10516117 n 0101 + 00598970 n 0101 | acting or functioning as a regent or ruler; "prince-regent"

% rg ^01832979 WordNet-3.1-dict/data.*
WordNet-3.1-dict/data.adj
10068:01832979 00 s 01 regent(ip) 0 004 & 01831389 a 0000 ;u 06318142 n 0000 + 10535710 n 0101 + 00600085 n 0101 | acting or functioning as a regent or ruler; "prince-regent"

% rg 01827261 WordNet-3.0/dict/index.sense
151759:regent%5:00:00:powerful:00 01827261 1 0

% rg 01832979 WordNet-3.1-dict/index.sense
151962:regent%5:00:00:powerful:00 01832979 1 0

% rg "01827261-s|01832979-s" cili/ili-map-p*
cili/ili-map-pwn30.tab
10035:i10035    01827261-s

I can make a PR to change the ili-map-pwn31.tab file if someone can confirm the errors or justify the difference.
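
A minimal sketch of how such gaps could be listed automatically, assuming both map files contain one `ILI<whitespace>synset-id` pair per line (as the `rg` output above suggests). The function names are hypothetical and the in-memory samples stand in for `ili-map-pwn30.tab` and `ili-map-pwn31.tab`:

```python
import io

def load_ili_map(lines):
    """Return {ili_id: synset_id} from lines shaped like 'i51874<TAB>03021531-n'."""
    mapping = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2 or not parts[0].startswith("i"):
            continue  # skip blank lines, comments, or unexpected rows
        mapping[parts[0]] = parts[1]
    return mapping

def missing_in_target(source_map, target_map):
    """ILI ids that have a synset in the source map but none in the target map."""
    return sorted(set(source_map) - set(target_map))

# Tiny stand-ins for the two files (rows copied from this thread).
pwn30 = load_ili_map(io.StringIO("i51874\t03021531-n\ni208\t00041202-s\n"))
pwn31 = load_ili_map(io.StringIO("i208\t00041424-s\n"))
print(missing_in_target(pwn30, pwn31))
```

An ILI reported here is only a candidate: it may be a genuine deprecation rather than a missing link, so each one still needs a manual check.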

jmccrae commented 1 year ago

I can't really explain this, as the mapping was completed by PWN senses, so these should be linked.

I see 274 'new' senses in PWN 3.1 according to OEWN. Some of these are genuinely new (e.g. 'Barack Obama'); others don't seem to be. If you can identify these automatically, it would be a great help.

goodmami commented 1 year ago

FWIW, the first two are listed as deprecated in changes-in-wn31.csv:

$ grep -P 'i51874|i59022|i13521|i10035' changes-in-wn31.csv
deprecated,ili:i51874,03021531-n,none,chlorambucil/Leukeran
deprecated,ili:i59022,04231905-n,none,Skivvies

The other two words are in the file under a different ILI:

$ grep -P 'inferior|regent' changes-in-wn31.csv
deprecated,ili:i13656,01827261-s,none,regent
deprecated,ili:i17142,02440996-s,none,inferior
new,,,01832979-a,regent(ip)
new,,,02450200-a,inferior

@fcbond do you know what is the story here?
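
One way to surface such pairs automatically is to match `deprecated` rows against `new` rows by lemma. The sketch below assumes the column order seen in the rows above (status, ILI, PWN30 id, PWN31 id, lemmas separated by `/`) and that a trailing `(a)`, `(p)` or `(ip)` is a syntactic marker that can be ignored; the function names are hypothetical:

```python
import csv
import io
import re
from collections import defaultdict

def lemma_key(field):
    """Normalise 'regent(ip)' or 'chlorambucil/Leukeran' to a comparable tuple."""
    lemmas = [re.sub(r"\((?:a|p|ip)\)$", "", lemma).lower() for lemma in field.split("/")]
    return tuple(sorted(lemmas))

def pair_deprecated_with_new(rows):
    """Return {lemma_key: (deprecated entries, new PWN31 ids)} for shared lemma sets."""
    deprecated = defaultdict(list)
    new = defaultdict(list)
    for status, ili, wn30, wn31, lemmas in rows:
        if status == "deprecated":
            deprecated[lemma_key(lemmas)].append((ili, wn30))
        elif status == "new":
            new[lemma_key(lemmas)].append(wn31)
    return {key: (deprecated[key], new[key]) for key in deprecated if key in new}

# Rows copied from the grep output above.
sample = io.StringIO(
    "deprecated,ili:i13656,01827261-s,none,regent\n"
    "deprecated,ili:i17142,02440996-s,none,inferior\n"
    "new,,,01832979-a,regent(ip)\n"
    "new,,,02450200-a,inferior\n"
)
pairs = pair_deprecated_with_new(csv.reader(sample))
for key, (old, fresh) in pairs.items():
    print(key, old, fresh)
```

Lemma matching is only a heuristic: a lemma with several deprecated or new senses yields ambiguous candidates, so the output is a review list, not a mapping.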

arademaker commented 1 year ago
M 00929443-s => 1 1 False {('00932684-s', 0)} {'00041424-s'}
 WN30 00929443-s {'dead%5:00:00:extinct:01'}
  not surviving in active use; "Latin is a dead language" 
 WN31 00932684-s {'dead%5:00:00:extinct:01'}
  not surviving in active use; "Latin is a dead language"
 WN31 00041424-s {'dead%5:00:00:extinct:02'}
  physically inactive; "Crater Lake is in the crater of a dead volcano of the Cascade Range"

The mapping says that two concepts from PWN30 were merged in PWN31. But 00041202-s in PWN30 is actually 00041202-a.

% rg "i208\t|i5092\t|\t00041424-s" ../cili/ili-map-p*
../cili/ili-map-pwn30.tab
208:i208    00041202-s
5092:i5092  00929443-s

../cili/ili-map-pwn31.tab
206:i208    00041424-s
5084:i5092  00041424-s

00929443-s should map to 00932684-s

arademaker commented 1 year ago
M 10210648-n => 2 1 False {('10230422-n', 2), ('10230249-n', 4)} {'10230249-n'}
 WN30 10210648-n {'interior_decorator%1:18:00::', 'room_decorator%1:18:00::', 'designer%1:18:02::', 'decorator%1:18:01::', 'house_decorator%1:18:00::', 'interior_designer%1:18:00::'}
  a person who specializes in designing architectural interiors and their furnishings 
 WN31 10230249-n {'interior_designer%1:18:00::', 'designer%1:18:02::'}
  a person who specializes in interior design
 WN31 10230422-n {'decorator%1:18:01::', 'room_decorator%1:18:00::', 'house_decorator%1:18:00::', 'interior_decorator%1:18:00::'}
  a person who specializes in interior decoration

The concept in PWN30 was split into two in PWN31, right? But the mapping is not reflecting that:

% rg "i90722\t|\t10210648-n" ../cili/ili-map-p*
../cili/ili-map-pwn31.tab
90654:i90722    10230249-n

../cili/ili-map-pwn30.tab
90722:i90722    10210648-n

I would say that 10210648-n maps to both 10230249-n and 10230422-n in PWN31.

Same for

M 00040058-s => 2 1 False {('00040305-s', 2), ('00040189-s', 1)} {'00040189-s'}
 WN30 00040058-s {'supine%5:00:00:passive:01', 'unresisting%5:00:00:passive:01', 'resistless%5:00:00:passive:01'}
  offering no resistance; "resistless hostages"; "No other colony showed such supine, selfish helplessness in allowing her own border citizens to be mercilessly harried"- Theodore Roosevelt 
 WN31 00040189-s {'unresisting%5:00:00:passive:01', 'resistless%5:00:00:passive:01'}
  offering no resistance; "resistless hostages"
 WN31 00040305-s {'supine%5:00:00:passive:01'}
  passive as a result of indolence or indifference; "No other colony showed such supine, selfish helplessness in allowing her own border citizens to be mercilessly harried"- Theodore Roosevelt

and also

 WN30 00949619-n {'engineering%1:04:01::', 'technology%1:04:00::'}
  the practical application of science to commerce or industry 
 WN31 00951878-n {'engineering%1:04:01::'}
  the practical application of technical and scientific knowledge to commerce or industry
 WN31 00951435-n {'technology%1:04:00::'}
  the application of the knowledge and usage of tools (such as machines or utensils) and techniques to control one's environment; "the mastery of fire was a huge advance in human technology"
arademaker commented 1 year ago
M 06823760-n => 2 1 False {('06836640-n', 2), ('06836790-n', 1)} {'06836640-n'}
 WN30 06823760-n {'umlaut%1:10:00::', 'dieresis%1:10:00::', 'diaeresis%1:10:00::'}
  a diacritical mark (two dots) placed over a vowel in German to indicate a change in sound 
 WN31 06836640-n {'umlaut%1:10:00::'}
  a diacritical mark (two dots) placed over a vowel to indicate a change in sound in some languages
 WN31 06836790-n {'dieresis%1:10:00::', 'diaeresis%1:10:00::'}
  a diacritical mark (two dots) placed over a vowel to indicate that it does not form a diphthong with an adjacent vowel

Is this the same case as above?

% rg "i72354\t|\t06823760" ../cili/ili-map-p*
../cili/ili-map-pwn31.tab
72312:i72354    06836640-n

../cili/ili-map-pwn30.tab
72354:i72354    06823760-n

We can say the sense was split, so the PWN30 synset needs to map to both synsets in PWN31.

= 06823760-n 06836640-n
= 06823760-n 06836790-n

Or we can say that none of the new synsets are replacements for the old PWN30 synset; they are generalizations. So PWN30 is <= both PWN31.

That would force us to extend the mapping to deal with more fine-grained relations rather than only equality. BTW, can someone see the reason for splitting this sense from PWN30?
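
To make the idea of finer-grained relations concrete, here is a hypothetical sketch of a mapping record that carries a relation symbol next to the two synset ids. The class, the field names and the choice of `=` and `<=` as symbols follow the notation used in this comment; they are not an existing CILI format:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SynsetLink:
    source: str    # PWN30 synset id
    target: str    # PWN31 synset id
    relation: str  # "=" for equality, "<=" as used in the comment above

# The umlaut/dieresis case recorded under the second reading (PWN30 <= both PWN31).
links = [
    SynsetLink("06823760-n", "06836640-n", "<="),
    SynsetLink("06823760-n", "06836790-n", "<="),
]

def targets(links, source, relation=None):
    """PWN31 targets of a PWN30 synset, optionally restricted to one relation."""
    return [
        link.target
        for link in links
        if link.source == source and (relation is None or link.relation == relation)
    ]

print(targets(links, "06823760-n"))       # both PWN31 synsets
print(targets(links, "06823760-n", "="))  # no strict equivalent recorded
```

With a record like this, a consumer that only trusts equality can filter on `=` and simply get no target for split synsets.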

arademaker commented 1 year ago

Here the mapping seems right, but the PWN31 structure can be changed:

M 10012484-n => 2 1 False {('10032138-n', 2), ('10032289-n', 1)} {'10032138-n'}
 WN30 10012484-n {'nutritionist%1:18:00::', 'dietician%1:18:00::', 'dietitian%1:18:00::'}
  a specialist in the study of nutrition 
 WN31 10032289-n {'dietician%1:18:00::', 'dietitian%1:18:00::'}
  a specialist in the study of diet and nutrition
 WN31 10032138-n {'nutritionist%1:18:00::'}
  a specialist in the study of nutrition

If someone is a dietician, they are also a nutritionist, right? Because X ∧ Y ⊆ X, so 10032289-n is a hyponym of 10032138-n? But they are sisters in PWN 3.1.

arademaker commented 1 year ago
M 00042692-s => 1 1 False {('00042912-s', 0)} {'00035037-s'}
 WN30 00042692-s {'activated%5:00:00:active:07'}
  rendered active; e.g. rendered radioactive or luminescent or photosensitive or conductive 
 WN31 00035037-s {'activated%5:00:00:active:08'}
  (military) set up and placed on active assignment; "a newly activated unit"
 WN31 00042912-s {'activated%5:00:00:active:07'}
  rendered active; e.g. rendered radioactive or luminescent or photosensitive or conductive

The mapping is wrong. It points 00042692-s to 00035037-s, but 00042912-s is more appropriate, right?

% rg "i216\t|\t00042692-s" ../cili/ili-map-p*
../cili/ili-map-pwn31.tab
214:i216    00035037-s

../cili/ili-map-pwn30.tab
216:i216    00042692-s
arademaker commented 1 year ago
M 02312060-s => 1 1 False {('02319740-s', 0)} {'02319740-a'}
 WN30 02312060-s {'akimbo%5:00:00:crooked:01'}
  (used of arms and legs) bent outward with the joint away from the body; "a tailor sitting with legs akimbo"; "stood with arms akimbo" 
 WN31 02319740-a set()
  (used of arms and legs) bent outward with the joint away from the body; "a tailor sitting with legs akimbo"; "stood with arms akimbo"
 WN31 02319740-s {'akimbo%5:00:00:crooked:01'}
  (used of arms and legs) bent outward with the joint away from the body; "a tailor sitting with legs akimbo"; "stood with arms akimbo"

There is no 02319740-a in PWN31, only 02319740-s; it is a satellite synset.

% rg ^02319740 ../WordNet-3.1-dict/data.adj
12840:02319740 00 s 01 akimbo(ip) 0 001 & 02319224 a 0000 | (used of arms and legs) bent outward with the joint away from the body; "a tailor sitting with legs akimbo"; "stood with arms akimbo"
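
If the comparison script should treat `-a` and `-s` ids with the same offset as the same synset, one option is to normalise the suffix before comparing. This is a sketch of that idea, not what the tool that produced the `M` lines does; it relies on head and satellite adjectives both living in `data.adj`, so an offset is unique across the two types:

```python
def normalize_synset_id(synset_id):
    """Fold the satellite suffix 's' into 'a' so '02319740-s' equals '02319740-a'."""
    offset, pos = synset_id.rsplit("-", 1)
    return f"{offset}-{'a' if pos == 's' else pos}"

def same_synset(left, right):
    """Compare two synset ids while ignoring the head/satellite distinction."""
    return normalize_synset_id(left) == normalize_synset_id(right)

print(same_synset("02319740-s", "02319740-a"))  # same offset, adjective vs satellite
print(same_synset("02312060-s", "02319740-a"))  # different offsets
```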
arademaker commented 1 year ago
M 00675928-s => 1 1 False {('00679196-s', 0)} {'00679361-s'}
 WN30 00675928-s {'alternating%5:00:01:cyclic:01', 'alternate%5:00:01:cyclic:01'}
  occurring by turns; first one and then the other; "alternating feelings of love and hate" 
 WN31 00679196-s {'alternating%5:00:01:cyclic:01', 'alternate%5:00:01:cyclic:01'}
  occurring by turns; first one and then the other; "alternating feelings of love and hate"
 WN31 00679361-s {'alternate%5:00:02:cyclic:01'}
  every second one of a series; "the cleaning lady comes on alternate Wednesdays"; "jam every other day"- the White Queen

Once more, it seems that the mapping is wrong. 00675928-s from PWN30 is 00679196-s in PWN31, not 00679361-s.

% rg "i3764\t|\t00675928-s" ../cili/ili-map-p*
../cili/ili-map-pwn31.tab
3758:i3764  00679361-s

../cili/ili-map-pwn30.tab
3764:i3764  00675928-s
arademaker commented 1 year ago
M 02713992-n => 2 0 False {('02716929-n', 1), ('02716785-n', 1)} set()
 WN30 02713992-n {'roundel%1:06:01::', 'annulet%1:06:02::'}
  (heraldry) a charge in the shape of a circle; "a hollow roundel" 
 WN31 02716785-n {'roundel%1:06:01::'}
  (heraldry) a charge in the shape of a filled circle; "a hollow roundel"
 WN31 02716929-n {'annulet%1:06:02::'}
  (heraldry) a charge in the shape of a small ring

The mapping is missing 02713992-n to 02716929-n, or maybe we can map to both 02716785-n and 02716929-n. But considering the example and definition, mapping to 02716785-n looks better.

arademaker commented 1 year ago

The mapping is clearly wrong. 02599754-v from PWN30 should map to 02605751-v in PWN31.

M 02599754-v => 1 1 False {('02605751-v', 0)} {'00680696-v'}
 WN30 02599754-v {'book%2:41:03::'}
  register in a hotel booker 
 WN31 02605751-v {'book%2:41:03::'}
  register in a hotel booker
 WN31 00680696-v {'book%2:31:00::'}
  engage for a performance; "Her agent had booked her for several concerts in Tokyo"
% rg "i34688\t|\t02599754-v" ../cili/ili-map-p*
../cili/ili-map-pwn31.tab
34655:i34688    00680696-v

../cili/ili-map-pwn30.tab
34688:i34688    02599754-v
arademaker commented 1 year ago

The mapping is clearly wrong. Both Brioschi and Tums are antacids, and both exist in PWN30 and PWN31. The mapping points 14777104-n to 14802098-n, but it should point to 14801263-n.

M 14777104-n => 1 1 False {('14801263-n', 0)} {'14802098-n'}
 WN30 14777104-n {'brioschi%1:27:00::'}
  an antacid 
 WN31 14801263-n {'brioschi%1:27:00::'}
  an antacid
 WN31 14802098-n {'tums%1:27:00::'}
  an antacid

Having both trademarks in WN is strange anyway... but we do not remove them, right? Just do not add more of those in the English Wordnet.

The same error: 14777188-n should map to 14801347-n, not to 14802098-n.

M 14777188-n => 1 1 False {('14801347-n', 0)} {'14802098-n'}
 WN30 14777188-n {'bromo-seltzer%1:27:00::'}
  an antacid 
 WN31 14802098-n {'tums%1:27:00::'}
  an antacid
 WN31 14801347-n {'bromo-seltzer%1:27:00::'}
  an antacid

Same error: 14777441-n should map to 14801600-n, not 14802098-n.

M 14777441-n => 1 1 False {('14801600-n', 0)} {'14802098-n'}
 WN30 14777441-n {'maalox%1:27:00::'}
  an antacid 
 WN31 14801600-n {'maalox%1:27:00::'}
  an antacid
 WN31 14802098-n {'tums%1:27:00::'}
  an antacid
arademaker commented 1 year ago

The mapping is wrong: 00767349-s should map to 00770909-s, not to 00766556-s.

M 00767349-s => 1 1 False {('00770909-s', 0)} {'00766556-s'}
 WN30 00767349-s {'roundabout%5:00:00:indirect:02', 'circuitous%5:00:00:indirect:02'}
  marked by obliqueness or indirection in speech or conduct; "the explanation was circuitous and puzzling"; "a roundabout paragraph"; "hear in a roundabout way that her ex-husband was marrying her best friend" 

 WN31 00766556-s {'devious%5:00:00:indirect:00', 'roundabout%5:00:00:indirect:00', 'circuitous%5:00:00:indirect:00'}
  deviating from a straight course; "a scenic but devious route"; "a long and circuitous journey by train and boat"; "a roundabout route avoided rush-hour traffic"

 WN31 00770909-s {'roundabout%5:00:00:indirect:02', 'circuitous%5:00:00:indirect:02'}
  marked by obliqueness or indirection in speech or conduct; "the explanation was circuitous and puzzling"; "a roundabout paragraph"; "hear in a roundabout way that her ex-husband was marrying her best friend"

Same for 00802179-s: it should map to 00805750-s, not 00805871-s.

M 00802179-s => 1 1 False {('00805750-s', 0)} {'00805871-s'}
 WN30 00802179-s {'keen%5:00:00:sharp:00'}
  having a sharp cutting edge or point; "a keen blade" 
 WN31 00805750-s {'keen%5:00:00:sharp:00'}
  having a sharp cutting edge or point; "a keen blade"
 WN31 00805871-s {'knifelike%5:00:00:sharp:00'}
  cutting or able to cut as if with a knife

And 00780944-s should map to 00784503-s, not 00805750-s.

M 00780944-s => 1 1 False {('00784503-s', 0)} {'00805750-s'}
 WN30 00780944-s {'knifelike%5:00:00:distinct:00'}
  having a sharp or distinct edge; "a narrow knifelike profile" 
 WN31 00784503-s {'knifelike%5:00:00:distinct:00'}
  having a sharp or distinct edge; "a narrow knifelike profile"
 WN31 00805750-s {'keen%5:00:00:sharp:00'}
  having a sharp cutting edge or point; "a keen blade"

Or could we merge these senses from PWN30?

% rg "i4419\t|i4304\t|\t00780944-s|\t00802179-s" ../cili/ili-map-p*
../cili/ili-map-pwn31.tab
4297:i4304  00805750-s
4412:i4419  00805871-s

../cili/ili-map-pwn30.tab
4304:i4304  00780944-s
4419:i4419  00802179-s
arademaker commented 1 year ago

I am finding more cases, but I will stop here. I would love to have some feedback, @goodmami @jmccrae @fcbond. I need to make sure that my observations above make sense and that I am not missing something. I will update PR #17 with all my observations.

arademaker commented 1 year ago

In https://github.com/globalwordnet/cili/issues/16#issuecomment-1269264133, @goodmami mentioned the changes in PWN31 proposed by @fcbond. The question here seems to be: the mapping of PWN30 to PWN31 was created on top of the Princeton releases, right? Later, we may have changes (or patches) for both PWN30 and PWN31.

ekaf commented 1 year ago

@arademaker, your specific examples, and more, correspond to what I find in the attached output file (en-loss.txt), produced using the Wn library from @goodmami and the sensekey-based mapping algorithm included in recent NLTK versions. The attached file shows the details for the English row of the losses table in my forthcoming conference paper, which you saw me present last week at GWC 2023:

English 117659 117454 205 0.17 117659 117427 232 0.2

The first 4 numbers in that row concern synsets mapped vs. lost with an offset mapping, and the 4 last numbers concern the ILI mapping. The respective losses (205 with offsets vs. 232 with ILI) can be decomposed like this:

English, 143 lost with both offsets and ILI
English, 89 lost only with ILI
English, 62 lost only with offsets

It is great that you have already started to improve the mappings. Congratulations on that: if your proposed targets can be verified, they will bring us closer to the perfect mappings, which seemed out of reach not long ago!

@jmccrae, the most probable reason why a sense-based mapping misses the above cases is that their sensekey underwent a small alteration, for example a change of lexfile or lex_id. Having two sensekeys for the same sense is not a key violation, though. A database key violation would occur if the same key referred to distinct senses, which does not seem to be the case here. As an example, it is fine to use the same key for Pluto and define it as either an asteroid or a planet, because it still refers to the same physical object.
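
To check which part of a sensekey changed between two releases, the key can be split into its documented fields (`lemma%ss_type:lex_filenum:lex_id:head_word:head_id`). A minimal sketch with hypothetical helper names, using keys quoted earlier in this thread:

```python
from typing import NamedTuple

class SenseKey(NamedTuple):
    lemma: str
    ss_type: str
    lex_filenum: str
    lex_id: str
    head_word: str
    head_id: str

def parse_sense_key(key):
    """Split 'dead%5:00:00:extinct:01' into its six fields."""
    lemma, rest = key.rsplit("%", 1)
    return SenseKey(lemma, *rest.split(":"))

def differing_fields(left, right):
    """Names of the fields in which two sensekeys differ."""
    a, b = parse_sense_key(left), parse_sense_key(right)
    return [name for name in SenseKey._fields if getattr(a, name) != getattr(b, name)]

print(differing_fields("alternate%5:00:01:cyclic:01", "alternate%5:00:02:cyclic:01"))
print(differing_fields("activated%5:00:00:active:07", "activated%5:00:00:active:08"))
```

Two keys that differ only in `lex_filenum`, `lex_id` or `head_id` are candidates for "same sense, altered key", which is the situation described above.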

Splits are a different problem. Mapping whole synsets cannot handle splits correctly, since splits concern different parts of a synset. Mapping a split synset to two target synsets produces false positives, because every involved sense would get mapped to one correct and one wrong target. Mapping a split to only a single target is still wrong, but yields fewer false positives. The only adequate treatment for splits is a mapping that handles only the concerned senses.
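
The sense-level treatment can be sketched as follows. This is a simplified illustration that matches sensekeys exactly; it is not the NLTK algorithm mentioned above, and the function names are hypothetical. The sample data is the umlaut/dieresis split quoted earlier in the thread:

```python
def map_senses(source_synsets, target_synsets):
    """Map each (source synset, sensekey) to the target synset holding the same key."""
    key_to_target = {
        key: synset_id for synset_id, keys in target_synsets.items() for key in keys
    }
    return {
        (synset_id, key): key_to_target.get(key)  # None when the key was lost
        for synset_id, keys in source_synsets.items()
        for key in sorted(keys)
    }

def split_synsets(sense_map):
    """Source synsets whose senses land in more than one target synset."""
    targets = {}
    for (synset_id, _key), target in sense_map.items():
        if target is not None:
            targets.setdefault(synset_id, set()).add(target)
    return {sid: found for sid, found in targets.items() if len(found) > 1}

wn30 = {"06823760-n": {"umlaut%1:10:00::", "dieresis%1:10:00::", "diaeresis%1:10:00::"}}
wn31 = {
    "06836640-n": {"umlaut%1:10:00::"},
    "06836790-n": {"dieresis%1:10:00::", "diaeresis%1:10:00::"},
}
sense_map = map_senses(wn30, wn31)
print(sense_map)
print(split_synsets(sense_map))
```

Each sense gets exactly one target here, so a split does not produce the false positives that a synset-to-two-synsets mapping would.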