Open arademaker opened 1 year ago
% rg ^03021531 WordNet-3.0/dict/data.*
WordNet-3.0/dict/data.noun
16359:03021531 06 n 02 chlorambucil 0 Leukeran 0 002 @ 02697438 n 0000 ;u 06845599 n 0201 | an alkalating agent (trade name Leukeran) used to treat some kinds of cancer
% rg ^03025214 WordNet-3.1-dict/data.*
WordNet-3.1-dict/data.noun
16369:03025214 06 n 02 chlorambucil 0 Leukeran 0 002 @ 02700297 n 0000 ;u 06858649 n 0201 | an alkylating agent (trade name Leukeran) used to treat some kinds of cancer
% rg 03021531 WordNet-3.0/dict/index.sense
33222:chlorambucil%1:06:00:: 03021531 1 0
106538:leukeran%1:06:00:: 03021531 1 0
% rg 03025214 WordNet-3.1-dict/index.sense
33231:chlorambucil%1:06:00:: 03025214 1 0
106656:leukeran%1:06:00:: 03025214 1 0
% rg "03021531-n|03025214-n" cili/ili-map-p*
cili/ili-map-pwn30.tab
51874:i51874 03021531-n
So PWN 3.0 03021531-n should be mapped to PWN 3.1 03025214-n, right? But i51874 appears only in ili-map-pwn30.tab; it has no entry in the PWN 3.1 map.
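The check above can also be done for all ILIs at once. This is only a minimal sketch: it assumes each map line is an ILI identifier and a synset ID separated by whitespace (the numeric prefixes in the rg output are line numbers, not file content), and `parse_ili_map` is a hypothetical helper, not part of any CILI tooling. The sample rows are copied from this thread.

```python
def parse_ili_map(text):
    """Parse 'ILI<whitespace>synset-id' lines into a dict (assumed file layout)."""
    mapping = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        ili, synset = fields
        mapping[ili] = synset
    return mapping

# Sample rows taken from the rg output in this thread.
pwn30 = parse_ili_map("i51874\t03021531-n\ni59022\t04231905-n\ni90722\t10210648-n")
pwn31 = parse_ili_map("i90722\t10230249-n")

# ILIs that have a PWN 3.0 synset but no PWN 3.1 synset.
missing = sorted(set(pwn30) - set(pwn31))
print(missing)
```

Run against the full ili-map-pwn30.tab and ili-map-pwn31.tab, the same set difference would list every ILI that needs either a PWN 3.1 target or a justification for its absence.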
In another case, the gloss changed, but it seems to be the same concept:
% rg ^04231905 WordNet-3.0/dict/data.*
WordNet-3.0/dict/data.noun
23507:04231905 06 n 01 Skivvies 0 002 @ 04508949 n 0000 ;u 06851742 n 0000 | men's underwear consisting of cotton T-shirt and shorts
% rg ^04238967 WordNet-3.1-dict/data.*
WordNet-3.1-dict/data.noun
23532:04238967 06 n 01 skivvies 0 003 @ 04516244 n 0000 ;u 06864792 n 0000 ;u 06306016 n 0000 | (used in the plural) men's underwear consisting of cotton undershirt and underpants
% rg 04231905 WordNet-3.0/dict/index.sense
168480:skivvies%1:06:00:: 04231905 1 0
% rg 04238967 WordNet-3.1-dict/index.sense
168716:skivvies%1:06:00:: 04238967 1 0
% rg "04231905-n|04238967-n" cili/ili-map-p*
cili/ili-map-pwn30.tab
59022:i59022 04231905-n
% rg ^02440996 WordNet-3.0/dict/data.*
WordNet-3.0/dict/data.adj
13550:02440996 00 s 01 inferior 0 002 & 02440691 a 0000 ;c 06057539 n 0000 | lower than a given reference point; "inferior alveolar artery"
% rg ^02450200 WordNet-3.1-dict/data.*
WordNet-3.1-dict/data.adj
13568:02450200 00 s 01 inferior 0 002 & 02449895 a 0000 ;c 06067070 n 0000 | lower than a given reference point; "inferior alveolar artery"
% rg 02440996 WordNet-3.0/dict/index.sense
95599:inferior%5:00:00:bottom:00 02440996 5 0
% rg 02450200 WordNet-3.1-dict/index.sense
95695:inferior%5:00:00:bottom:00 02450200 5 0
% rg "02440996-s|02450200-s" cili/ili-map-p*
cili/ili-map-pwn30.tab
13521:i13521 02440996-s
The last one:
% rg ^01827261 WordNet-3.0/dict/data.*
WordNet-3.0/dict/data.adj
10064:01827261 00 s 01 regent(ip) 0 004 & 01825671 a 0000 ;u 06307152 n 0000 + 10516117 n 0101 + 00598970 n 0101 | acting or functioning as a regent or ruler; "prince-regent"
% rg ^01832979 WordNet-3.1-dict/data.*
WordNet-3.1-dict/data.adj
10068:01832979 00 s 01 regent(ip) 0 004 & 01831389 a 0000 ;u 06318142 n 0000 + 10535710 n 0101 + 00600085 n 0101 | acting or functioning as a regent or ruler; "prince-regent"
% rg 01827261 WordNet-3.0/dict/index.sense
151759:regent%5:00:00:powerful:00 01827261 1 0
% rg 01832979 WordNet-3.1-dict/index.sense
151962:regent%5:00:00:powerful:00 01832979 1 0
% rg "01827261-s|01832979-s" cili/ili-map-p*
cili/ili-map-pwn30.tab
10035:i10035 01827261-s
I can make a PR to change the ili-map-pwn31.tab file if someone can confirm the errors or justify the difference.
I can't really explain this, as the mapping was completed by PWN senses, so these should be linked.
I see 274 'new' senses in PWN 3.1 according to OEWN. Some of these are genuinely new (e.g. 'Barack Obama'); others don't seem to be. If you can identify these automatically, it would be a great help.
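One way to identify such cases automatically is to pair synsets across versions through shared sense keys in index.sense. The sketch below assumes the usual index.sense layout (`sense_key synset_offset sense_number tag_cnt`) and uses the lines quoted earlier in this thread as sample data; `read_index_sense` and `sensekey_pairs` are hypothetical helper names.

```python
from collections import defaultdict

# ss_type digit in a sense key -> part-of-speech letter used in synset IDs.
SS_TYPES = {"1": "n", "2": "v", "3": "a", "4": "r", "5": "s"}

def read_index_sense(text):
    """Map each sense key to an 'offset-pos' synset ID."""
    senses = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        key, offset = fields[0], fields[1]
        ss_type = key.rsplit("%", 1)[1].split(":")[0]
        senses[key] = f"{offset}-{SS_TYPES[ss_type]}"
    return senses

def sensekey_pairs(old, new):
    """Pair old and new synsets that share at least one sense key."""
    pairs = defaultdict(set)
    for key in old.keys() & new.keys():
        pairs[old[key]].add(new[key])
    return dict(pairs)

WN30 = """chlorambucil%1:06:00:: 03021531 1 0
leukeran%1:06:00:: 03021531 1 0
skivvies%1:06:00:: 04231905 1 0"""
WN31 = """chlorambucil%1:06:00:: 03025214 1 0
leukeran%1:06:00:: 03025214 1 0
skivvies%1:06:00:: 04238967 1 0"""

pairs = sensekey_pairs(read_index_sense(WN30), read_index_sense(WN31))
print(pairs)
```

A PWN 3.1 synset listed as 'new' that nevertheless shows up as a target in `pairs` is a candidate for linking to an existing ILI rather than receiving a new one.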
FWIW, the first two are listed as deprecated in changes-in-wn31.csv:
$ grep -P 'i51874|i59022|i13521|i10035' changes-in-wn31.csv
deprecated,ili:i51874,03021531-n,none,chlorambucil/Leukeran
deprecated,ili:i59022,04231905-n,none,Skivvies
The other two words are in the file under a different ILI:
$ grep -P 'inferior|regent' changes-in-wn31.csv
deprecated,ili:i13656,01827261-s,none,regent
deprecated,ili:i17142,02440996-s,none,inferior
new,,,01832979-a,regent(ip)
new,,,02450200-a,inferior
@fcbond do you know what is the story here?
M 00929443-s => 1 1 False {('00932684-s', 0)} {'00041424-s'}
WN30 00929443-s {'dead%5:00:00:extinct:01'}
not surviving in active use; "Latin is a dead language"
WN31 00932684-s {'dead%5:00:00:extinct:01'}
not surviving in active use; "Latin is a dead language"
WN31 00041424-s {'dead%5:00:00:extinct:02'}
physically inactive; "Crater Lake is in the crater of a dead volcano of the Cascade Range"
The mapping says that two concepts from PWN30 were merged in PWN31. But 00041202-s in PWN30 is actually 00041202-a.
% rg "i208\t|i5092\t|\t00041424-s" ../cili/ili-map-p*
../cili/ili-map-pwn30.tab
208:i208 00041202-s
5092:i5092 00929443-s
../cili/ili-map-pwn31.tab
206:i208 00041424-s
5084:i5092 00041424-s
00929443-s should map to 00932684-s
M 10210648-n => 2 1 False {('10230422-n', 2), ('10230249-n', 4)} {'10230249-n'}
WN30 10210648-n {'interior_decorator%1:18:00::', 'room_decorator%1:18:00::', 'designer%1:18:02::', 'decorator%1:18:01::', 'house_decorator%1:18:00::', 'interior_designer%1:18:00::'}
a person who specializes in designing architectural interiors and their furnishings
WN31 10230249-n {'interior_designer%1:18:00::', 'designer%1:18:02::'}
a person who specializes in interior design
WN31 10230422-n {'decorator%1:18:01::', 'room_decorator%1:18:00::', 'house_decorator%1:18:00::', 'interior_decorator%1:18:00::'}
a person who specializes in interior decoration
The concept in PWN30 was split into two in PWN31, right? But the mapping is not reflecting that:
% rg "i90722\t|\t10210648-n" ../cili/ili-map-p*
../cili/ili-map-pwn31.tab
90654:i90722 10230249-n
../cili/ili-map-pwn30.tab
90722:i90722 10210648-n
I would say that 10210648-n maps to both 10230249-n and 10230422-n in PWN31.
Same for
M 00040058-s => 2 1 False {('00040305-s', 2), ('00040189-s', 1)} {'00040189-s'}
WN30 00040058-s {'supine%5:00:00:passive:01', 'unresisting%5:00:00:passive:01', 'resistless%5:00:00:passive:01'}
offering no resistance; "resistless hostages"; "No other colony showed such supine, selfish helplessness in allowing her own border citizens to be mercilessly harried"- Theodore Roosevelt
WN31 00040189-s {'unresisting%5:00:00:passive:01', 'resistless%5:00:00:passive:01'}
offering no resistance; "resistless hostages"
WN31 00040305-s {'supine%5:00:00:passive:01'}
passive as a result of indolence or indifference; "No other colony showed such supine, selfish helplessness in allowing her own border citizens to be mercilessly harried"- Theodore Roosevelt
and also
WN30 00949619-n {'engineering%1:04:01::', 'technology%1:04:00::'}
the practical application of science to commerce or industry
WN31 00951878-n {'engineering%1:04:01::'}
the practical application of technical and scientific knowledge to commerce or industry
WN31 00951435-n {'technology%1:04:00::'}
the application of the knowledge and usage of tools (such as machines or utensils) and techniques to control one's environment; "the mastery of fire was a huge advance in human technology"
M 06823760-n => 2 1 False {('06836640-n', 2), ('06836790-n', 1)} {'06836640-n'}
WN30 06823760-n {'umlaut%1:10:00::', 'dieresis%1:10:00::', 'diaeresis%1:10:00::'}
a diacritical mark (two dots) placed over a vowel in German to indicate a change in sound
WN31 06836640-n {'umlaut%1:10:00::'}
a diacritical mark (two dots) placed over a vowel to indicate a change in sound in some languages
WN31 06836790-n {'dieresis%1:10:00::', 'diaeresis%1:10:00::'}
a diacritical mark (two dots) placed over a vowel to indicate that it does not form a diphthong with an adjacent vowel
Is this the same case as above?
% rg "i72354\t|\t06823760" ../cili/ili-map-p*
../cili/ili-map-pwn31.tab
72312:i72354 06836640-n
../cili/ili-map-pwn30.tab
72354:i72354 06823760-n
We can say the sense was split, so the PWN30 synset needs to map to both synsets in PWN31:
= 06823760-n 06836640-n
= 06823760-n 06836790-n
Or we can say that none of the new synsets are replacements for the old PWN30 synset; they are generalizations. So PWN30 is <= both PWN31.
That would force us to extend the mapping to deal with more fine-grained relations rather than only equality. BTW, can someone see the reason for splitting this sense from PWN30?
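To make the two options concrete, here is a minimal sketch of a mapping that carries a relation label instead of assuming equality. The `Link` type and the `"="` / `"<="` labels are purely illustrative assumptions; nothing like this exists in the current .tab files.

```python
from typing import NamedTuple

class Link(NamedTuple):
    source: str    # PWN30 synset ID
    relation: str  # "=" for equivalence, "<=" for "is narrower than" (hypothetical labels)
    target: str    # PWN31 synset ID

# Second reading from above: the PWN30 synset is narrower than both PWN31 synsets.
links = [
    Link("06823760-n", "<=", "06836640-n"),
    Link("06823760-n", "<=", "06836790-n"),
]

def targets(links, source, relation=None):
    """Return the targets of a source synset, optionally filtered by relation."""
    return [l.target for l in links
            if l.source == source and (relation is None or l.relation == relation)]

print(targets(links, "06823760-n", "="))
print(targets(links, "06823760-n", "<="))
```

With this representation, a consumer that only trusts exact equivalences gets no target for 06823760-n, while one that accepts broader matches gets both.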
Here the mapping seems right, but the PWN31 structure can be changed:
M 10012484-n => 2 1 False {('10032138-n', 2), ('10032289-n', 1)} {'10032138-n'}
WN30 10012484-n {'nutritionist%1:18:00::', 'dietician%1:18:00::', 'dietitian%1:18:00::'}
a specialist in the study of nutrition
WN31 10032289-n {'dietician%1:18:00::', 'dietitian%1:18:00::'}
a specialist in the study of diet and nutrition
WN31 10032138-n {'nutritionist%1:18:00::'}
a specialist in the study of nutrition
If someone is a dietician, he/she is also a nutritionist, right? Because X ∧ Y ⊆ X, so 10032289-n is a hyponym of 10032138-n? But they are sisters in PWN31.
M 00042692-s => 1 1 False {('00042912-s', 0)} {'00035037-s'}
WN30 00042692-s {'activated%5:00:00:active:07'}
rendered active; e.g. rendered radioactive or luminescent or photosensitive or conductive
WN31 00035037-s {'activated%5:00:00:active:08'}
(military) set up and placed on active assignment; "a newly activated unit"
WN31 00042912-s {'activated%5:00:00:active:07'}
rendered active; e.g. rendered radioactive or luminescent or photosensitive or conductive
The mapping is wrong. It points 00042692-s to 00035037-s, but 00042912-s is more appropriate, right?
% rg "i216\t|\t00042692-s" ../cili/ili-map-p*
../cili/ili-map-pwn31.tab
214:i216 00035037-s
../cili/ili-map-pwn30.tab
216:i216 00042692-s
M 02312060-s => 1 1 False {('02319740-s', 0)} {'02319740-a'}
WN30 02312060-s {'akimbo%5:00:00:crooked:01'}
(used of arms and legs) bent outward with the joint away from the body; "a tailor sitting with legs akimbo"; "stood with arms akimbo"
WN31 02319740-a set()
(used of arms and legs) bent outward with the joint away from the body; "a tailor sitting with legs akimbo"; "stood with arms akimbo"
WN31 02319740-s {'akimbo%5:00:00:crooked:01'}
(used of arms and legs) bent outward with the joint away from the body; "a tailor sitting with legs akimbo"; "stood with arms akimbo"
There is no 02319740-a in PWN31, only 02319740-s; it is a satellite synset.
% rg ^02319740 ../WordNet-3.1-dict/data.adj
12840:02319740 00 s 01 akimbo(ip) 0 001 & 02319224 a 0000 | (used of arms and legs) bent outward with the joint away from the body; "a tailor sitting with legs akimbo"; "stood with arms akimbo"
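Mismatches like this one come only from the `a` versus `s` suffix, so a comparison that treats head and satellite adjectives as the same part of speech would avoid reporting them as errors. A small sketch, assuming synset IDs of the form `offset-pos`; `same_synset` is a hypothetical helper.

```python
def same_synset(id1, id2, merge_satellites=True):
    """Compare two 'offset-pos' synset IDs, optionally treating 's' as 'a'."""
    off1, pos1 = id1.rsplit("-", 1)
    off2, pos2 = id2.rsplit("-", 1)
    if merge_satellites:
        # Satellite adjectives live in data.adj alongside head adjectives.
        pos1 = "a" if pos1 == "s" else pos1
        pos2 = "a" if pos2 == "s" else pos2
    return off1 == off2 and pos1 == pos2

print(same_synset("02319740-a", "02319740-s"))
print(same_synset("02319740-a", "02319740-s", merge_satellites=False))
```

Whether the map files should store `-s` or `-a` for satellites is a separate question; this only keeps the comparison from flagging the difference.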
M 00675928-s => 1 1 False {('00679196-s', 0)} {'00679361-s'}
WN30 00675928-s {'alternating%5:00:01:cyclic:01', 'alternate%5:00:01:cyclic:01'}
occurring by turns; first one and then the other; "alternating feelings of love and hate"
WN31 00679196-s {'alternating%5:00:01:cyclic:01', 'alternate%5:00:01:cyclic:01'}
occurring by turns; first one and then the other; "alternating feelings of love and hate"
WN31 00679361-s {'alternate%5:00:02:cyclic:01'}
every second one of a series; "the cleaning lady comes on alternate Wednesdays"; "jam every other day"- the White Queen
Once more, it seems that the mapping is wrong. 00675928-s from PWN30 is 00679196-s in PWN31, not 00679361-s.
% rg "i3764\t|\t00675928-s" ../cili/ili-map-p*
../cili/ili-map-pwn31.tab
3758:i3764 00679361-s
../cili/ili-map-pwn30.tab
3764:i3764 00675928-s
M 02713992-n => 2 0 False {('02716929-n', 1), ('02716785-n', 1)} set()
WN30 02713992-n {'roundel%1:06:01::', 'annulet%1:06:02::'}
(heraldry) a charge in the shape of a circle; "a hollow roundel"
WN31 02716785-n {'roundel%1:06:01::'}
(heraldry) a charge in the shape of a filled circle; "a hollow roundel"
WN31 02716929-n {'annulet%1:06:02::'}
(heraldry) a charge in the shape of a small ring
The mapping is missing 02713992-n to 02716929-n, or maybe we can map to both 02716785-n and 02716929-n. But considering the example and definition, mapping to 02716785-n looks better.
The mapping is clearly wrong. 02599754-v from PWN30 should map to 02605751-v in PWN31.
M 02599754-v => 1 1 False {('02605751-v', 0)} {'00680696-v'}
WN30 02599754-v {'book%2:41:03::'}
register in a hotel booker
WN31 02605751-v {'book%2:41:03::'}
register in a hotel booker
WN31 00680696-v {'book%2:31:00::'}
engage for a performance; "Her agent had booked her for several concerts in Tokyo"
% rg "i34688\t|\t02599754-v" ../cili/ili-map-p*
../cili/ili-map-pwn31.tab
34655:i34688 00680696-v
../cili/ili-map-pwn30.tab
34688:i34688 02599754-v
The mapping is clearly wrong. Both Brioschi and Tums are antacids, and both exist in PWN30 and PWN31. The mapping points 14777104-n to 14802098-n, but it should point to 14801263-n.
M 14777104-n => 1 1 False {('14801263-n', 0)} {'14802098-n'}
WN30 14777104-n {'brioschi%1:27:00::'}
an antacid
WN31 14801263-n {'brioschi%1:27:00::'}
an antacid
WN31 14802098-n {'tums%1:27:00::'}
an antacid
Having both trademarks in WN is strange anyway... but we do not remove them, right? Just do not add more of those in the English Wordnet.
The same error: 14777188-n should map to 14801347-n, not to 14802098-n.
M 14777188-n => 1 1 False {('14801347-n', 0)} {'14802098-n'}
WN30 14777188-n {'bromo-seltzer%1:27:00::'}
an antacid
WN31 14802098-n {'tums%1:27:00::'}
an antacid
WN31 14801347-n {'bromo-seltzer%1:27:00::'}
an antacid
Same error: 14777441-n should map to 14801600-n, not 14802098-n.
M 14777441-n => 1 1 False {('14801600-n', 0)} {'14802098-n'}
WN30 14777441-n {'maalox%1:27:00::'}
an antacid
WN31 14801600-n {'maalox%1:27:00::'}
an antacid
WN31 14802098-n {'tums%1:27:00::'}
an antacid
The mapping is wrong: 00767349-s should map to 00770909-s, not to 00766556-s.
M 00767349-s => 1 1 False {('00770909-s', 0)} {'00766556-s'}
WN30 00767349-s {'roundabout%5:00:00:indirect:02', 'circuitous%5:00:00:indirect:02'}
marked by obliqueness or indirection in speech or conduct; "the explanation was circuitous and puzzling"; "a roundabout paragraph"; "hear in a roundabout way that her ex-husband was marrying her best friend"
WN31 00766556-s {'devious%5:00:00:indirect:00', 'roundabout%5:00:00:indirect:00', 'circuitous%5:00:00:indirect:00'}
deviating from a straight course; "a scenic but devious route"; "a long and circuitous journey by train and boat"; "a roundabout route avoided rush-hour traffic"
WN31 00770909-s {'roundabout%5:00:00:indirect:02', 'circuitous%5:00:00:indirect:02'}
marked by obliqueness or indirection in speech or conduct; "the explanation was circuitous and puzzling"; "a roundabout paragraph"; "hear in a roundabout way that her ex-husband was marrying her best friend"
Same for 00802179-s: it should map to 00805750-s, not 00805871-s.
M 00802179-s => 1 1 False {('00805750-s', 0)} {'00805871-s'}
WN30 00802179-s {'keen%5:00:00:sharp:00'}
having a sharp cutting edge or point; "a keen blade"
WN31 00805750-s {'keen%5:00:00:sharp:00'}
having a sharp cutting edge or point; "a keen blade"
WN31 00805871-s {'knifelike%5:00:00:sharp:00'}
cutting or able to cut as if with a knife
And 00780944-s should map to 00784503-s, not 00805750-s.
M 00780944-s => 1 1 False {('00784503-s', 0)} {'00805750-s'}
WN30 00780944-s {'knifelike%5:00:00:distinct:00'}
having a sharp or distinct edge; "a narrow knifelike profile"
WN31 00784503-s {'knifelike%5:00:00:distinct:00'}
having a sharp or distinct edge; "a narrow knifelike profile"
WN31 00805750-s {'keen%5:00:00:sharp:00'}
having a sharp cutting edge or point; "a keen blade"
Or can we join the senses from PWN30?
% rg "i4419\t|i4304\t|\t00780944-s|\t00802179-s" ../cili/ili-map-p*
../cili/ili-map-pwn31.tab
4297:i4304 00805750-s
4412:i4419 00805871-s
../cili/ili-map-pwn30.tab
4304:i4304 00780944-s
4419:i4419 00802179-s
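Cases like these can be found mechanically by comparing the PWN31 target reached through the ILI with the target reached through shared sense keys. The following is a sketch under that assumption, not the script that produced the `M` lines above; the data is copied from the keen/knifelike example, and `disagreements` is a hypothetical name.

```python
ili30 = {"i4304": "00780944-s", "i4419": "00802179-s"}
ili31 = {"i4304": "00805750-s", "i4419": "00805871-s"}

# Sense keys per synset, as listed in the report lines above.
keys30 = {
    "00780944-s": {"knifelike%5:00:00:distinct:00"},
    "00802179-s": {"keen%5:00:00:sharp:00"},
}
keys31 = {
    "00784503-s": {"knifelike%5:00:00:distinct:00"},
    "00805750-s": {"keen%5:00:00:sharp:00"},
    "00805871-s": {"knifelike%5:00:00:sharp:00"},
}

def disagreements(ili30, ili31, keys30, keys31):
    """List ILIs whose PWN31 target differs from every sense-key-based target."""
    key_to_31 = {k: ss for ss, ks in keys31.items() for k in ks}
    out = []
    for ili, src in sorted(ili30.items()):
        by_ili = ili31.get(ili)
        by_key = {key_to_31[k] for k in keys30.get(src, ()) if k in key_to_31}
        if by_key and by_ili not in by_key:
            out.append((ili, src, by_ili, sorted(by_key)))
    return out

for row in disagreements(ili30, ili31, keys30, keys31):
    print(row)
```

Each reported row is only a candidate: a human still has to decide whether the ILI target or the sense-key target is the right one.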
I am finding more cases, but I will stop here. I would love to have some feedback, @goodmami @jmccrae @fcbond. I need to make sure that my observations above make sense and that I am not missing something. I will update PR #17 with all my observations.
In https://github.com/globalwordnet/cili/issues/16#issuecomment-1269264133, @goodmami mentioned the changes in PWN31 proposed by @fcbond. The question here seems to be: the mapping of PWN30 to PWN31 was created on top of the Princeton releases, right? Later, we may have changes (or patches) for both PWN30 and PWN31.
@arademaker, your specific examples, and more, correspond to what I find in the attached output file (en-loss.txt), produced using the Wn library from @goodmami and the sensekey-based mapping algorithm included in recent NLTK versions. The attached file shows the details for the English row of the losses table in my forthcoming conference paper, which you saw me present last week at GWC 2023:
English 117659 117454 205 0.17 117659 117427 232 0.2
The first 4 numbers in that row concern synsets mapped vs. lost with an offset mapping, and the 4 last numbers concern the ILI mapping. The respective losses (205 with offsets vs. 232 with ILI) can be decomposed like this:
English, 143 lost with both offsets and ILI
English, 89 lost only with ILI
English, 62 lost only with offsets
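As a quick sanity check, the decomposition adds up to the totals in the English row. The snippet assumes the row's fourth and eighth numbers are loss percentages of the 117659 synsets, which is my reading of the row rather than something stated in it.

```python
total = 117659  # English synsets in the row above
lost_both, lost_only_ili, lost_only_offsets = 143, 89, 62

lost_offsets = lost_both + lost_only_offsets  # lost with the offset mapping
lost_ili = lost_both + lost_only_ili          # lost with the ILI mapping

# Assumed interpretation: losses expressed as a percentage of all synsets.
pct_offsets = round(100 * lost_offsets / total, 2)
pct_ili = round(100 * lost_ili / total, 2)

print(lost_offsets, total - lost_offsets, pct_offsets)
print(lost_ili, total - lost_ili, pct_ili)
```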
It is great that you have already started to improve the mappings. Congratulations on that: if your proposed targets can be verified, they will bring us closer to perfect mappings, which seemed out of reach not long ago!
@jmccrae, the most probable reason why a sense-based mapping misses the above cases is that their sensekeys underwent a small alteration, for example a change of lexfile or lex_id. Having two sensekeys for the same sense is not a key violation, though. A database key violation would occur if the same key referred to distinct senses, which does not seem to be the case here. As an example, it is fine to use the same key for Pluto and define it as either an asteroid or a planet, because it still refers to the same physical object.
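To spot such small alterations, one can split a sensekey into its fields (`lemma%ss_type:lex_filenum:lex_id:head_word:head_id`) and report which fields differ between two keys. This is a rough sketch with hypothetical helper names; a one-field difference only marks a candidate pair, since distinct senses of the same lemma can also differ in a single field, as the two 'alternate' keys from earlier in this thread show.

```python
from typing import NamedTuple

class SenseKey(NamedTuple):
    lemma: str
    ss_type: str
    lex_filenum: str
    lex_id: str
    head_word: str
    head_id: str

def parse_sense_key(key):
    """Split 'lemma%ss_type:lex_filenum:lex_id:head_word:head_id' into fields."""
    lemma, rest = key.rsplit("%", 1)
    return SenseKey(lemma, *rest.split(":"))

def changed_fields(old, new):
    """Names of the fields in which two sensekeys differ."""
    a, b = parse_sense_key(old), parse_sense_key(new)
    return [name for name, x, y in zip(SenseKey._fields, a, b) if x != y]

print(parse_sense_key("chlorambucil%1:06:00::"))
print(changed_fields("alternate%5:00:01:cyclic:01", "alternate%5:00:02:cyclic:01"))
```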
Splits are a different problem. Mapping whole synsets cannot handle splits correctly, since splits concern different parts of a synset. Mapping a split synset to two target synsets produces false positives, because every involved sense would get mapped to one correct and one wrong target. Mapping a split to only a single target is still wrong, but yields fewer false positives. The only adequate treatment for splits is a mapping that handles only the concerned senses.
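A sense-level mapping for a split can be sketched as follows, using the umlaut/dieresis synsets quoted earlier in this thread. The dictionaries are hand-copied sample data and the structure is only an illustration of the idea, not an existing file format.

```python
wn30 = {
    "06823760-n": {"umlaut%1:10:00::", "dieresis%1:10:00::", "diaeresis%1:10:00::"},
}
wn31 = {
    "06836640-n": {"umlaut%1:10:00::"},
    "06836790-n": {"dieresis%1:10:00::", "diaeresis%1:10:00::"},
}

# Invert the PWN31 data: sensekey -> synset that contains it.
key_to_wn31 = {key: ss for ss, keys in wn31.items() for key in keys}

# Map each (PWN30 synset, sensekey) pair to exactly one PWN31 synset.
sense_map = {
    (ss, key): key_to_wn31.get(key)
    for ss, keys in wn30.items()
    for key in sorted(keys)
}

for (ss, key), target in sense_map.items():
    print(ss, key, "->", target)
```

Each sense gets exactly one target here, so the false positives that come from mapping the whole synset to both targets do not arise.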
Where is the construction of the mappings from CILI to PWN30 and PWN31 documented?