Closed arademaker closed 2 years ago
inconsistency?
Hi @rajkiran-veluri and @abhishek-basu-git, can you look into it?
We need to ensure that the string produced by the detokenization is equal to the `text` field in the objects. This is important both for any further comparison with WordNet 3.0 itself and to avoid errors in the text spans that will be used to match the ERG analysis with the token annotations.
Using an edit distance, I could take one step in that direction. From the output below, I know that a semicolon and a space need to be removed from the tokens.
("branched lighting fixture; often ornate; hangs from the ceiling; "
"branched lighting fixture; often ornate; hangs from the ceiling"
((:MATCH #\b #\b) (:MATCH #\r #\r) (:MATCH #\a #\a) (:MATCH #\n #\n)
(:MATCH #\c #\c) (:MATCH #\h #\h) (:MATCH #\e #\e) (:MATCH #\d #\d)
(:MATCH #\ #\ ) (:MATCH #\l #\l) (:MATCH #\i #\i) (:MATCH #\g #\g)
(:MATCH #\h #\h) (:MATCH #\t #\t) (:MATCH #\i #\i) (:MATCH #\n #\n)
(:MATCH #\g #\g) (:MATCH #\ #\ ) (:MATCH #\f #\f) (:MATCH #\i #\i)
(:MATCH #\x #\x) (:MATCH #\t #\t) (:MATCH #\u #\u) (:MATCH #\r #\r)
(:MATCH #\e #\e) (:MATCH #\; #\;) (:MATCH #\ #\ ) (:MATCH #\o #\o)
(:MATCH #\f #\f) (:MATCH #\t #\t) (:MATCH #\e #\e) (:MATCH #\n #\n)
(:MATCH #\ #\ ) (:MATCH #\o #\o) (:MATCH #\r #\r) (:MATCH #\n #\n)
(:MATCH #\a #\a) (:MATCH #\t #\t) (:MATCH #\e #\e) (:MATCH #\; #\;)
(:MATCH #\ #\ ) (:MATCH #\h #\h) (:MATCH #\a #\a) (:MATCH #\n #\n)
(:MATCH #\g #\g) (:MATCH #\s #\s) (:MATCH #\ #\ ) (:MATCH #\f #\f)
(:MATCH #\r #\r) (:MATCH #\o #\o) (:MATCH #\m #\m) (:MATCH #\ #\ )
(:MATCH #\t #\t) (:MATCH #\h #\h) (:MATCH #\e #\e) (:MATCH #\ #\ )
(:MATCH #\c #\c) (:MATCH #\e #\e) (:MATCH #\i #\i) (:MATCH #\l #\l)
(:MATCH #\i #\i) (:MATCH #\n #\n) (:MATCH #\g #\g) (:DELETION #\; NIL)
(:DELETION #\ NIL)))
But the CL library I am using is limited (see https://github.com/belambert/cl-edit-distance/issues/9): it sometimes gives me a list of operations that is not trivial to transform into token changes. I need to rethink the strategy.
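One possible alternative, sketched here in Python rather than CL, is to ask for grouped spans instead of per-character operations. The sketch uses `difflib.SequenceMatcher` from the standard library, which is not a minimal edit distance but a longest-matching-block diff; the function name `char_diff` is hypothetical and not part of the glosstag code.

```python
from difflib import SequenceMatcher


def char_diff(detokenized, text):
    """Return the non-matching spans as (operation, detokenized_span, text_span)."""
    # autojunk=False disables difflib's heuristic that ignores very frequent characters.
    matcher = SequenceMatcher(None, detokenized, text, autojunk=False)
    return [
        (tag, detokenized[i1:i2], text[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


detok = "branched lighting fixture; often ornate; hangs from the ceiling; "
text = "branched lighting fixture; often ornate; hangs from the ceiling"
print(char_diff(detok, text))  # [('delete', '; ', '')]
```

Grouped spans like `'; '` may be easier to map back to whole tokens than a long list of single-character operations, although reordered segments would still produce noisy output.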
I have got one more unexpected mismatch between the glosses and the annotations (obtained from the XML files):
/Users/ar/work/wn/glosstag/data/annotation-ag.jl
00204249
disposed to avoid notice; "they considered themselves a tough outfit and weren't bashful about letting anybody know it"; (`blate' is a Scottish term for bashful)
disposed to avoid notice; (‘blate’ is a Scottish term for bashful); “they considered themselves a tough outfit and weren't bashful about letting anybody know it” ;
((INSERTION NIL t) (INSERTION NIL i) (INSERTION NIL c) (INSERTION NIL e)
(INSERTION NIL ;) (INSERTION NIL ) (INSERTION NIL () (INSERTION NIL ‘)
(INSERTION NIL b) (INSERTION NIL l) (INSERTION NIL a) (INSERTION NIL e)
(INSERTION NIL ’) (INSERTION NIL ) (INSERTION NIL s) (INSERTION NIL )
(INSERTION NIL a) (INSERTION NIL ) (INSERTION NIL S) (INSERTION NIL o)
(INSERTION NIL t) (INSERTION NIL t) (INSERTION NIL i) (INSERTION NIL s)
(INSERTION NIL h) (INSERTION NIL ) (INSERTION NIL t) (INSERTION NIL r)
(INSERTION NIL m) (INSERTION NIL ) (INSERTION NIL f) (INSERTION NIL o)
(INSERTION NIL r) (INSERTION NIL ) (INSERTION NIL b) (INSERTION NIL a)
(INSERTION NIL s) (INSERTION NIL h) (INSERTION NIL f) (INSERTION NIL u)
(INSERTION NIL l) (INSERTION NIL )) (SUBSTITUTION " “) (DELETION NIL)
(DELETION i NIL) (DELETION t NIL) (DELETION " NIL) (DELETION ; NIL)
(DELETION NIL) (DELETION ( NIL) (DELETION ` NIL) (DELETION b NIL)
(DELETION l NIL) (DELETION a NIL) (DELETION t NIL) (DELETION e NIL)
(DELETION ' NIL) (DELETION NIL) (DELETION i NIL) (DELETION s NIL)
(DELETION NIL) (DELETION a NIL) (DELETION S NIL) (DELETION c NIL)
(DELETION o NIL) (DELETION t NIL) (DELETION t NIL) (DELETION s NIL)
(DELETION h NIL) (DELETION NIL) (DELETION e NIL) (DELETION r NIL)
(SUBSTITUTION m ”) (DELETION f NIL) (DELETION o NIL) (SUBSTITUTION r ;)
(DELETION b NIL) (DELETION a NIL) (DELETION s NIL) (DELETION h NIL)
(DELETION f NIL) (DELETION u NIL) (DELETION l NIL) (DELETION ) NIL))
In the DB files of WN 3.0, the extra information in parentheses was added after the examples. The DTD from the Princeton 2008 release of GlossTag does not allow that. See https://github.com/own-pt/glosstag/blob/princeton/dtd/glosstag.dtd#L25. The `aux` may happen before `def`, after `def` or inside `def`, but not inside `ex` nor after `ex`.
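A rough way to find glosses with a parenthetical after an example is sketched below in Python. It is a heuristic only: it assumes segments are separated by `;`, that examples start with a straight double quote and that auxiliary notes start with `(`, so it will misclassify examples that contain semicolons. The function names are hypothetical.

```python
def segment_kinds(gloss):
    """Classify each ';'-separated segment of a gloss as 'def', 'ex' or 'aux'."""
    kinds = []
    for part in gloss.split(";"):
        part = part.strip()
        if not part:
            continue
        if part.startswith('"'):
            kinds.append("ex")   # quoted example sentence
        elif part.startswith("("):
            kinds.append("aux")  # parenthetical note
        else:
            kinds.append("def")  # definition text
    return kinds


def aux_after_example(gloss):
    """True when a parenthetical segment appears after the first example."""
    kinds = segment_kinds(gloss)
    return "ex" in kinds and "aux" in kinds[kinds.index("ex"):]
```

Running `aux_after_example` over the glosses in the WordNet 3.0 data files would give a candidate list to compare against the manually edited cases.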
So maybe users manually edited these glosses during the data preparation, in around 23 cases:
(venv) ar@tenis glosstag-kg % rg ^00204249 ../WordNet-3.0/dict/data.adj
1129:00204249 00 s 02 bashful 0 blate 0 002 & 00204077 a 0000 ;r 08890097 n 0000 | disposed to avoid notice; "they considered themselves a tough outfit and weren't bashful about letting anybody know it"; (`blate' is a Scottish term for bashful)
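For a systematic comparison, the gloss can be pulled out of each data line instead of grepping one synset at a time. The sketch below is in Python and relies on the WordNet data file layout, where the synset offset is the first field and the gloss follows the `|` separator. The function name `parse_data_line` is hypothetical, and the input is assumed to be a raw line from `data.adj` without the `1129:` line-number prefix that `rg` adds.

```python
def parse_data_line(line):
    """Split a WordNet data file line into (synset_offset, gloss)."""
    # Everything before " | " is the synset record; everything after is the gloss.
    fields, _, gloss = line.partition(" | ")
    offset = fields.split()[0]
    return offset, gloss.strip()
```

The returned gloss can then be compared directly with the `orig` text kept in the `data/*.jl` files.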
In the original XML files we have the `text`, `orig` and `wsd` attributes for each gloss tag. The `wsd` holds the tokens. The `orig` seems to be the original WordNet 3.0 gloss and the `text` is
Do[="do"] n't[="not"]
So we don't have an intermediate version with only (1) to take as the reference for the ERG analysis. In the `data/*.jl` files we kept only the `orig` text, the one that matches the DB files of WordNet 3.0.