tokenization - Githubissues

arademaker commented 5 years ago

why

(n) ruling,opinion | the reason for a court's judgment ( as opposed to the decision itself) ;

and

(v) wash | admit to testing or proof ; This silly excuse wo n't wash in traffic court ;

arademaker commented 5 years ago

inconsistency?

arademaker commented 2 years ago

Hi @rajkiran-veluri and @abhishek-basu-git, can you guys look into it?

arademaker commented 2 years ago

We need to ensure that the string produced by detokenization is equal to the text field in the objects. This matters both for any further comparison with WordNet 3.0 itself and for avoiding errors in the text spans that the ERG grammar will use to match the ERG analysis with the token annotations.

Using an edit distance, I could take one step in that direction. From the output below, I know that a semicolon and a space need to be removed from the tokens.

("branched lighting fixture; often ornate; hangs from the ceiling; "
  "branched lighting fixture; often ornate; hangs from the ceiling"
  ((:MATCH #\b #\b) (:MATCH #\r #\r) (:MATCH #\a #\a) (:MATCH #\n #\n)
   (:MATCH #\c #\c) (:MATCH #\h #\h) (:MATCH #\e #\e) (:MATCH #\d #\d)
   (:MATCH #\  #\ ) (:MATCH #\l #\l) (:MATCH #\i #\i) (:MATCH #\g #\g)
   (:MATCH #\h #\h) (:MATCH #\t #\t) (:MATCH #\i #\i) (:MATCH #\n #\n)
   (:MATCH #\g #\g) (:MATCH #\  #\ ) (:MATCH #\f #\f) (:MATCH #\i #\i)
   (:MATCH #\x #\x) (:MATCH #\t #\t) (:MATCH #\u #\u) (:MATCH #\r #\r)
   (:MATCH #\e #\e) (:MATCH #\; #\;) (:MATCH #\  #\ ) (:MATCH #\o #\o)
   (:MATCH #\f #\f) (:MATCH #\t #\t) (:MATCH #\e #\e) (:MATCH #\n #\n)
   (:MATCH #\  #\ ) (:MATCH #\o #\o) (:MATCH #\r #\r) (:MATCH #\n #\n)
   (:MATCH #\a #\a) (:MATCH #\t #\t) (:MATCH #\e #\e) (:MATCH #\; #\;)
   (:MATCH #\  #\ ) (:MATCH #\h #\h) (:MATCH #\a #\a) (:MATCH #\n #\n)
   (:MATCH #\g #\g) (:MATCH #\s #\s) (:MATCH #\  #\ ) (:MATCH #\f #\f)
   (:MATCH #\r #\r) (:MATCH #\o #\o) (:MATCH #\m #\m) (:MATCH #\  #\ )
   (:MATCH #\t #\t) (:MATCH #\h #\h) (:MATCH #\e #\e) (:MATCH #\  #\ )
   (:MATCH #\c #\c) (:MATCH #\e #\e) (:MATCH #\i #\i) (:MATCH #\l #\l)
   (:MATCH #\i #\i) (:MATCH #\n #\n) (:MATCH #\g #\g) (:DELETION #\; NIL)
   (:DELETION #\  NIL)))

But the CL library I am using is limited (see https://github.com/belambert/cl-edit-distance/issues/9), and it sometimes gives me a list of operations that is not trivial to transform into token changes. I need to rethink the strategy.
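One possible way to get from character operations to token changes is sketched below in Python. This is not the CL code used above: it swaps in the standard-library `difflib.SequenceMatcher`, which produces grouped opcodes but is not a minimal edit-distance algorithm. It also assumes that tokens are plain strings carrying their own whitespace, so that detokenization is simple concatenation; `token_spans` and `token_level_diffs` are hypothetical helper names.

```python
from difflib import SequenceMatcher
from itertools import accumulate


def token_spans(tokens):
    """Character (start, end) offsets of each token in the concatenation."""
    ends = list(accumulate(len(t) for t in tokens))
    starts = [0] + ends[:-1]
    return list(zip(starts, ends))


def token_level_diffs(tokens, text):
    """Map each non-equal opcode to the indices of the tokens it overlaps.

    Assumption: detokenization is "".join(tokens).
    Pure insertions that fall exactly on a token boundary touch no token.
    """
    detok = "".join(tokens)
    spans = token_spans(tokens)
    matcher = SequenceMatcher(None, detok, text, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        touched = [k for k, (s, e) in enumerate(spans) if s < i2 and i1 < e]
        diffs.append((tag, detok[i1:i2], text[j1:j2], touched))
    return diffs


tokens = ["hangs", " ", "from", " ", "the", " ", "ceiling", ";", " "]
print(token_level_diffs(tokens, "hangs from the ceiling"))
```

For a trailing mismatch like the one in the chandelier gloss, this reports a single deletion covering the last two tokens, which is easier to act on than a per-character list.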

arademaker commented 2 years ago

I have got one more unexpected mismatch between the glosses and the annotations (obtained from the XML files):

/Users/ar/work/wn/glosstag/data/annotation-ag.jl
 00204249
 disposed to avoid notice; "they considered themselves a tough outfit and weren't bashful about letting anybody know it"; (`blate' is a Scottish term for bashful)
 disposed to avoid notice; (‘blate’ is a Scottish term for bashful); “they considered themselves a tough outfit and weren't bashful about letting anybody know it” ; 
 ((INSERTION NIL t) (INSERTION NIL i) (INSERTION NIL c) (INSERTION NIL e)
  (INSERTION NIL ;) (INSERTION NIL  ) (INSERTION NIL () (INSERTION NIL ‘)
  (INSERTION NIL b) (INSERTION NIL l) (INSERTION NIL a) (INSERTION NIL e)
  (INSERTION NIL ’) (INSERTION NIL  ) (INSERTION NIL s) (INSERTION NIL  )
  (INSERTION NIL a) (INSERTION NIL  ) (INSERTION NIL S) (INSERTION NIL o)
  (INSERTION NIL t) (INSERTION NIL t) (INSERTION NIL i) (INSERTION NIL s)
  (INSERTION NIL h) (INSERTION NIL  ) (INSERTION NIL t) (INSERTION NIL r)
  (INSERTION NIL m) (INSERTION NIL  ) (INSERTION NIL f) (INSERTION NIL o)
  (INSERTION NIL r) (INSERTION NIL  ) (INSERTION NIL b) (INSERTION NIL a)
  (INSERTION NIL s) (INSERTION NIL h) (INSERTION NIL f) (INSERTION NIL u)
  (INSERTION NIL l) (INSERTION NIL )) (SUBSTITUTION " “) (DELETION   NIL)
  (DELETION i NIL) (DELETION t NIL) (DELETION " NIL) (DELETION ; NIL)
  (DELETION   NIL) (DELETION ( NIL) (DELETION ` NIL) (DELETION b NIL)
  (DELETION l NIL) (DELETION a NIL) (DELETION t NIL) (DELETION e NIL)
  (DELETION ' NIL) (DELETION   NIL) (DELETION i NIL) (DELETION s NIL)
  (DELETION   NIL) (DELETION a NIL) (DELETION S NIL) (DELETION c NIL)
  (DELETION o NIL) (DELETION t NIL) (DELETION t NIL) (DELETION s NIL)
  (DELETION h NIL) (DELETION   NIL) (DELETION e NIL) (DELETION r NIL)
  (SUBSTITUTION m ”) (DELETION f NIL) (DELETION o NIL) (SUBSTITUTION r ;)
  (DELETION b NIL) (DELETION a NIL) (DELETION s NIL) (DELETION h NIL)
  (DELETION f NIL) (DELETION u NIL) (DELETION l NIL) (DELETION ) NIL))
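A character-level diff is noisy here because the two strings differ mostly in segment order and quote style. A minimal sketch of a coarser check follows, in Python rather than CL: it normalizes the quote characters seen above and compares the semicolon-separated segments as multisets. The names `QUOTES`, `segments` and `same_up_to_order` are hypothetical, and the splitting rule (semicolons outside double quotes) is an assumption that may not hold for every gloss.

```python
import re
from collections import Counter

# Map curly quotes and the backtick to their plain ASCII counterparts.
QUOTES = str.maketrans({"“": '"', "”": '"', "‘": "'", "’": "'", "`": "'"})

# A segment is a run of quoted strings and characters other than ';' and '"'.
SEGMENT = re.compile(r'(?:"[^"]*"|[^;"])+')


def segments(gloss):
    """Split a gloss on semicolons that are outside double quotes."""
    normalized = gloss.translate(QUOTES)
    return [p.strip() for p in SEGMENT.findall(normalized) if p.strip()]


def same_up_to_order(a, b):
    """True if both glosses have the same segments, ignoring order."""
    return Counter(segments(a)) == Counter(segments(b))
```

Applied to the two strings for 00204249 above, the segments come out equal as multisets, so the mismatch reduces to a reordering plus quote normalization and a trailing separator.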

In the DB files of WN 3.0, the extra information in parentheses was added after the examples. The DTD from the Princeton 2008 release of GlossTag does not allow that; see https://github.com/own-pt/glosstag/blob/princeton/dtd/glosstag.dtd#L25. The aux may occur before def, after def, or inside def, but not inside ex or after ex.

So maybe users manually edited glosses like this one during data preparation, in about 23 cases:

(venv) ar@tenis glosstag-kg % rg ^00204249 ../WordNet-3.0/dict/data.adj

1129:00204249 00 s 02 bashful 0 blate 0 002 & 00204077 a 0000 ;r 08890097 n 0000 | disposed to avoid notice; "they considered themselves a tough outfit and weren't bashful about letting anybody know it"; (`blate' is a Scottish term for bashful)
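A rough way to list such cases directly from the DB files is sketched below in Python. It assumes that examples are the double-quoted spans of a gloss and that aux material is any parenthesized span, which is a heuristic and not the DTD's own definition; `gloss_of` and `aux_after_example` are hypothetical helper names.

```python
import re

EXAMPLE = re.compile(r'"[^"]*"')   # assumption: examples are double-quoted
AUX = re.compile(r'\([^()]*\)')    # assumption: aux is a parenthesized span


def gloss_of(data_line):
    """Return the gloss part of a WordNet data.* line (text after ' | ')."""
    return data_line.split(" | ", 1)[1].strip()


def aux_after_example(gloss):
    """True if a parenthetical starts after the first example, outside any example."""
    examples = [m.span() for m in EXAMPLE.finditer(gloss)]
    if not examples:
        return False
    first_example_start = examples[0][0]
    for m in AUX.finditer(gloss):
        inside_example = any(s <= m.start() < e for s, e in examples)
        if m.start() > first_example_start and not inside_example:
            return True
    return False
```

Running `aux_after_example(gloss_of(line))` over each line of a data file would flag 00204249; whether it yields the same set of cases counted above has not been checked.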

In the original XML files we have the text, orig and wsd attributes for each gloss tag. The wsd holds the tokens. The orig seems to be the original WordNet 3.0 gloss, and the text is:

1. the edited version of the gloss, with changes such as the one above
2. further tokenized, with extra spaces around parentheses, quotes, punctuation, etc.
3. annotated with some token analyses such as Do[="do"] n't[="not"]

So we don't have an intermediary version with only (1) to use as the reference for the ERG analysis. In the data/*.jl files we kept only the orig text, the one that matches the DB files of WordNet 3.0.
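Part of such an intermediary version could perhaps be recovered by stripping the token analyses from the text. The Python sketch below assumes the annotations always have the shape `surface[="normalized"]`, as in the `Do[="do"] n't[="not"]` example; it does not undo the extra spacing, and `strip_analysis` is a hypothetical helper name.

```python
import re

# Assumption: an analysis is attached to a token as surface[="normalized"].
ANNOTATION = re.compile(r'(\S+?)\[="([^"]*)"\]')


def strip_analysis(text):
    """Return the text without annotations, plus the (surface, normalized) pairs."""
    pairs = ANNOTATION.findall(text)
    surface = ANNOTATION.sub(r"\1", text)
    return surface, pairs


print(strip_analysis('Do[="do"] n\'t[="not"] go'))
```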

own-pt / glosstag

tokenization #9