own-pt / glosstag

Semantically Tagged PWN glosses

punctuations #30

Closed by arademaker 1 year ago

arademaker commented 1 year ago

Punctuation tokens can be represented more simply:

% jq -c -S ".tokens | .[] " data/*.jl | rg "\"punc\""  | sort | uniq -c | sort -nr
50891 {"form":";","kind":["wf"],"pos":":","tag":"ignore","type":"punc"}
48354 {"form":"“","kind":["wf"],"pos":"dq","sep":"","tag":"ignore","type":"punc"}
47610 {"form":"”","kind":["wf"],"pos":"dq","sep":"","tag":"ignore","type":"punc"}
15710 {"form":";","kind":["wf"],"tag":"ignore","type":"punc"}
14024 {"form":"(","kind":["wf"],"pos":"(","sep":"","tag":"ignore","type":"punc"}
8606 {"form":")","kind":["wf"],"pos":")","sep":"","tag":"ignore","type":"punc"}

We have 210429 tokens of type punc, so we could avoid repetition and save some space by dropping the type field and storing punc in pos, turning (a) into (b):

a. {"form":";","kind":["wf"],"pos":":","tag":"ignore","type":"punc"}
b. {"form":";","kind":["wf"],"tag":"ignore","pos":"punc"}
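
The change from (a) to (b) could be scripted roughly as follows. This is a minimal Python sketch, not the script actually used for the conversion. The function names `simplify_punc` and `simplify_line` are hypothetical, and it assumes each line of a .jl file is one JSON object with a top-level `tokens` list, as the jq query above suggests.

```python
import json


def simplify_punc(token):
    """Return the simplified form of a punctuation token.

    Tokens whose "type" is not "punc" are returned unchanged.
    The input dict is never mutated.
    """
    if token.get("type") != "punc":
        return token
    # Drop the redundant "type" field and let "pos" carry the "punc" label.
    simplified = {k: v for k, v in token.items() if k != "type"}
    simplified["pos"] = "punc"
    return simplified


def simplify_line(line):
    """Rewrite one JSON line (assumed shape: {"tokens": [...], ...})."""
    doc = json.loads(line)
    doc["tokens"] = [simplify_punc(t) for t in doc.get("tokens", [])]
    # Compact separators keep the output close to `jq -c` style.
    return json.dumps(doc, ensure_ascii=False, separators=(",", ":"))
```

Under this rule the two `;` variants in the listing above (with and without a `pos` field) map to the same simplified token.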

arademaker commented 1 year ago

Disk usage before the change:

ar@tenis data % du -hc *.jl
...
236M    total

Disk usage after the change:

ar@tenis data % du -hc *.jl
186M    total

The punctuation tokens are now:

% jq -c -S ".tokens | .[] " data/*.new | rg "\"punc\"" | sort | uniq -c | sort -nr
66601 {"form":";","kind":["wf"],"pos":"punc","tag":"ignore"}
48354 {"form":"“","kind":["wf"],"pos":"punc","sep":"","tag":"ignore"}
47610 {"form":"”","kind":["wf"],"pos":"punc","sep":"","tag":"ignore"}
17460 {"form":"(","kind":["wf"],"pos":"punc","sep":"","tag":"ignore"}
8773 {"form":")","kind":["wf"],"pos":"punc","sep":"","tag":"ignore"}
8687 {"form":")","kind":["wf"],"pos":"punc","tag":"ignore"}
6617 {"form":",","kind":["wf"],"pos":"punc","tag":"ignore"}
1159 {"form":"‘","kind":["wf"],"pos":"punc","sep":"","tag":"ignore"}
1106 {"form":":","kind":["wf"],"pos":"punc","tag":"ignore"}
 744 {"form":"”","kind":["wf"],"pos":"punc","tag":"ignore"}
 732 {"form":"-","kind":["wf"],"pos":"punc","tag":"ignore"}
 677 {"form":"’","kind":["wf"],"pos":"punc","tag":"ignore"}
 486 {"form":"?","kind":["wf"],"pos":"punc","sep":"","tag":"ignore"}
 482 {"form":"’","kind":["wf"],"pos":"punc","sep":"","tag":"ignore"}
 370 {"form":"!","kind":["wf"],"pos":"punc","sep":"","tag":"ignore"}
 210 {"form":"...","kind":["wf"],"pos":"punc","sep":"","tag":"ignore"}
 190 {"form":"--","kind":["wf"],"pos":"punc","sep":"","tag":"ignore"}
 103 {"form":",","kind":["wf"],"pos":"punc","sep":"","tag":"ignore"}
  15 {"form":".","kind":["wf"],"pos":"punc","sep":"","tag":"ignore"}
  13 {"form":"--","kind":["wf"],"pos":"punc","tag":"ignore"}
   9 {"form":";","kind":["wf"],"pos":"punc","sep":"","tag":"ignore"}
   7 {"form":"!","kind":["wf"],"pos":"punc","tag":"ignore"}
   6 {"form":"...","kind":["wf"],"pos":"punc","tag":"ignore"}
   5 {"form":":","kind":["wf"],"pos":"punc","sep":"","tag":"ignore"}
   4 {"form":"\"","kind":["wf"],"pos":"punc","sep":"","tag":"ignore"}
   3 {"form":"?","kind":["wf"],"pos":"punc","tag":"ignore"}
   3 {"form":"/","kind":["wf"],"pos":"punc","tag":"ignore"}
   3 {"form":".","kind":["wf"],"pos":"punc","tag":"ignore"}
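
Where jq and rg are not available, the tally above can be approximated in Python. This is a sketch under the same assumption (one JSON object per line with a top-level `tokens` list). `count_punc_variants` is a hypothetical name, and unlike the `rg` filter it selects tokens by their `pos` or `type` field rather than by any occurrence of the string "punc".

```python
import json
from collections import Counter


def count_punc_variants(lines):
    """Count distinct punctuation tokens across JSON-lines input.

    Each token is serialized with sorted keys (like `jq -S`) so that
    tokens differing only in key order fall into the same bucket.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        for token in json.loads(line).get("tokens", []):
            if token.get("pos") == "punc" or token.get("type") == "punc":
                key = json.dumps(token, sort_keys=True,
                                 ensure_ascii=False, separators=(",", ":"))
                counts[key] += 1
    return counts
```

Iterating over `counts.most_common()` yields the variants in descending order of frequency, similar to `sort | uniq -c | sort -nr`.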