arademaker closed this issue 3 years ago
As preparation for the release, we will generate some files with statistics about the RDF files... and with these statistics we can check the suspected error from the link above.
In the release, we will have a TXT file with statistics on the OWN-PT being distributed. To start, we can have the tables below:
Synset Type | OWN-EN | OWN-PT |
---|---|---|
Core | X1 | Y1 = number of X1 with at least 1 word in PT |
Base | X2 | Y2 = idem |
Noun | X3 | Y3 = number of X3 with at least 1 word in PT |
Verb | X4 | ... |
Adj | X5 | |
Adv | X6 | |
Obs: OWN-EN equals PWN ....
Polysemy in OWN-PT
POS | # words with 1 sense | # words with >1 sense |
---|---|---|
Noun | | |
Verb | | |
Adj | | |
Adv | | |
MWE in OWN-PT
POS | # mwe |
---|---|
Noun | ... |
Verb | ... |
Adj | ... |
Adv | ... |
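The three planned tables can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tuple layout, the toy synset ids, and the function names are hypothetical, and a multi-word expression is assumed to be any Portuguese word form containing a space. The real numbers would come from the RDF data, not from an in-memory list.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: (synset id, POS, English words, Portuguese words).
# The ids are made up for illustration and are not real OWN-PT synsets.
SYNSETS = [
    ("toy-0001-n", "Noun", ["entity"], ["entidade"]),
    ("toy-0002-n", "Noun", ["abstraction"], []),
    ("toy-0003-v", "Verb", ["breathe"], ["respirar", "tomar fôlego"]),
    ("toy-0004-v", "Verb", ["respire"], ["respirar"]),
]


def coverage(synsets):
    """Per POS: (X, Y) = (all synsets, synsets with at least 1 word in PT)."""
    total, covered = Counter(), Counter()
    for _id, pos, _en, pt in synsets:
        total[pos] += 1
        if pt:
            covered[pos] += 1
    return {pos: (total[pos], covered[pos]) for pos in total}


def polysemy(synsets):
    """Per POS: (# PT words with 1 sense, # PT words with >1 sense)."""
    senses = Counter()
    for _id, pos, _en, pt in synsets:
        for word in set(pt):
            senses[(pos, word)] += 1
    table = defaultdict(lambda: [0, 0])
    for (pos, _word), n in senses.items():
        table[pos][0 if n == 1 else 1] += 1
    return {pos: tuple(counts) for pos, counts in table.items()}


def mwe_count(synsets):
    """Per POS: number of distinct PT word forms that contain a space."""
    mwes = defaultdict(set)
    for _id, pos, _en, pt in synsets:
        mwes[pos].update(word for word in pt if " " in word)
    return {pos: len(words) for pos, words in mwes.items() if words}


print(coverage(SYNSETS))
print(polysemy(SYNSETS))
print(mwe_count(SYNSETS))
```

Counting polysemy per (POS, word) pair means the same spelling used as a noun and as a verb is counted once in each row, which is one possible reading of the table above.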
As discussed before, the statistics tables follow. The queries are available at own-en.rq and own-pt.rq. They were run in AllegroGraph on the data at commit 1a7e2159406520e61ad6643834202a4c4450a8f6
Synset Type | OWN-EN | OWN-PT |
---|---|---|
wn30:CoreConcept | 4960 | 4959 |
wn30:BaseConcept | 4683 | 4365 |
Synset Type | OWN-EN | OWN-PT |
---|---|---|
wn30:NounSynset | 82115 | 35522 |
wn30:VerbSynset | 13767 | 8186 |
wn30:AdverbSynset | 3621 | 1966 |
wn30:AdjectiveSynset | 7463 | 3335 |
wn30:AdjectiveSatelliteSynset | 10693 | 3570 |
wn30:Synset | 117659 | 52579 |
Synset Type (OWN-EN) | 1 Word | > 1 Word |
---|---|---|
wn30:NounSynset | 42054 | 40061 |
wn30:VerbSynset | 8041 | 5726 |
wn30:AdverbSynset | 2400 | 1221 |
wn30:AdjectiveSynset | 5690 | 1773 |
wn30:AdjectiveSatelliteSynset | 5663 | 5030 |
wn30:Synset | 63848 | 53811 |
Synset Type (OWN-PT) | 1 Word | > 1 Word |
---|---|---|
wn30:NounSynset | 21523 | 13999 |
wn30:VerbSynset | 4571 | 3615 |
wn30:AdverbSynset | 1219 | 747 |
wn30:AdjectiveSynset | 2462 | 873 |
wn30:AdjectiveSatelliteSynset | 2265 | 1305 |
wn30:Synset | 32040 | 20539 |
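As a quick sanity check, the two word-count tables above can be compared against the per-type totals. The sketch below copies the numbers from this comment and assumes that the first word-count table describes OWN-EN and the second OWN-PT, which is what the sums suggest. The variable and function names are illustrative only.

```python
# Totals per synset type as (OWN-EN, OWN-PT), copied from the tables above.
TOTALS = {
    "wn30:NounSynset": (82115, 35522),
    "wn30:VerbSynset": (13767, 8186),
    "wn30:AdverbSynset": (3621, 1966),
    "wn30:AdjectiveSynset": (7463, 3335),
    "wn30:AdjectiveSatelliteSynset": (10693, 3570),
    "wn30:Synset": (117659, 52579),
}

# (synsets with 1 word, synsets with more than 1 word); assumed to be OWN-EN.
SPLIT_EN = {
    "wn30:NounSynset": (42054, 40061),
    "wn30:VerbSynset": (8041, 5726),
    "wn30:AdverbSynset": (2400, 1221),
    "wn30:AdjectiveSynset": (5690, 1773),
    "wn30:AdjectiveSatelliteSynset": (5663, 5030),
    "wn30:Synset": (63848, 53811),
}

# Same split for the second table; assumed to be OWN-PT.
SPLIT_PT = {
    "wn30:NounSynset": (21523, 13999),
    "wn30:VerbSynset": (4571, 3615),
    "wn30:AdverbSynset": (1219, 747),
    "wn30:AdjectiveSynset": (2462, 873),
    "wn30:AdjectiveSatelliteSynset": (2265, 1305),
    "wn30:Synset": (32040, 20539),
}


def mismatches(split, column):
    """Return the synset types whose two split columns do not add up to the total."""
    return [
        name
        for name, (one, many) in split.items()
        if one + many != TOTALS[name][column]
    ]


print("OWN-EN mismatches:", mismatches(SPLIT_EN, 0))
print("OWN-PT mismatches:", mismatches(SPLIT_PT, 1))
```

Both lists come out empty for the numbers reported here, so the tables are at least internally consistent.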
Synset Type | OWN-EN | OWN-PT |
---|---|---|
wn30:NounSynset | 60344 | 14747 |
wn30:VerbSynset | 2829 | 824 |
wn30:AdverbSynset | 714 | 598 |
wn30:AdjectiveSynset | 65 | 200 |
wn30:AdjectiveSatelliteSynset | 434 | 368 |
wn30:Synset (distinct lemmas) | 64243 | 16684 |
wn30:Synset (distinct words) | 64383 | 16726 |
wn30:Synset (distinct lemma+ss_type) | 67788 | 16737 |
In her comment, https://github.com/own-pt/cl-wnbrowser/issues/103#issuecomment-105569359, Valeria discusses some of our figures that differ from those a paper reports for the original PWN. The cited paper says PWN has 118695 synsets (in OWN-EN we found 117659) and 97329 monosemous synsets (in OWN-EN we found 63848).
But in the data from http://wordnetcode.princeton.edu/3.0/WNdb-3.0.tar.gz and in http://compling.hss.ntu.edu.sg/omw/, I found 117659 defined Synsets and 206978 WordSenses, exactly the same numbers found in OWN-EN.
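For anyone who wants to repeat the count on the Princeton files, here is a minimal sketch. It assumes the WNdb `data.*` layout in which license header lines start with two spaces and each synset line begins with `synset_offset lex_filenum ss_type w_cnt`, where `w_cnt` is a two-digit hexadecimal word count. The function name and the sample lines are made up for illustration; this is not the script used for the tables above.

```python
def count_synsets_and_senses(lines):
    """Count synset lines and sum their word counts (one sense per word)."""
    synsets = 0
    senses = 0
    for line in lines:
        # Assumption: license header lines start with two spaces.
        if line.startswith("  "):
            continue
        fields = line.split()
        if len(fields) < 4:
            continue  # blank or malformed line
        synsets += 1
        senses += int(fields[3], 16)  # w_cnt is hexadecimal
    return synsets, senses


# Hypothetical, abbreviated sample lines in the assumed format.
SAMPLE = [
    "  1 This is a license header line and is skipped.",
    "00000001 03 n 01 entity 0 000 | toy gloss",
    "00000002 03 n 02 abstraction 0 abstract_entity 0 000 | toy gloss",
]

print(count_synsets_and_senses(SAMPLE))
```

On the real data one would feed it the lines of the four data files (noun, verb, adj, adv) from the extracted archive and add up the results.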
Statistics in the format used at http://compling.hss.ntu.edu.sg/omw/. The percentages are relative to the Core synsets defined in http://compling.hss.ntu.edu.sg/omw/wn30-core-synsets.tab, 4960 in total.
Wordnet | Lang | Synsets | Words | Senses | Core |
---|---|---|---|---|---|
OWN-PT | pt | 52579 | 57354 | 83768 | 4959 (99.98%) |
OWN-EN | en | 117659 | 156584 | 206978 | 4960 (100%) |
PS: the single synset with no sense defined in OWN-PT is 01233027-v
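The Core percentages in the table can be reproduced with simple arithmetic. This is just a sketch with a made-up helper name, taking 4960 as the Core total reported for OWN-EN.

```python
def coverage_percent(covered: int, total: int) -> str:
    """Format covered/total as a percentage with two decimal places."""
    return f"{100 * covered / total:.2f}%"


print(coverage_percent(4959, 4960))  # OWN-PT Core coverage: 99.98%
print(coverage_percent(4960, 4960))  # OWN-EN Core coverage: 100.00%
```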
Please close this issue by documenting the code I need to execute to populate the statistics.org file.
In f285c4a5b48b3150188ae1dcb25d0000eabcd06b, we populate a statistics.org file, generated by this script. One might run it as follows:
python3 generate_statistics.py --ownpt openWordnet-PT/data/own-pt-* --ownen openWordnet-PT/data/own-en-* -vv
Second link is not working
Sure. It looks like I added a dot after the link. I've just edited the previous comment with the right link:
See https://github.com/own-pt/cl-wnbrowser/issues/103#issuecomment-105569359