own-pt / openWordnet-PT

OpenWordnet-PT: an open access wordnet for Portuguese
http://openwordnet-pt.org

Investigate the issue with PWN numbers #159

Closed arademaker closed 3 years ago

arademaker commented 5 years ago

See https://github.com/own-pt/cl-wnbrowser/issues/103#issuecomment-105569359

arademaker commented 3 years ago

In preparation for the release, we will generate some files with statistics about the RDF files, and with those statistics we can check the suspected error described in the link above.

arademaker commented 3 years ago

For the release, we will ship a TXT file with statistics on the OWN-PT being distributed. To start, we can have the tables below:

       OWN-EN  OWN-PT
Core   X1      Y1 = number of X1 synsets with at least one word in PT
Base   X2      Y2 = same, for X2

Noun   X3      Y3 = number of X3 synsets with at least one word in PT
Verb   X4      ...
Adj    X5
Adv    X6

Note: OWN-EN equals PWN ....

Polysemy in OWN-PT

      # words 1 sense     # words >1 sense
Noun
Verb
Adj
Adv

MWE in OWN-PT

       # mwe
Noun   ...
Verb   ...
Adj    ...
Adv    ...
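
A minimal sketch of how the OWN-EN / OWN-PT coverage columns proposed above could be computed. The synset ids, words, and the `coverage` helper are hypothetical toy data for illustration only, not the project's actual code or data:

```python
from collections import Counter

# Hypothetical toy data: synset id -> part of speech (stands in for OWN-EN).
synset_pos = {"n1": "Noun", "n2": "Noun", "v1": "Verb", "a1": "Adj"}

# Hypothetical toy data: synset id -> Portuguese words (stands in for OWN-PT).
pt_words = {"n1": ["palavra_a"], "n2": ["palavra_b", "palavra_c"], "v1": []}


def coverage(synset_pos, pt_words):
    """Return {pos: (X, Y)}: X synsets in total, Y with at least one PT word."""
    total = Counter(synset_pos.values())
    covered = Counter(
        pos for sid, pos in synset_pos.items() if pt_words.get(sid)
    )
    return {pos: (total[pos], covered[pos]) for pos in total}


table = coverage(synset_pos, pt_words)
for pos, (x, y) in table.items():
    print(f"{pos:<6} {x:>6} {y:>6}")
```

A synset with an empty word list, or with no entry at all, counts toward X but not toward Y.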
FredsoNerd commented 3 years ago

As discussed before, here are the statistics tables. The queries are available in own-en.rq and own-pt.rq. They were run in AllegroGraph against the data at commit 1a7e2159406520e61ad6643834202a4c4450a8f6.

Table of Contents

  1. Base and Core Concepts
  2. Instantiated Synsets
  3. Polysemy OWN-EN
  4. Polysemy OWN-PT
  5. Multi Word Expressions

Base and Core Concepts

Synset Type        OWN-EN  OWN-PT
wn30:CoreConcept     4960    4959
wn30:BaseConcept     4683    4365

Instantiated Synsets

Synset Type                    OWN-EN  OWN-PT
wn30:NounSynset                 82115   35522
wn30:VerbSynset                 13767    8186
wn30:AdverbSynset                3621    1966
wn30:AdjectiveSynset             7463    3335
wn30:AdjectiveSatelliteSynset   10693    3570
wn30:Synset                    117659   52579

Polysemy OWN-EN

Synset Type                    1 Word  2+ Words
wn30:NounSynset                 42054     40061
wn30:VerbSynset                  8041      5726
wn30:AdverbSynset                2400      1221
wn30:AdjectiveSynset             5690      1773
wn30:AdjectiveSatelliteSynset    5663      5030
wn30:Synset                     63848     53811

Polysemy OWN-PT

Synset Type                    1 Word  2+ Words
wn30:NounSynset                 21523     13999
wn30:VerbSynset                  4571      3615
wn30:AdverbSynset                1219       747
wn30:AdjectiveSynset             2462       873
wn30:AdjectiveSatelliteSynset    2265      1305
wn30:Synset                     32040     20539

Multi Word Expressions

Synset Type                            OWN-EN  OWN-PT
wn30:NounSynset                         60344   14747
wn30:VerbSynset                          2829     824
wn30:AdverbSynset                         714     598
wn30:AdjectiveSynset                       65     200
wn30:AdjectiveSatelliteSynset             434     368
wn30:Synset (distinct lemmas)           64243   16684
wn30:Synset (distinct words)            64383   16726
wn30:Synset (distinct lemma+ss_type)    67788   16737
FredsoNerd commented 3 years ago

In Valeria's comment, https://github.com/own-pt/cl-wnbrowser/issues/103#issuecomment-105569359, she discusses some figures that differ from the original PWN according to a paper. The cited paper says PWN has 118695 synsets (in OWN-EN we found 117659) and 97329 monosemous synsets (in OWN-EN we found 63848).

But in the data from http://wordnetcode.princeton.edu/3.0/WNdb-3.0.tar.gz, and in http://compling.hss.ntu.edu.sg/omw/, I found 117659 defined Synsets and 206978 WordSenses, exactly the same numbers found in OWN-EN.
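
For reference, a count of this kind can be sketched directly over WNdb `data.*` files: each non-header line is one synset, and its fourth field (`w_cnt`, two hexadecimal digits) gives the number of words in it. This is a hedged illustration and not the method used above; the function name is hypothetical and the sample lines are shortened for the example:

```python
def count_synsets_and_senses(lines):
    """Count synsets and word senses in lines taken from a WNdb data.* file."""
    synsets = senses = 0
    for line in lines:
        # License header lines start with two spaces; skip them and blanks.
        if line.startswith("  ") or not line.strip():
            continue
        fields = line.split()
        synsets += 1
        senses += int(fields[3], 16)  # w_cnt is a two-digit hexadecimal number
    return synsets, senses


# Shortened sample lines in data.noun layout (pointers and glosses trimmed).
sample = [
    "  1 This software and database is being provided to you ...",
    "00001740 03 n 01 entity 0 000 | shortened sample line",
    "00002137 03 n 02 abstraction 0 abstract_entity 0 000 | shortened sample line",
]

print(count_synsets_and_senses(sample))
```

Running the same function over the lines of the noun, verb, adjective, and adverb data files and summing the results would give totals comparable to the ones quoted above.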

FredsoNerd commented 3 years ago

Statistics in the format used at http://compling.hss.ntu.edu.sg/omw/. The percentages are relative to the core synsets defined in http://compling.hss.ntu.edu.sg/omw/wn30-core-synsets.tab, with a total of 4960 core synsets.

Open Multilingual Wordnet

Wordnet  Lang  Synsets   Words   Senses  Core
OWN-PT   pt      52579   57354    83768  4959 (99.98%)
OWN-EN   en     117659  156584   206978  4960 (100%)

PS: the single core synset that has no sense defined in OWN-PT is 01233027-v.
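
The Core percentages in the table follow from dividing the covered core synsets by the core total. A small worked sketch, where the helper name is an assumption for illustration:

```python
def core_coverage(covered, total):
    """Format the share of core synsets covered, with two decimal places."""
    return f"{covered / total * 100:.2f}%"


own_pt_core = core_coverage(4959, 4960)  # OWN-PT: one core synset missing
own_en_core = core_coverage(4960, 4960)  # OWN-EN: all core synsets present
print(own_pt_core, own_en_core)
```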

arademaker commented 3 years ago

Please close this issue by documenting the code I need to execute to populate the statistics.org file.

FredsoNerd commented 3 years ago

In f285c4a5b48b3150188ae1dcb25d0000eabcd06b, we populate a statistics.org file, generated by this script. One might run it as follows:

python3 generate_statistics.py --ownpt openWordnet-PT/data/own-pt-* --ownen openWordnet-PT/data/own-en-* -vv
arademaker commented 3 years ago

Second link is not working

FredsoNerd commented 3 years ago

Sure. It looks like I added a dot after the link. I've just edited the previous comment with the right link:

https://github.com/own-pt/py-ownpt/blob/f5652fddd42565b38f34a5f5ab38c956e92339c6/pyownpt/cli/generate_statistics.py