omwn / omw-data

This packages up data for the Open Multilingual Wordnet

Inconsistent data between NLTK and website #35

Closed AmitMY closed 3 days ago

AmitMY commented 1 year ago

I accessed https://compling.upol.cz/ntumc/cgi-bin/wn-gridx.cgi?gridmode=grid&synset=02934451-n and saw that there are translations, for example in French.

I then tried to get it via code:

1. Import and download the data
```py
import nltk

nltk.download("wordnet")
nltk.download("omw-1.4")
nltk.download("extended_omw")
```

2. What languages are supported?
```py
from nltk.corpus import wordnet as wn
print(wn.langs()) # ['eng', 'als', 'arb', 'bul', 'cmn', 'dan', 'ell', 'fin', 'fra', 'heb', 'hrv', 'isl', 'ita', 'ita_iwn', 'jpn', 'cat', 'eus', 'glg', 'spa', 'ind', 'zsm', 'nld', 'nno', 'nob', 'pol', 'por', 'ron', 'lit', 'slk', 'slv', 'swe', 'tha']
```

3. Get lemmas in all languages
```py
from nltk.corpus import wordnet as wn

# Get the synset using its ID
synset = wn.synset_from_pos_and_offset('n', 2960501)

# List all available languages in OMW
languages = wn.langs()

# Get translations in all languages
translations = {}
for lang in languages:
    lemmas = synset.lemmas(lang=lang)
    if lemmas:  # If there are lemmas for this language
        translations[lang] = [lemma.name() for lemma in lemmas]

# Print translations
for lang, words in translations.items():
    print(f"{lang}: {' / '.join(words)}")
```


Returns:
> eng: car / gondola
cmn: 吊舱
fin: kori
ita: navicella
jpn: ゴンドラ
cat: góndola
eus: ontziska / ontzitxo
spa: barquilla
nld: automobiel / kabelbaan / zweefbaan
pol: gondola / kosz
ron: vagonet

and no French.

goodmami commented 1 year ago

Hi @AmitMY, sorry for the delayed response. When you accessed the OMW website, you looked up the synset 02934451-n, which has lemmas in French. In the Python code, you looked up 02960501-n, which does not. These are two different synsets, so the results are not comparable. Note that not all synsets have lemmas in all wordnets.
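
One way to avoid this kind of mismatch is to derive the part of speech and offset from the ID string in the website URL, rather than typing the offset by hand. The following is only a minimal sketch: `parse_synset_id` and `lemmas_for` are hypothetical helper names, not part of NLTK, and it assumes that the website ID and the installed WordNet data refer to the same WordNet version. `lemmas_for` needs the corpora downloaded in step 1.

```py
def parse_synset_id(synset_id: str) -> tuple[str, int]:
    """Split an ID such as '02934451-n' into (pos, offset)."""
    offset, pos = synset_id.rsplit("-", 1)
    return pos, int(offset)


def lemmas_for(synset_id: str, lang: str) -> list[str]:
    """Return lemma names for a synset ID in one language (needs NLTK data)."""
    # Imported lazily so the parsing helper works without NLTK installed.
    from nltk.corpus import wordnet as wn

    pos, offset = parse_synset_id(synset_id)
    synset = wn.synset_from_pos_and_offset(pos, offset)
    return [lemma.name() for lemma in synset.lemmas(lang=lang)]


website_id = "02934451-n"  # ID taken from the website URL
code_offset = 2960501      # offset passed in the original snippet

pos, offset = parse_synset_id(website_id)
print(pos, offset, offset == code_offset)  # n 2934451 False
```

Calling `lemmas_for("02934451-n", "fra")` would then query the same synset that the website shows, instead of 02960501-n.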

goodmami commented 3 days ago

Closing this as there is no follow-up, no open questions, and the report does not show any bug.