omwn / omw-data

This packages up data for the Open Multilingual Wordnet

How can I add a new language? #34

Open nguyenlamlll opened 1 year ago

nguyenlamlll commented 1 year ago

Hi everyone, please pardon me if I am posting this question in the wrong place or if it sounds very beginner-level. I have just started using wn and looking into the available languages. (I am also a beginner in Python.)

My question is, can I add a new language? And, how should I add it? For example, my native language is Vietnamese. It is not in the list of available Wordnets here (or in goodmami/wn). I tried searching on Google, too, but I could not find any available Wordnet in my language.

I tried to look into the scripts folder, too, but I could not fully understand the scripts. So, again, if I were to add a new language, how should I do it? Eventually I would like to be able to query it with wn, so the "how" part is important, I guess.

Thank you very much.

goodmami commented 1 year ago

Hi @nguyenlamlll,

> my native language is Vietnamese. It is not in the list of available Wordnets here (or in goodmami/wn).

Practically, I think this is true. However, the wns/cldr/ subdirectory contains many more wordnets, albeit smaller and of lower quality, and it has https://github.com/omwn/omw-data/blob/main/wns/cldr/wn-cldr-vie.tab for Vietnamese. The CLDR wordnets are not released, nor are they indexed by the Wn package, so you'd need to build it yourself. It's not very useful as it is, but it might serve as a starting point if you wish to work on building out a Vietnamese wordnet (which would be very welcome).

> My question is, can I add a new language? And, how should I add it?

If you were to build a Vietnamese wordnet and wanted it included in OMW, it would need to be under a permissive open license. @fcbond could tell you more, as well as about the how.

fcbond commented 1 year ago

G'day,

as @goodmami says, there is an automatically created dictionary and a script to make a wordnet from the tab file. The script requires Python 3.9 and the right version of the wn module, so I would suggest you make a virtual environment. Also, if you want to have the ILI mapped, you need to download the ili-map from https://github.com/globalwordnet/cili/blob/master/ili-map-pwn30.tab

Then you can do something like (with the right paths):

$ python3.9 -m venv venv
$ source venv/bin/activate
(venv) $ pip install -r requirements.txt
...
(venv) $ python3.9 scripts/tsv2lmf.py \
--id wnvie --label "Vietnamese Wordnet from Wiktionary" \
--language vi \
--email bond@ieee.org --license "https://creativecommons.org/licenses/by-sa/" \
--version "1.0" --meta confidence=0.9 \
--ili-map=/home/bond/git/cili/ili-map-pwn30.tab \
wns/wikt/wn-wikt-vie.tab wnvie.xml
...

I hope this is detailed enough.

You can then use it in wn:

>>> import wn
>>> wn.add("wnvie.xml")
Added wnvie:1.0 (Vietnamese Wordnet from Wiktionary)
>>> wn.synsets(ili='i69544', lang='vi')[0].lemmas()
['lời', 'những lời', 'nhời', 'từ', 'tiếng']
>>> wn.synsets(ili='i69544', lang='en')[0].lemmas()
['word']

For the file to become part of the omw-data release, we would want to have it validated, that is, have a human check the entries and confirm that they are correct for each synset.
If you were willing to do that for all entries (or some substantial subset) that would be fantastic. What do you think?

Even better, you might want to start your own project, and we could point to that :-). What do you think?

fcbond commented 1 year ago

In fact, there are two automatically built lexicons (one from the CLDR data @goodmami mentioned and one from Wiktionary). The first is smaller (country names, language names, and some time/date expressions) but generally more accurate. You can easily merge the two and then build the lexicon.
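Merging the two tab files could be sketched roughly as below. This is only an illustration, not part of the omw-data scripts: the function name `merge_tab_lines` is made up here, and it assumes each file has optional `#` header lines followed by one data row per line. The sample rows use placeholder synset IDs, not real ones.

```python
from typing import Iterable, List


def merge_tab_lines(*sources: Iterable[str]) -> List[str]:
    """Merge several sequences of tab-file lines into one list of lines.

    Assumption: lines starting with '#' are header/comment lines and every
    other non-empty line is one data row. Header lines are kept only from
    the first source; duplicate data rows are dropped (first one wins).
    """
    header: List[str] = []
    rows: List[str] = []
    seen = set()
    for index, source in enumerate(sources):
        for raw in source:
            line = raw.rstrip("\r\n")
            if not line.strip():
                continue  # skip blank lines
            if line.startswith("#"):
                if index == 0:
                    header.append(line)
                continue
            if line not in seen:
                seen.add(line)
                rows.append(line)
    return header + rows


# Placeholder rows (hypothetical IDs and column layout), standing in for files.
cldr = ["# CLDR sample\tvie", "00000001-n\tvie:lemma\tViệt Nam"]
wikt = [
    "# Wiktionary sample\tvie",
    "00000002-n\tvie:lemma\ttừ",
    "00000001-n\tvie:lemma\tViệt Nam",
]
merged = merge_tab_lines(cldr, wikt)
print("\n".join(merged))
```

With real files you would pass open file objects instead of lists, since iterating over a text file yields its lines, and then write the merged lines to a new tab file for tsv2lmf.py.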

nguyenlamlll commented 1 year ago

Thank you two very much! I'm a total beginner in this area so I really appreciate that you took time and effort to answer me in detail!

Indeed, the tab file wn-cldr-vie.tab that you suggested is really simple and incomplete. But thanks. It is a good starting point for me. And, a tab format is the one that I can produce with Excel? Or is it a special format? Sorry, I'm not familiar with this .tab file but I see that Excel can open it.

And thank you, @fcbond. If I get it right, tsv2lmf.py helps me convert the tab file into XML file that wn can understand. Am I correct?

I see that I can download the viwiktionary database from this link. Particularly, I think viwiktionary-latest-pages-articles.xml.bz2. So, supposedly, my work pipeline to produce a wn's dataset is:

XML file from Wiktionary ---> Tab file ---(use tsv2lmf.py)---> XML file of wn ----> load it with wn.add(...)

Am I getting it correctly?

fcbond commented 1 year ago

G'day,

On Wed, 3 May 2023 at 13:39, Lam Le wrote:

> Thank you two very much! I'm a total beginner in this area so I really appreciate that you took time and effort to answer me in detail!
>
> Indeed, the tab file wn-cldr-vie.tab (https://github.com/omwn/omw-data/blob/main/wns/cldr/wn-cldr-vie.tab) that you suggested is really simple and incomplete. But thanks. It is a good starting point for me. And, a tab format is the one that I can produce with Excel? Or is it a special format? Sorry, I'm not familiar with this .tab file but I see that Excel can open it.

You can edit it with Excel, or with a text editor.
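If you would rather script your edits, the same kind of file can be read and written with Python's standard csv module, since it is plain text with tab-separated columns. The sketch below works on an in-memory string with placeholder rows; the three-column layout and the IDs are assumptions for illustration only.

```python
import csv
import io

# Placeholder content standing in for a small .tab file.
text = "# sample header\tvie\n00000001-n\tvie:lemma\ttừ\n"

# QUOTE_NONE tells the csv module not to treat quote characters specially,
# so each field is read exactly as it appears between the tabs.
rows = list(csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE))

# Add one more (made-up) row, as you might after checking an entry by hand.
rows.append(["00000002-n", "vie:lemma", "tiếng"])

# Write everything back out in the same tab-separated form.
out = io.StringIO()
writer = csv.writer(out, delimiter="\t", quoting=csv.QUOTE_NONE, lineterminator="\n")
writer.writerows(rows)
print(out.getvalue(), end="")
```

For a file on disk, replace the io.StringIO objects with files opened using encoding="utf-8" and newline="".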

> And thank you, @fcbond (https://github.com/fcbond). If I get it right, tsv2lmf.py helps me convert the tab file into XML file that wn can understand. Am I correct?
>
> I see that I can download the viwiktionary database from this link (https://dumps.wikimedia.org/viwiktionary/latest/). Particularly, I think viwiktionary-latest-pages-articles.xml.bz2 (https://dumps.wikimedia.org/viwiktionary/latest/viwiktionary-latest-pages-articles.xml.bz2). So, supposedly, my work pipeline to produce a wn's dataset is:
>
> XML file from Wiktionary ---> Tab file ---(use tsv2lmf.py)---> XML file of wn ----> load it with wn.add(...)

Unfortunately, the step "XML file from Wiktionary ---> Tab file" is much more complicated than it looks (see https://aclanthology.org/P13-1133/), so I recommend you just use the file(s) we have produced.

There is also another file here: https://github.com/fcbond/tufs/blob/master/omw_format/tufs-vocab-vi.tsv

So I would recommend:

Merge CLDR, wikt and tufs tab files -> check and make corrections -> (use tsv2lmf.py)---> XML file of wn ----> load it with wn.add(...)
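For the "check and make corrections" step, a small script can flag rows that look malformed before a human reviews the lemmas. The sketch below is only a first pass and rests on assumptions: that each data row has three tab-separated fields, and that the first field is an eight-digit offset plus a part-of-speech letter. The helper `find_suspect_rows` is hypothetical and the sample rows are made up; adjust the pattern if your files differ.

```python
import re
from typing import Iterable, List, Tuple

# Assumed synset ID shape: eight digits, a hyphen, and one POS letter.
SYNSET_ID = re.compile(r"^\d{8}-[nvars]$")


def find_suspect_rows(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line number, reason) for rows that do not fit the assumed layout."""
    problems: List[Tuple[int, str]] = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip() or line.startswith("#"):
            continue  # blank lines and header/comment lines are not checked
        fields = line.split("\t")
        if len(fields) != 3:
            problems.append((number, "expected 3 tab-separated fields"))
        elif not SYNSET_ID.match(fields[0]):
            problems.append((number, "unexpected synset ID"))
        elif not fields[2].strip():
            problems.append((number, "empty lemma"))
    return problems


# Made-up rows: one good, three with different problems.
sample = [
    "# sample\tvie",
    "00000001-n\tvie:lemma\ttừ",
    "00000002-n\tvie:lemma",
    "word\tvie:lemma\ttiếng",
    "00000003-v\tvie:lemma\t ",
]
problems = find_suspect_rows(sample)
for number, reason in problems:
    print(f"line {number}: {reason}")
```

This only catches formatting slips. Whether a lemma is actually correct for its synset still needs a person who knows the language.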

> Am I getting it correctly?

Close.

-- Francis Bond https://fcbond.github.io/