nguyenlamlll opened this issue 1 year ago
Hi @nguyenlamlll,

> my native language is Vietnamese. It is not in the list of available Wordnets here (or in goodmami/wn).

Practically, I think this is true. However, the wns/cldr/ subdirectory contains many more wordnets, albeit smaller and of lower quality, and it includes https://github.com/omwn/omw-data/blob/main/wns/cldr/wn-cldr-vie.tab for Vietnamese. The CLDR wordnets are not released and indexed by the Wn package, so you would need to build it yourself. It is not very useful as it is, but it might serve as a starting point if you wish to work on building out a Vietnamese wordnet (which would be very welcome).

> My question is, can I add a new language? And, how should I add it?

If you were to build a Vietnamese wordnet and wanted it included in OMW, it would need to be under a permissive open license. @fcbond could tell you more, as well as about the how.
G'day,
as @goodmami says, there is an automatically created dictionary and a script to make a wordnet from the tab file. It requires Python 3.9 and the right version of the wn module, so I would suggest you make a virtual environment. Also, if you want to have the ILI mapped, you need to download the ili-map from https://github.com/globalwordnet/cili/blob/master/ili-map-pwn30.tab
Then you can do something like (with the right paths):
$ python3.9 -m venv venv
$ source venv/bin/activate
(venv) $ pip install -r requirements.txt
...
(venv) $ python3.9 scripts/tsv2lmf.py \
    --id wnvie --label "Vietnamese Wordnet from Wiktionary" \
    --language vi \
    --email bond@ieee.org --license "https://creativecommons.org/licenses/by-sa/" \
    --version "1.0" --meta confidence=0.9 \
    --ili-map=/home/bond/git/cili/ili-map-pwn30.tab \
    wns/wikt/wn-wikt-vie.tab wnvie.xml
...
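As an aside, the ili-map file passed via --ili-map is, as far as I understand it, a two-column TAB-separated file mapping ILI identifiers to PWN 3.0 synset offsets. Here is a minimal sketch of inverting it to look up the ILI for a given offset; the column layout and the sample row are assumptions, so check the actual ili-map-pwn30.tab before relying on this:

```python
# Sketch: build an offset -> ILI lookup from an ili-map-style file.
# Assumed layout: "i-identifier<TAB>offset-pos" per line; verify this
# against the real ili-map-pwn30.tab.
def load_ili_map(lines):
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comment lines
        ili, offset = line.split("\t")
        mapping[offset] = ili
    return mapping

# Hypothetical sample row (offset/ILI pairing is illustrative only):
sample_map = ["i69544\t06286395-n"]
print(load_ili_map(sample_map))  # {'06286395-n': 'i69544'}
```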
I hope this is detailed enough.
You can then use it in wn:
>>> import wn
>>> wn.add("wnvie.xml")
Added wnvie:1.0 (Vietnamese Wordnet from Wiktionary)
>>> wn.synsets(ili='i69544', lang='vi')[0].lemmas()
['lời', 'những lời', 'nhời', 'từ', 'tiếng']
>>> wn.synsets(ili='i69544', lang='en')[0].lemmas()
['word']
For the file to become part of the omw-data release, we would want to have it validated; that is, have a human check the entries and confirm that they are correct for each synset.
If you were willing to do that for all entries (or some substantial subset), that would be fantastic. Even better, you might want to start your own project, and we could point to that :-). What do you think?
In fact, there are two automatically built lexicons (one from the CLDR data @goodmami mentioned and one from Wiktionary). The first is smaller (country names, language names and some time/date expressions) but generally more accurate. You can easily merge the two and then make the lexicon.
Thank you two very much! I'm a total beginner in this area so I really appreciate that you took time and effort to answer me in detail!
Indeed, the tab file wn-cldr-vie.tab that you suggested is really simple and incomplete. But thanks, it is a good starting point for me. And is the tab format one that I can produce with Excel, or is it a special format? Sorry, I'm not familiar with this .tab file, but I see that Excel can open it.

And thank you, @fcbond. If I get it right, tsv2lmf.py helps me convert the tab file into an XML file that wn can understand. Am I correct?
I see that I can download the viwiktionary database from this link. In particular, I think viwiktionary-latest-pages-articles.xml.bz2.

So, supposedly, my work pipeline to produce a wn dataset is:

XML file from Wiktionary ---> tab file ---(use tsv2lmf.py)---> XML file for wn ---> load it with wn.add(...)

Am I getting it correctly?
G'day,
On Wed, 3 May 2023 at 13:39, Lam Le @.***> wrote:
> Thank you two very much! I'm a total beginner in this area so I really appreciate that you took time and effort to answer me in detail!
>
> Indeed, the tab file wn-cldr-vie.tab https://github.com/omwn/omw-data/blob/main/wns/cldr/wn-cldr-vie.tab that you suggested is really simple and incomplete. But thanks. It is a good starting point for me. And, a tab format is the one that I can produce with Excel? Or is it a special format? Sorry, I'm not familiar with this .tab file but I see that Excel can open it.
You can edit it with Excel, or with a text editor.
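To make the layout concrete: the OMW tab files are, to the best of my knowledge, plain TAB-separated text with a header line starting with "#", then one row per entry. Below is a minimal sketch of grouping lemmas by synset in Python; the exact column layout (synset, type, lemma) and the sample rows are assumptions, so inspect wn-cldr-vie.tab itself before relying on this:

```python
# Minimal sketch of reading an OMW-style tab file.
# Assumed layout: a '#' header line, then TAB-separated rows of
# synset-offset, row type (e.g. "lemma" or "vie:lemma"), and lemma.
# The real wn-cldr-vie.tab may differ -- inspect the file first.
import csv
from collections import defaultdict

def read_tab(lines):
    """Group lemmas by synset offset, skipping comment lines."""
    synsets = defaultdict(list)
    for row in csv.reader(lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        offset_pos, row_type, lemma = row[0], row[1], row[2]
        if row_type.endswith("lemma"):
            synsets[offset_pos].append(lemma)
    return dict(synsets)

# Hypothetical sample in the assumed format:
sample = [
    "# vie\thttp://example.org\tCC BY 4.0",
    "06286395-n\tvie:lemma\ttừ",
    "06286395-n\tvie:lemma\ttiếng",
]
print(read_tab(sample))  # {'06286395-n': ['từ', 'tiếng']}
```

Since each row is just text separated by TAB characters, anything that writes TSV (Excel's "save as tab-delimited text", or a plain editor) can produce it.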
> And thank you, @fcbond https://github.com/fcbond. If I get it right, tsv2lmf.py helps me convert the tab file into XML file that wn can understand. Am I correct?
>
> I see that I can download the viwiktionary database from this link https://dumps.wikimedia.org/viwiktionary/latest/. Particularly, I think viwiktionary-latest-pages-articles.xml.bz2 https://dumps.wikimedia.org/viwiktionary/latest/viwiktionary-latest-pages-articles.xml.bz2 . So, supposedly, my work pipeline to produce a wn's dataset is:
>
> XML file from Wiktionary ---> Tab file ---(use tsv2lmf.py)---> XML file of wn ----> load it with wn.add(...)
Unfortunately, the step XML file from Wiktionary ---> Tab file is much more complicated (see https://aclanthology.org/P13-1133/) so I recommend you just use the file(s) we have produced.
There is also another file here: https://github.com/fcbond/tufs/blob/master/omw_format/tufs-vocab-vi.tsv
So I would recommend:
Merge CLDR, wikt and tufs tab files -> check and make corrections -> (use tsv2lmf.py)---> XML file of wn ----> load it with wn.add(...)
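The merge step above can be sketched in Python. This is a hypothetical illustration, not part of the omw-data scripts: it assumes each source has been reduced to (synset, lemma) pairs (for instance via a tab-file reader) and keeps the first-seen order, so listing the more accurate CLDR data first gives it priority:

```python
# Sketch of merging several sources into one lemma list per synset.
# Sources are iterables of (synset, lemma) pairs; the sample data
# below is hypothetical.
from collections import defaultdict

def merge_sources(*sources):
    """Return synset -> deduplicated lemma list, keeping first-seen order."""
    merged = defaultdict(list)
    for source in sources:
        for synset, lemma in source:
            if lemma not in merged[synset]:
                merged[synset].append(lemma)
    return dict(merged)

cldr = [("06286395-n", "từ")]
wikt = [("06286395-n", "từ"), ("06286395-n", "tiếng")]
tufs = [("06286395-n", "lời")]
print(merge_sources(cldr, wikt, tufs))
# {'06286395-n': ['từ', 'tiếng', 'lời']}
```

The merged pairs would then still need the human check-and-correct pass before going through tsv2lmf.py.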
> Am I getting it correctly?
Close.
-- Francis Bond https://fcbond.github.io/
Hi everyone, please pardon me if I post this question in the wrong place or if this sounds very... beginner. I have just started using wn and researched the languages. (Also a beginner in Python, too.) My question is, can I add a new language? And, how should I add it? For example, my native language is Vietnamese. It is not in the list of available Wordnets here (or in goodmami/wn). I tried to search on Google, too, but I cannot find any available Wordnet in my language.

I tried to look into the scripts folder, too. But I could not fully understand them. So... again, if I were to add a new language, how should I do it? Because eventually, I think I should be able to query it with wn, so the how part is important, I guess. Thank you very much.