Create a new release with some improvements (1.5)

omwn / omw-data

This packages up data for the Open Multilingual Wordnet

Create a new release with some improvements (1.5) #31

Open fcbond opened 1 year ago

fcbond commented 1 year ago

- remove synsets with deprecated ILIs (move translations from superseded concepts)
- take from merges in oewn
- identify British (and other variants), e.g. moke British informal for donkey; look also at Domain–Region: united_kingdom (and other countries)
- add new omw formatted wordnets?
- merge with TUFS data https://github.com/fcbond/tufs @ArthurBond
- add new MCR data from @ekaf #25

fcbond commented 1 year ago

We could go two ways with synsets like moke ("British informal for donkey"):

- link it with ir_synonym and make sure both sides have the same translations
- merge, and mark the senses with the dialect and register tags
  - so moke is in donkey but marked with Domain-Region united_kingdom and exemplifies informal
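
As a rough illustration of the second option (merge and tag), here is a minimal sketch. The `Sense` and `Synset` classes, the synset ids, and the tag keys are all hypothetical stand-ins chosen for this example; they are not the omw-data or WN-LMF data model.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Sense:
    lemma: str
    # Hypothetical per-sense tags, e.g. region and register
    tags: dict[str, str] = field(default_factory=dict)


@dataclass
class Synset:
    ident: str
    senses: list[Sense] = field(default_factory=list)


def merge_variant(target: Synset, variant: Synset, tags: dict[str, str]) -> Synset:
    """Move the variant's senses into target, marking each moved sense with tags."""
    existing = {s.lemma for s in target.senses}
    for sense in variant.senses:
        if sense.lemma in existing:
            continue  # the lemma is already in the target synset
        target.senses.append(Sense(sense.lemma, {**sense.tags, **tags}))
        existing.add(sense.lemma)
    return target


# Illustrative ids and lemmas only
donkey = Synset("donkey-n", [Sense("donkey"), Sense("domestic ass")])
moke = Synset("moke-n", [Sense("moke")])
merged = merge_variant(
    donkey,
    moke,
    {"domain_region": "united_kingdom", "exemplifies": "informal"},
)
for sense in merged.senses:
    print(sense.lemma, sense.tags)
```

With this shape the translations only need to be maintained on one synset, and the dialect and register information lives on the individual sense rather than on a separate concept.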

ekaf commented 1 year ago

> take from merges in oewn

@fcbond, this sounds ambiguous and may not be optimal: merges are relative to a target English Wordnet version, so you would, for example, pick either OEWN 2021 or 2022 and then have to deal with different merges in later OEWN versions. It might be better not to handle the merges in OMW-data: NLTK now handles OMW merges seamlessly with any OEWN version, and @goodmami might eventually consider a similar approach in Wn for solving the related issue https://github.com/goodmami/wn/issues/179
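
To make the version dependence concrete, here is a small sketch of applying merges at load time instead of baking them into the data. The function `apply_merges`, the concept ids, the lemmas, and both merge tables are hypothetical; this is not how NLTK or Wn implement it.

```python
from __future__ import annotations


def apply_merges(
    translations: dict[str, set[str]], merges: dict[str, str]
) -> dict[str, set[str]]:
    """Return a copy where lemmas of superseded concepts move to their replacements."""
    result: dict[str, set[str]] = {}
    for concept, lemmas in translations.items():
        # Follow chains (a -> b -> c) to the surviving concept, guarding against cycles
        seen: set[str] = set()
        while concept in merges and concept not in seen:
            seen.add(concept)
            concept = merges[concept]
        result.setdefault(concept, set()).update(lemmas)
    return result


# Version-neutral data: one set of lemmas per concept id (illustrative values)
translations = {"c1": {"word_a"}, "c2": {"word_b"}}

# Hypothetical merge tables for two target English wordnet versions
merges_older = {}            # c1 and c2 are still distinct concepts
merges_newer = {"c2": "c1"}  # c2 was merged into c1

print(apply_merges(translations, merges_older))
print(apply_merges(translations, merges_newer))
```

The stored data stays the same in both cases; only the merge table changes with the target version, which is the reason for keeping merges out of the packaged data.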

arademaker commented 1 year ago

> merge, and mark the senses with the dialect and register tags so moke is in donkey but marked with Domain-Region united_kingdom and exemplifies informal

I prefer this option

goodmami commented 1 year ago

Also consider fixing #32 for this release.

> @goodmami might eventually consider a similar approach in Wn for solving the related issue https://github.com/goodmami/wn/issues/179

The issue is no longer fresh in my mind, but I don't think I was planning to make any significant changes to Wn. More likely I would suggest adding some documentation on how to deal with such merges, for example using the code snippet I wrote in that issue. But I should first check how it was handled in NLTK.

goodmami commented 1 month ago

If a 1.5 version is still on the agenda, let's consider adding pre-3.0 versions of the Princeton WordNet data (see https://github.com/goodmami/wn/issues/199).

fcbond commented 2 weeks ago

I am thinking I will probably not try to do too much here: identifying variants should really be done in the language project (so in OEWN for English).

This is the minimum I would like to see for this release:

- [ ] Get more out of the MCR (done, thanks @ekaf)
- [ ] Remove various duplicates
- [ ] Add confidence to the OMW built XML (need for OMW 2.0)
- [ ] Add earlier PWNs
- [ ] Move to wn 0.9.5 (@goodmami)
- [ ] Extend tsv2lmf.py to deal with variants, counts and pronunciation (need for TUFS), almost done
- [ ] Show the release summary
- [ ] Maybe add the other French Wordnet?
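
For the duplicate-removal and confidence items, something along these lines might work. This is only a sketch: the input row shape and the function `build_entries` are hypothetical, the code is not taken from tsv2lmf.py, and it assumes that the WN-LMF `confidenceScore` attribute is where the confidence would go. Other required attributes (such as `partOfSpeech`) are left out for brevity.

```python
import xml.etree.ElementTree as ET


def build_entries(rows):
    """Build a partial Lexicon element from (synset, lemma, confidence) rows.

    Duplicate (synset, lemma) pairs are collapsed, keeping the highest confidence.
    """
    best = {}
    for synset, lemma, confidence in rows:
        key = (synset, lemma)
        if key not in best or confidence > best[key]:
            best[key] = confidence

    root = ET.Element("Lexicon")
    for i, ((synset, lemma), confidence) in enumerate(sorted(best.items()), 1):
        entry = ET.SubElement(root, "LexicalEntry", id=f"w{i}")
        ET.SubElement(entry, "Lemma", writtenForm=lemma)
        ET.SubElement(
            entry,
            "Sense",
            id=f"w{i}-s",
            synset=synset,
            confidenceScore=f"{confidence:.2f}",  # assumed target attribute
        )
    return root


# Illustrative rows; the second row duplicates the first with a higher score
rows = [("s1", "lemma_a", 0.5), ("s1", "lemma_a", 0.9), ("s2", "lemma_b", 1.0)]
print(ET.tostring(build_entries(rows), encoding="unicode"))
```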

Most of these are close to done; I need to push them out for review, ...