globalwordnet / english-wordnet

The Open English WordNet
https://en-word.net/

Taboo WordNet #554

Closed jmccrae closed 1 year ago

jmccrae commented 3 years ago

@fcbond and team have developed a new resource called Taboo WordNet.

https://github.com/bond-lab/taboown/tree/main/data

Assuming that we are happy to include offensive terms (see #151), we should also include this resource within EWN.

arademaker commented 3 years ago

What are the benefits of inclusion? Maintenance will be harder in a monolithic resource than with two resources kept in sync... maybe.

jmccrae commented 3 years ago

The goal of this project is to build a wordnet that covers general English as well as possible. As we say on the README, "we welcome contributions" and it is worth the effort of incorporating external resources in spite of the extra work in keeping things up to date, as we have also done for other resources.

fcbond commented 3 years ago

I think it would be good to merge it, and then there is no need to keep a separate resource for the taboown.

We can provide scripts for just getting lists of taboo synsets/senses/words if anyone needs them.
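A minimal sketch of what such a script might look like, assuming (the thread has not settled this) that taboo senses are marked with an `exemplifies` sense relation pointing at a taboo category. The sample XML, the lexicon ID `ex`, and the target ID `ex-offensive-n-01` are hypothetical placeholders, not real EWN or Taboo WordNet data; only the element names (`LexicalEntry`, `Lemma`, `Sense`, `SenseRelation`) follow the WN-LMF format.

```python
import xml.etree.ElementTree as ET

# Hypothetical WN-LMF fragment; all IDs and words are placeholders.
SAMPLE = """
<LexicalResource>
  <Lexicon id="ex" label="Example" language="en" version="0">
    <LexicalEntry id="ex-foo-n">
      <Lemma writtenForm="foo" partOfSpeech="n"/>
      <Sense id="ex-foo-n-01" synset="ex-0001-n">
        <SenseRelation relType="exemplifies" target="ex-offensive-n-01"/>
      </Sense>
    </LexicalEntry>
    <LexicalEntry id="ex-bar-n">
      <Lemma writtenForm="bar" partOfSpeech="n"/>
      <Sense id="ex-bar-n-01" synset="ex-0002-n"/>
    </LexicalEntry>
  </Lexicon>
</LexicalResource>
"""


def list_taboo(xml_text, taboo_targets):
    """Return sorted (words, sense IDs, synset IDs) for senses marked as taboo.

    A sense counts as taboo here if it has an `exemplifies` relation whose
    target is in `taboo_targets` (an assumed marking convention).
    """
    root = ET.fromstring(xml_text)
    words, senses, synsets = set(), set(), set()
    for entry in root.iter("LexicalEntry"):
        lemma = entry.find("Lemma")
        for sense in entry.iter("Sense"):
            marked = any(
                rel.get("relType") == "exemplifies"
                and rel.get("target") in taboo_targets
                for rel in sense.iter("SenseRelation")
            )
            if marked:
                words.add(lemma.get("writtenForm"))
                senses.add(sense.get("id"))
                synsets.add(sense.get("synset"))
    return sorted(words), sorted(senses), sorted(synsets)


words, senses, synsets = list_taboo(SAMPLE, {"ex-offensive-n-01"})
print(words, senses, synsets)
```

If a different marking were chosen (an ID prefix, an attribute, new lexicographer files), only the `marked` test would need to change.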

-- Francis Bond http://www3.ntu.edu.sg/home/fcbond/ Division of Linguistics and Multilingual Studies Nanyang Technological University

goodmami commented 3 years ago

I think the data should be available, but in a way that makes it very obvious which terms are offensive and, ideally, with some way of avoiding them. If they are only known to be offensive by having a domain_topic or exemplifies relation to some particular synset, I'm not sure it's obvious enough. We could add a dc:type attribute, or embed "taboo" in the ID somehow (ewn-taboo-...). These aren't really better than the relation, but if all are implemented, they are more likely to be noticed.
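To illustrate the "redundant markers" idea, here is a rough sketch that checks all three signals and lets users filter on any of them. Everything in it is an assumption made for illustration: the `Synset` class, the value `"taboo"` for the dc:type attribute, the `ewn-taboo-` prefix, and the category ID `ewn-offensive-n` are hypothetical, not an agreed EWN convention.

```python
from dataclasses import dataclass, field


@dataclass
class Synset:
    id: str
    dc_type: str = ""  # hypothetical value of a dc:type attribute
    relations: list = field(default_factory=list)  # (relType, target) pairs


# Hypothetical ID of a synset representing a taboo category.
TABOO_CATEGORIES = {"ewn-offensive-n"}


def taboo_markers(synset):
    """Return the names of every taboo marker present on the synset."""
    markers = []
    if synset.id.startswith("ewn-taboo-"):
        markers.append("id")
    if synset.dc_type == "taboo":
        markers.append("dc:type")
    if any(
        rel in ("domain_topic", "exemplifies") and target in TABOO_CATEGORIES
        for rel, target in synset.relations
    ):
        markers.append("relation")
    return markers


def without_taboo(synsets):
    """Keep only synsets that carry no taboo marker at all."""
    return [s for s in synsets if not taboo_markers(s)]


sample = [
    Synset("ewn-00000001-n"),
    Synset("ewn-taboo-00000002-n", dc_type="taboo",
           relations=[("exemplifies", "ewn-offensive-n")]),
    Synset("ewn-00000003-n", relations=[("exemplifies", "ewn-offensive-n")]),
]
print([s.id for s in without_taboo(sample)])
```

A consumer that only looks at IDs would still catch the second synset but miss the third, which is the argument for implementing more than one marker.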

Another alternative is to make this a WN-LMF-1.1 extension (and, indeed, this was one of the motivating use cases of that feature). In Wn, at least, the plan is that querying an extension lexicon behaves as if it and the original resource were one. Users can exclude the extensions from queries, if desired.

> I think it would be good to merge it, and then there is no need to keep a separate resource for the taboown.

The extension solution would require a separate WN-LMF file, but this project (EWN) could still manage both. The release scripts just need to be updated to produce the extension file separately.

jmccrae commented 3 years ago

Yes, obviously an exemplifies link would be necessary for most of these terms.

Another option is that we could create new lexicographer files marking these as taboo forms.

arademaker commented 3 years ago

I believe we need to address one of the big issues in wordnet maintenance: how to work efficiently with distributed, complementary, and mutually dependent wordnets. I like the idea of exploring the WN-LMF extension feature.

ekaf commented 3 years ago

As @goodmami suggested, the Princeton "topic domain" relation (the ";c" and "-c" pointers) could be an appropriate way to link particular words to the (currently) six categories (offensive, slur, etc.) from Taboo WordNet. Another, maybe even more suitable, possibility could be the "usage domain" relation (the ";u" and "-u" pointers), which already includes many examples of "ethnic slur", "colloquialism", "slang", "disparagement", etc.
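For reference, a small sketch of how these four Princeton pointer symbols line up with the WN-LMF relation names used earlier in the thread (`domain_topic`, `exemplifies`) and their inverses. The `parse_pointer` helper and the sample pointer string are hypothetical; the offset `00000000` is a placeholder, not a real synset.

```python
# Princeton pointer symbol -> WN-LMF relation name.
POINTER_TO_RELATION = {
    ";c": "domain_topic",       # domain of synset: TOPIC
    "-c": "has_domain_topic",   # member of this domain: TOPIC
    ";u": "exemplifies",        # domain of synset: USAGE
    "-u": "is_exemplified_by",  # member of this domain: USAGE
}

# Each pointer symbol paired with the symbol of its reverse pointer.
INVERSE = {";c": "-c", "-c": ";c", ";u": "-u", "-u": ";u"}


def parse_pointer(pointer):
    """Parse one WNDB pointer: 'symbol synset_offset pos source/target'.

    The source/target field is four hex digits: the first two give the word
    number in the source synset, the last two the word number in the target
    synset; 0000 means the pointer holds between whole synsets.
    """
    symbol, offset, pos, src_tgt = pointer.split()
    return {
        "relation": POINTER_TO_RELATION[symbol],
        "inverse": POINTER_TO_RELATION[INVERSE[symbol]],
        "target_offset": offset,
        "target_pos": pos,
        "source_word": int(src_tgt[:2], 16),
        "target_word": int(src_tgt[2:], 16),
    }


# Placeholder pointer: word 1 of the source exemplifies word 2 of the target.
parsed = parse_pointer(";u 00000000 n 0102")
print(parsed)
```

Whichever relation is chosen, the sense-level form (non-zero source/target numbers) lets a single word in a synset be marked without marking its synonyms.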