jmccrae closed this issue 1 year ago
What are the benefits of inclusion? A monolithic resource may be harder to maintain than two resources kept in sync... maybe.
The goal of this project is to build a wordnet that covers general English as well as possible. As we say on the README, "we welcome contributions" and it is worth the effort of incorporating external resources in spite of the extra work in keeping things up to date, as we have also done for other resources.
I think it would be good to merge it, and then there is no need to keep a separate resource for the taboown.
We can provide scripts for just getting lists of taboo synsets/senses/words if anyone needs them.
-- Francis Bond http://www3.ntu.edu.sg/home/fcbond/ Division of Linguistics and Multilingual Studies Nanyang Technological University
I think the data should be available, but in a way that makes it very obvious which terms are offensive and, ideally, provides some way of avoiding them. If they are only known to be offensive by having a domain_topic or exemplifies relation to something particular, I'm not sure it's obvious enough. We could add a dc:type attribute, or embed "taboo" in the ID somehow (ewn-taboo-...). These aren't really better than the relation, but if all are implemented, they are more likely to be noticed.
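To make the three markers concrete, here is a minimal sketch of how a consumer might separate flagged synsets from the rest. It uses only the Python standard library on a toy WN-LMF-style fragment. Everything specific in it is an assumption for illustration: the synset IDs, the category synset `ewn-90000000-n`, the `dc:type="taboo"` value, the `ewn-taboo-` prefix, and the helper names `taboo_markers` and `split_synsets`. None of it is an agreed EWN convention.

```python
import xml.etree.ElementTree as ET

# dc namespace URI as declared in the toy sample below (assumed, check the real files).
DC = "https://globalwordnet.github.io/schemas/dc/"

# Toy fragment: all IDs and the "taboo" markers are made up for illustration.
SAMPLE = """<LexicalResource xmlns:dc="https://globalwordnet.github.io/schemas/dc/">
  <Lexicon id="ewn" label="toy" language="en" email="x@example.org"
           license="https://example.org/license" version="0">
    <Synset id="ewn-00000001-n" ili="" partOfSpeech="n"/>
    <Synset id="ewn-00000002-n" ili="" partOfSpeech="n">
      <SynsetRelation relType="exemplifies" target="ewn-90000000-n"/>
    </Synset>
    <Synset id="ewn-00000003-n" ili="" partOfSpeech="n" dc:type="taboo"/>
    <Synset id="ewn-taboo-00000004-n" ili="" partOfSpeech="n"/>
    <Synset id="ewn-90000000-n" ili="" partOfSpeech="n"/>
  </Lexicon>
</LexicalResource>"""

TABOO_CATEGORIES = {"ewn-90000000-n"}  # hypothetical category synset
MARKER_RELATIONS = {"exemplifies", "domain_topic"}


def taboo_markers(synset):
    """Return the set of markers ('relation', 'dc:type', 'id') found on a synset."""
    found = set()
    for rel in synset.findall("SynsetRelation"):
        if rel.get("relType") in MARKER_RELATIONS and rel.get("target") in TABOO_CATEGORIES:
            found.add("relation")
    # ElementTree exposes namespaced attributes as "{uri}name".
    if synset.get(f"{{{DC}}}type") == "taboo":
        found.add("dc:type")
    if synset.get("id", "").startswith("ewn-taboo-"):
        found.add("id")
    return found


def split_synsets(xml_text):
    """Split synset IDs into flagged (with their markers) and unflagged."""
    flagged, clean = {}, []
    for synset in ET.fromstring(xml_text).iter("Synset"):
        markers = taboo_markers(synset)
        if markers:
            flagged[synset.get("id")] = markers
        else:
            clean.append(synset.get("id"))
    return flagged, clean


flagged, clean = split_synsets(SAMPLE)
print(flagged)
print(clean)
```

With redundant markers, a tool that only knows about one of them (say, the ID prefix) would still catch the flagged entries, which is the point of implementing more than one.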
Another alternative is to make this a WN-LMF-1.1 extension (and, indeed, this was one of the motivating use cases of that feature). In Wn, at least, the plan is that querying an extension lexicon behaves as if it and the original resource were one. Users can exclude the extensions from queries, if desired.
> I think it would be good to merge it, and then there is no need to keep a separate resource for the taboown.
The extension solution would require a separate WN-LMF file, but this project (EWN) could still manage both. The release scripts just need to be updated to produce the extension file separately.
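As a rough sketch of what that split could look like, the snippet below pairs a toy base lexicon with a toy extension file and shows a query that includes or excludes the extension. The element names (`LexiconExtension`, `Extends`) follow my understanding of WN-LMF 1.1 and should be checked against the schema. The IDs, versions, and the `synset_ids` helper are hypothetical, and this is not how Wn itself implements extensions.

```python
import xml.etree.ElementTree as ET

# Toy base lexicon; IDs and versions are made up.
BASE = """<LexicalResource>
  <Lexicon id="ewn" label="toy base" language="en" email="x@example.org"
           license="https://example.org/license" version="2020">
    <Synset id="ewn-00000001-n" ili="" partOfSpeech="n"/>
  </Lexicon>
</LexicalResource>"""

# Toy extension that declares which lexicon (and version) it extends.
EXTENSION = """<LexicalResource>
  <LexiconExtension id="ewn-taboo" label="toy taboo extension" language="en"
                    email="x@example.org" license="https://example.org/license"
                    version="0">
    <Extends id="ewn" version="2020"/>
    <Synset id="ewn-taboo-00000002-n" ili="" partOfSpeech="n">
      <SynsetRelation relType="hypernym" target="ewn-00000001-n"/>
    </Synset>
  </LexiconExtension>
</LexicalResource>"""


def synset_ids(base_xml, extension_xml=None):
    """List synset IDs from the base, plus the extension's if one is given."""
    base = ET.fromstring(base_xml).find("Lexicon")
    ids = [s.get("id") for s in base.iter("Synset")]
    if extension_xml is not None:
        ext = ET.fromstring(extension_xml).find("LexiconExtension")
        extends = ext.find("Extends")
        # Refuse to combine an extension with a base it was not built for.
        if (extends.get("id"), extends.get("version")) != (base.get("id"), base.get("version")):
            raise ValueError("extension does not match this base lexicon")
        ids += [s.get("id") for s in ext.iter("Synset")]
    return ids


print(synset_ids(BASE))             # base only: extension excluded
print(synset_ids(BASE, EXTENSION))  # base and extension queried as one
```

The version check matters for the maintenance question raised earlier: if the extension pins the base version it was built against, a mismatch is caught at load time rather than showing up as dangling relation targets.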
Yes, obviously an exemplifies link would be necessary for most of these terms.
Another option is that we could create new lexicographer files marking these as taboo forms.
I believe we need to address one of the big issues in wordnet maintenance: how to work efficiently with distributed, complementary and mutually dependent wordnets. I like the idea of exploring the XML-LMF extension feature.
As @goodmami suggested, the Princeton "topic domain" relation (the ";c" and "-c" pointers) could be an appropriate way to link particular words to the (currently) six categories (offensive, slur, etc.) from Taboo WordNet. Another, perhaps more suitable, possibility could be the "usage domain" relation (the ";u" and "-u" pointers), which already covers many examples of "ethnic slur", "colloquialism", "slang", "disparagement", etc.
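For anyone less familiar with these pointer symbols, here is a small sketch of how they appear in a Princeton-style data file line and how they could be picked out. The field layout follows the WNDB format as I understand it (hexadecimal word count, then word/lex_id pairs, then a decimal pointer count followed by four fields per pointer). The data line, offsets, and the `domain_pointers` helper are invented for illustration.

```python
# Domain pointer symbols from the Princeton WordNet database format.
DOMAIN_POINTERS = {
    ";c": "topic domain of this synset",
    "-c": "member of this topic domain",
    ";u": "usage domain of this synset",
    "-u": "member of this usage domain",
}


def domain_pointers(data_line):
    """Return (symbol, target_offset, target_pos) for topic/usage domain pointers."""
    fields = data_line.split("|", 1)[0].split()  # drop the gloss
    w_cnt = int(fields[3], 16)                   # word count is hexadecimal
    p_start = 4 + 2 * w_cnt                      # skip the word/lex_id pairs
    p_cnt = int(fields[p_start])                 # pointer count is decimal
    found = []
    for i in range(p_cnt):
        start = p_start + 1 + 4 * i
        symbol, offset, pos, _source_target = fields[start:start + 4]
        if symbol in DOMAIN_POINTERS:
            found.append((symbol, offset, pos))
    return found


# Made-up data line: one word, a hypernym pointer and a usage-domain pointer.
LINE = "00000002 18 n 01 someword 0 002 @ 00000001 n 0000 ;u 00000003 n 0000 | made-up gloss"

for symbol, offset, pos in domain_pointers(LINE):
    print(symbol, DOMAIN_POINTERS[symbol], offset, pos)
```

Whichever relation is chosen, the category synsets (offensive, slur, and so on) would be the pointer targets, so a filter only needs the set of category offsets.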
@fcbond and team have developed a new resource called Taboo WordNet.
https://github.com/bond-lab/taboown/tree/main/data
Assuming that we are happy to include offensive terms (see #151), we should also include this resource within EWN.