globalwordnet / english-wordnet

The Open English WordNet
https://en-word.net/

Open Access Reference corpus #756

Open jmccrae opened 2 years ago

jmccrae commented 2 years ago

The current guidelines for new synsets state that the lemma must have at least 100 occurrences in Sketch Engine's TenTen corpus.

https://github.com/globalwordnet/english-wordnet/blob/master/NEW_SYNSETS.md

This corpus is only accessible to paying Sketch Engine customers, so it does not fit with our open-source goals. We should update the guideline to use an open-access corpus such as the American National Corpus.

Any suggestions?
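
For reference, the 100-occurrence check itself does not depend on which corpus is chosen. A minimal sketch, assuming the corpus is available as a stream of already-lemmatized, single-word tokens (the function names and the toy data are hypothetical, not part of the guidelines or of any corpus tool):

```python
from collections import Counter
from typing import Iterable

# Threshold taken from the current new-synset guidelines.
MIN_OCCURRENCES = 100


def lemma_counts(lemmas: Iterable[str]) -> Counter:
    """Count occurrences of each lemma, ignoring case."""
    return Counter(lemma.casefold() for lemma in lemmas)


def meets_threshold(counts: Counter, lemma: str,
                    minimum: int = MIN_OCCURRENCES) -> bool:
    """Return True if the lemma occurs at least `minimum` times."""
    return counts[lemma.casefold()] >= minimum


# Toy data standing in for a real lemmatized corpus (an assumption).
toy_corpus = ["selfie"] * 120 + ["Selfie"] * 5 + ["zorp"] * 3
counts = lemma_counts(toy_corpus)

print(meets_threshold(counts, "selfie"))  # frequent lemma
print(meets_threshold(counts, "zorp"))    # rare lemma
```

Multiword lemmas would need n-gram counting rather than single-token counts, which this sketch does not attempt.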

arademaker commented 2 years ago

  1. EWT and OntoNotes from https://github.com/propbank/propbank-release
  2. English UD corpora

jmccrae commented 1 year ago

Thanks @arademaker, but both of those corpora are quite small and I don't think they would suit our needs.

@fcbond has suggested using the COCA corpus, and I think it seems quite suitable.