Open lintangsutawika opened 10 months ago
MOT corpus is very small (35M tokens of English, less than 1B tokens total) but worth mentioning for completeness
ProPublica, DemocracyNow, and The Conversation use CC BY-NC-ND 3.0
Agência Pública and Mongabay use CC BY-ND
Tasnim uses CC BY 4.0
OpenDemocracy only says it uses CC
@StellaAthena
@lintangsutawika OpenDemocracy is CC-BY-NC if you go look at the individual articles.
So it seems that Tasnim is the only source that has a sufficiently permissive license.
That's too bad. I was hoping to balance Tasnim with a news source biased in the opposite direction. Without news sites, it would be hard to train a model on current or recent events.
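For reference, the license check discussed above can be sketched in a few lines. This is only an illustration: the license strings are the ones reported in this thread (not independently verified), and the rule that any NC or ND clause disqualifies a source is an assumption about what "sufficiently permissive" means here. `SOURCE_LICENSES` and `is_permissive` are hypothetical names, not part of any existing codebase.

```python
import re

# Licenses as reported in this thread; not independently verified.
SOURCE_LICENSES = {
    "ProPublica": "CC BY-NC-ND 3.0",
    "DemocracyNow": "CC BY-NC-ND 3.0",
    "The Conversation": "CC BY-NC-ND 3.0",
    "Agência Pública": "CC BY-ND",
    "Mongabay": "CC BY-ND",
    "Tasnim": "CC BY 4.0",
    "OpenDemocracy": "CC-BY-NC",
}


def is_permissive(license_name: str) -> bool:
    """Return True if the license has no NC or ND clause.

    Assumption: NonCommercial (NC) and NoDerivatives (ND) clauses
    are what make a license too restrictive for this project.
    """
    # Split on spaces and hyphens so "CC BY-NC" and "CC-BY-NC" are treated alike.
    tokens = re.split(r"[\s-]+", license_name.upper())
    return "NC" not in tokens and "ND" not in tokens


permissive_sources = [
    name for name, lic in SOURCE_LICENSES.items() if is_permissive(lic)
]
print(permissive_sources)
```

Under these assumptions the only source that passes is Tasnim, which matches the conclusion above.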
Found a site that indexes news and also tags each item with the specific CC license it uses. https://opennewswire.org/feed/
Similar to #17
ProPublica
DemocracyNow
The Conversation
Agência Pública
OpenDemocracy
Mongabay
CC BY
CC BY-SA
Public Domain