r-three / common-pile

Repo to hold code and track issues for the collection of permissively licensed data
MIT License

List of News Sources #44

Open lintangsutawika opened 9 months ago

lintangsutawika commented 9 months ago

Similar to #17

CC BY

CC BY-SA

Public Domain

lintangsutawika commented 9 months ago

https://wiki.creativecommons.org/wiki/journalism

StellaAthena commented 8 months ago

The MOT corpus is very small (35M tokens of English, fewer than 1B tokens total), but it is worth mentioning for completeness.

lintangsutawika commented 8 months ago

ProPublica, DemocracyNow, and The Conversation use CC BY-NC-ND 3.0.
Agência Pública and Mongabay use CC BY-ND.
Tasnim uses CC BY 4.0.
OpenDemocracy only says it uses CC.

@StellaAthena

StellaAthena commented 8 months ago

@lintangsutawika OpenDemocracy is CC-BY-NC if you go look at the individual articles.

So it seems that Tasnim is the only source that has a sufficiently permissive license.
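As a rough sketch of the check being done by hand here, the snippet below filters the sources named in this thread against the three license families listed at the top of the issue (CC BY, CC BY-SA, Public Domain). The `normalize` and `is_permissive` helpers and the normalization rules are illustrative assumptions, not code from the repo.

```python
import re

# License families listed as acceptable at the top of this issue.
PERMISSIVE = {"CC BY", "CC BY-SA", "PUBLIC DOMAIN"}

# License per source, as reported in this thread.
SOURCES = {
    "ProPublica": "CC BY-NC-ND 3.0",
    "DemocracyNow": "CC BY-NC-ND 3.0",
    "The Conversation": "CC BY-NC-ND 3.0",
    "Agência Pública": "CC BY-ND",
    "Mongabay": "CC BY-ND",
    "Tasnim": "CC BY 4.0",
    "OpenDemocracy": "CC-BY-NC",
}


def normalize(license_name):
    """Upper-case, turn a leading 'CC-' into 'CC ', and drop a trailing version."""
    name = license_name.strip().upper().replace("CC-", "CC ")
    return re.sub(r"\s+\d+(\.\d+)?$", "", name)


def is_permissive(license_name):
    """True only if the normalized name is one of the accepted families."""
    return normalize(license_name) in PERMISSIVE


permissive_sources = [name for name, lic in SOURCES.items() if is_permissive(lic)]
print(permissive_sources)  # ['Tasnim']
```

Matching on the exact normalized name, rather than on a "CC BY" prefix, keeps NC and ND variants from slipping through.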

lintangsutawika commented 8 months ago

That's too bad. I was hoping to balance Tasnim with a news source biased in the opposite direction. Without news sites, it would be hard to train a model on current or recent events.

lintangsutawika commented 8 months ago

Found a site that indexes news and also tags each item with the specific CC license it uses. https://opennewswire.org/feed/
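If the feed is ordinary RSS and the license tag shows up as a `<category>` element on each item, filtering could look roughly like the sketch below. Both of those are assumptions, since the actual feed layout has not been checked here, and the sample XML is made up for illustration rather than copied from the site. `permissive_titles` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Made-up sample for illustration only; NOT taken from the real feed.
# Assumes RSS 2.0 with the license carried in a <category> element.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <item><title>Example article A</title><category>CC BY</category></item>
  <item><title>Example article B</title><category>CC BY-NC-ND</category></item>
  <item><title>Example article C</title><category>CC BY-SA</category></item>
</channel></rss>"""

# License tags treated as permissive enough, per the list at the top of this issue.
ALLOWED = {"CC BY", "CC BY-SA"}


def permissive_titles(feed_xml):
    """Return titles of items whose category tags include an allowed license."""
    root = ET.fromstring(feed_xml)
    titles = []
    for item in root.iter("item"):
        tags = {c.text.strip() for c in item.findall("category") if c.text}
        if tags & ALLOWED:
            titles.append(item.findtext("title"))
    return titles


print(permissive_titles(SAMPLE_FEED))  # ['Example article A', 'Example article C']
```

The real tag names and spellings would need to be confirmed against the feed before relying on this.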