
r-three / common-pile

Repo to hold code and track issues for the collection of permissively licensed data

MIT License


News #45

Closed: lintangsutawika closed this issue 5 months ago

lintangsutawika commented 9 months ago

Sites included. Source: https://opennewswire.org/feed/

CC BY

360info
Africa is a Country
Alt News
Balkan Diskurs
Factly
Freedom of the Press Foundation
Agenzia Fides
Global Voices
Meduza
Mekong Eye
Milwaukee Neighborhood News Service
Minority Africa
New Canadian Media
SciDev.Net
The Solutions Journalism Exchange
Tasnim News Agency
ZimFact

CC BY-SA

Liberty TV
Oxpeckers
Propastop
The Public Record

Public Domain

Caravanserai

Voice of America intentionally left out so that it does not overlap with MOT.

StellaAthena commented 9 months ago

Am I correct in assuming that this is related to https://github.com/r-three/licensed-pile/issues/44? Can you list in the PR which sources you have included / which you plan on doing?

lintangsutawika commented 9 months ago

Yes, it's related. I'm planning to get through all the websites mentioned. I currently have ProPublica and Democracy Now!.

StellaAthena commented 9 months ago

Can you add the (estimated) number of documents and number of tokens to the README?

lintangsutawika commented 9 months ago

What would be the best way to estimate number of tokens? Would sampling a number of pages and calculating the tokens from there be valid?

StellaAthena commented 9 months ago

> What would be the best way to estimate number of tokens? Would sampling a number of pages and calculating the tokens from there be valid?

Yup!
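For reference, the sampling approach discussed above could be sketched roughly as follows. This is only an illustration, not the code used in the repo: `estimate_total_tokens` and `whitespace_token_count` are hypothetical names, and whitespace splitting is a stand-in assumption for whatever tokenizer the project actually uses. The idea is to count tokens on a random sample of documents, take the mean, and scale by the total number of documents.

```python
import random
import statistics
from typing import Callable, Sequence, Tuple


def whitespace_token_count(text: str) -> int:
    """Stand-in tokenizer (assumption): counts whitespace-separated words."""
    return len(text.split())


def estimate_total_tokens(
    documents: Sequence[str],
    sample_size: int,
    count_tokens: Callable[[str], int],
    seed: int = 0,
) -> Tuple[float, float]:
    """Estimate the total token count of `documents` from a random sample.

    Returns (estimated_total, standard_error_of_total). The standard error
    ignores the finite-population correction, so it is slightly conservative.
    """
    n = len(documents)
    if n == 0:
        return 0.0, 0.0
    rng = random.Random(seed)
    k = min(sample_size, n)
    # Sample document indices without replacement.
    sampled = rng.sample(range(n), k)
    counts = [count_tokens(documents[i]) for i in sampled]
    mean = statistics.fmean(counts)
    # Standard error of the mean, then scaled up to the whole collection.
    stderr = statistics.stdev(counts) / k ** 0.5 if k > 1 else 0.0
    return mean * n, stderr * n


if __name__ == "__main__":
    # Toy documents standing in for scraped articles.
    docs = ["a short article", "a somewhat longer article body here", "tiny"] * 100
    total, err = estimate_total_tokens(docs, sample_size=50,
                                       count_tokens=whitespace_token_count)
    print(f"estimated tokens: {total:.0f} (+/- {err:.0f})")
```

To get numbers comparable with the rest of the dataset, `count_tokens` would need to be swapped for the project's real tokenizer; the sample size controls how tight the estimate is.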

lintangsutawika commented 7 months ago

Sorry for the delay. Will get to it this week.

craffel commented 5 months ago

@lintangsutawika any bandwidth to finish this up?

blester125 commented 5 months ago

I made all the changes I had talked about, and I separated the scraping steps to make the code a bit simpler, in this PR: https://github.com/r-three/licensed-pile/pull/68. We should be able to merge that one.

blester125 commented 5 months ago

Development moved to this branch for easier collaboration: https://github.com/r-three/licensed-pile/pull/68