Closed: lintangsutawika closed this 5 months ago
Am I correct in assuming that this is related to https://github.com/r-three/licensed-pile/issues/44? Can you list in the PR which sources you have included / which you plan on doing?
Yes, it's related. I'm planning to get through all the websites mentioned. Currently I have ProPublica and Democracy Now.
Can you add the (estimated) number of documents and number of tokens to the README?
What would be the best way to estimate the number of tokens? Would sampling a number of pages and calculating the tokens from there be valid?
Yup!
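For reference, the sampling approach agreed on above could look roughly like this. It is only a sketch: `estimate_total_tokens` is a hypothetical helper, not code from this PR, and the default whitespace split is a stand-in for whatever tokenizer the project actually uses, so real counts will differ.

```python
import random
from typing import Callable, Sequence


def estimate_total_tokens(
    pages: Sequence[str],
    sample_size: int,
    count_tokens: Callable[[str], int] = lambda text: len(text.split()),
    seed: int = 0,
) -> float:
    """Estimate the total token count of `pages` from a random sample.

    `count_tokens` defaults to a whitespace split, which is an assumption
    for illustration; pass the real tokenizer's counting function instead.
    """
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    if not pages:
        return 0.0

    rng = random.Random(seed)           # fixed seed keeps the estimate reproducible
    k = min(sample_size, len(pages))    # cannot sample more pages than exist
    sample = rng.sample(pages, k)       # sample without replacement

    # Mean tokens per sampled page, scaled up to the full document count.
    mean_tokens = sum(count_tokens(page) for page in sample) / k
    return mean_tokens * len(pages)
```

The estimate is the sample mean times the number of documents, so it is only as good as the sample is representative. If page lengths vary a lot between sites, sampling per site and summing the per-site estimates would probably be safer.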
Sorry for the delay. Will get to it this week.
@lintangsutawika any bandwidth to finish this up?
I made all the changes I had talked about, and I separated the scraping steps to make the code a bit simpler, in this PR: https://github.com/r-three/licensed-pile/pull/68. We should be able to merge that one.
Development moved to this branch for easier collaboration: https://github.com/r-three/licensed-pile/pull/68
Sites included. Source: https://opennewswire.org/feed/
CC BY
CC BY-SA
Public Domain
Caravanserai
Voice of America intentionally left out so that it does not overlap with MOT.