sgraaf / Replicate-Toronto-BookCorpus

This repository contains code to replicate the no-longer publicly available Toronto BookCorpus dataset
GNU General Public License v3.0

Question: What about Project Gutenberg as an alternative source? #5

Closed by ghost 4 years ago

ghost commented 4 years ago

https://www.gutenberg.org/wiki/Gutenberg:Information_About_Robot_Access_to_our_Pages

sgraaf commented 4 years ago

I appreciate the suggestion! I looked into this myself previously (per Google's suggestion), but I consider it beyond the scope of this repository. The goal here is a faithful replica of the original Toronto BookCorpus (TBC) dataset, and that cannot be achieved with books from Project Gutenberg, mainly because those books are (understandably) old and their writing styles are correspondingly dated.