leomrocha / gutenberg_explore

Repository with Gutenberg exploration code, notebooks and Webpage with dynamic data exploration Report Paper/Post
MIT License

Start code structure for gutenberg_explore repository. It should have: #1

Open EvgeniiaVak opened 3 years ago

EvgeniiaVak commented 3 years ago

@leomrocha here is my current understanding of the Gutenberg Explore project, with some questions thrown in. Please correct anything that is wrong and point me to where I can find answers to the questions.

(Mainly based on our call and then on https://github.com/leomrocha/mix_nlp/blob/master/utf8/notebooks/ProjGutenberg_report.ipynb which is probably the prep for the webapp)

The pipeline

  1. the books from Gutenberg are cleaned to be plain text data
  2. some magic happens to split the text into tokens
  3. statistics are gathered about each book: id, language, author, date, overall stats, paragraph stats, sentence stats, etc. (are they stored as one row = one book?)
  4. these statistics are stored separately and will be used to power the visualization (what is the approximate size of this data?)
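
To make steps 2 and 3 concrete, here is a minimal sketch of how per-book statistics could be gathered from already-cleaned plain text. This is not the project's actual code: the function name `book_stats`, the field names, and the naive regex-based splitting are all assumptions made for illustration, and the real tokenization step may work quite differently.

```python
import re
from statistics import mean


def book_stats(book_id: str, text: str) -> dict:
    """Gather simple per-book statistics (hypothetical field names)."""
    # Paragraphs: blocks of text separated by one or more blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    # Sentences: naive split after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Tokens: runs of word characters (an assumption, not the real tokenizer).
    tokens = re.findall(r"\w+", text)
    tokens_per_sentence = [len(re.findall(r"\w+", s)) for s in sentences]
    return {
        "id": book_id,
        "n_paragraphs": len(paragraphs),
        "n_sentences": len(sentences),
        "n_tokens": len(tokens),
        "mean_tokens_per_sentence": mean(tokens_per_sentence) if sentences else 0.0,
    }
```

Each call returns one dictionary per book, which would map naturally onto either one row in a table or one JSON file.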

The webapp

leomrocha commented 3 years ago

@EvgeniiaVak Some points. Pipeline: 1, 2 and 4 are right. For step 3, what actually happens today is that each output JSON file corresponds to one book.
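
If a "one row = one book" view is wanted for the visualization, the per-book JSON files could be collected into a list of rows along these lines. This is only a sketch: the directory layout, the `*.json` naming, the helper name `load_book_rows`, and the added `source_file` key are assumptions, not something the current pipeline defines.

```python
import json
from pathlib import Path


def load_book_rows(stats_dir):
    """Read every per-book JSON file in stats_dir into a list of dicts."""
    rows = []
    # Sort for a stable, reproducible row order.
    for path in sorted(Path(stats_dir).glob("*.json")):
        with path.open(encoding="utf-8") as fh:
            record = json.load(fh)
        # Remember which file each row came from (hypothetical extra key).
        record.setdefault("source_file", path.name)
        rows.append(record)
    return rows
```

The resulting list can be written out as a single file or handed to a table library, whichever suits the webapp better.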

For the webapp, I would like to put the data somewhere it can be downloaded, but it does not need to be in the webapp itself; it could be a link to a file storage, for example.