Start code structure for gutenberg_explore repository. It should have:

EvgeniiaVak commented 3 years ago

notebooks
report (everything needed for the webpage goes here)
src (python code)
tests
docs (where the live website is served from, per GitHub Pages)
Maybe Scala src folder for Spark NLP pipeline (future)
Documentation
other things (TBD)
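
As a rough sketch of the layout listed above, the folders could be scaffolded with a few lines of Python. The `scaffold` function, the `LAYOUT` constant and the `.gitkeep` placeholder files are illustrative assumptions, not something already in the repository; the optional Scala folder and the TBD items are left out.

```python
from pathlib import Path

# Folder names taken from the list above (assumption: each item maps to
# one top-level directory; the Scala folder and TBD items are omitted).
LAYOUT = ["notebooks", "report", "src", "tests", "docs"]


def scaffold(root: Path) -> list[Path]:
    """Create the top-level folders under `root` and return their paths."""
    created = []
    for name in LAYOUT:
        folder = root / name
        folder.mkdir(parents=True, exist_ok=True)
        # Empty placeholder so git tracks the otherwise empty folder.
        (folder / ".gitkeep").touch()
        created.append(folder)
    return created
```

Calling `scaffold(Path("."))` from the repository root would create any folders that are missing and leave existing ones untouched.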

EvgeniiaVak commented 3 years ago

@leomrocha here is my current understanding of the Gutenberg Explore project, with some questions thrown in. Please correct me where I am wrong and point me to where I can find the answers to the questions.

(Mainly based on our call and then on https://github.com/leomrocha/mix_nlp/blob/master/utf8/notebooks/ProjGutenberg_report.ipynb which is probably the prep for the webapp)

The pipeline

1. the books from Gutenberg are cleaned to be plain text data
2. some magic happens to split the text into tokens
3. statistics are gathered about each book - id, language, author, date, overall stats, paragraph stats, sentence stats, etc (and stored as such one row = one book?)
4. these statistics are stored separately and will be used to power the visualization (what is the approximate size of this data?)
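
To make the tokenizing and statistics steps above more concrete, here is a minimal sketch of what gathering per-book statistics could look like. It is only a guess at the shape of the data: the function name `book_stats`, the field names, and the regex-based paragraph, sentence and token splitting are all assumptions, and the real tokenization ("some magic") is almost certainly different.

```python
import re
from statistics import mean


def book_stats(book_id: str, text: str, language: str = "unknown") -> dict:
    """Return a flat dict of statistics for one cleaned plain-text book."""
    # Paragraphs: blocks of text separated by one or more blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    # Sentences: naive split on whitespace that follows '.', '!' or '?'.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Tokens: runs of word characters (a stand-in for the real tokenizer).
    tokens = re.findall(r"\w+", text)
    tokens_per_sentence = [len(re.findall(r"\w+", s)) for s in sentences]
    return {
        "id": book_id,
        "language": language,
        "n_paragraphs": len(paragraphs),
        "n_sentences": len(sentences),
        "n_tokens": len(tokens),
        "mean_tokens_per_sentence": (
            mean(tokens_per_sentence) if tokens_per_sentence else 0.0
        ),
    }
```

Author and date would come from the Gutenberg metadata rather than from the text itself, so they are left out of this sketch.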

The webapp

Shows the refined report
Allows exploring (and downloading?) the statistics from step 4 in the pipeline

leomrocha commented 3 years ago

@EvgeniiaVak Some points. Pipeline: 1, 2 and 4 are right. For #3, what actually happens today is that each output JSON file corresponds to one book.
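
A small sketch of the "one JSON file per book" layout, and of how those files could be read back as one row per book, might look like this. The helper names `write_book_stats` and `load_all_stats`, the `<id>.json` file naming and the `id` key are assumptions for illustration; the actual output format of the existing code may differ.

```python
import json
from pathlib import Path


def write_book_stats(stats: dict, out_dir: Path) -> Path:
    """Write the statistics of a single book to `<out_dir>/<id>.json`."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{stats['id']}.json"
    path.write_text(json.dumps(stats, ensure_ascii=False), encoding="utf-8")
    return path


def load_all_stats(out_dir: Path) -> list[dict]:
    """Read every per-book JSON file back as a list of rows (one per book)."""
    return [
        json.loads(p.read_text(encoding="utf-8"))
        for p in sorted(out_dir.glob("*.json"))
    ]
```

The list returned by `load_all_stats` is effectively the "one row = one book" table, which could then be handed to the visualization or exported as a single downloadable file.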

For the webapp I would like to put the data somewhere to be downloaded, but it does not need to be in the webapp; it could be a link to a file storage, for example.

leomrocha / gutenberg_explore

Start code structure for gutenberg_explore repository. It should have: #1
