Closed by shiffman 6 years ago
Totally!
The training should also be easier since we are now using this: https://github.com/keras-team/keras/blob/master/examples/lstm_text_generation.py
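For anyone preparing a new corpus, here is a rough sketch of the data preparation step that kind of character-level script relies on: the text is cut into overlapping fixed-length windows, and the model learns to predict the character that follows each window. The function name `make_windows` and the sample sentence are made up for illustration; the defaults `maxlen=40` and `step=3` are meant to mirror the values in the linked example, but check them against the script itself.

```python
def make_windows(text, maxlen=40, step=3):
    """Cut text into overlapping windows plus the character following each one.

    Hypothetical helper for illustration only, not part of Keras.
    """
    windows = []
    next_chars = []
    # Stop early enough that every window has a following character.
    for i in range(0, len(text) - maxlen, step):
        windows.append(text[i:i + maxlen])
        next_chars.append(text[i + maxlen])
    return windows, next_chars


# Tiny stand-in corpus; a real run would read a whole book from a .txt file.
sample = "the quick brown fox jumps over the lazy dog"
windows, next_chars = make_windows(sample, maxlen=10, step=3)

# Map each distinct character to an integer index for one-hot encoding.
chars = sorted(set(sample))
char_indices = {c: i for i, c in enumerate(chars)}
```

Each `windows[i]` / `next_chars[i]` pair becomes one training example once the characters are one-hot encoded with `char_indices`.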
Noting some authors I've found on Project Gutenberg here (choices are arbitrary):
Zora Neale Hurston: http://www.gutenberg.org/ebooks/author/6368
Virginia Woolf: http://www.gutenberg.org/ebooks/author/89
Mary Shelley: http://www.gutenberg.org/ebooks/author/61
Langston Hughes: https://www.poemhunter.com/langston-hughes/poems/ (not sure if public domain)
I'll also note that the "Bookshelf" page is a good place to start:
https://www.gutenberg.org/wiki/Category:Bookshelf
e.g. https://www.gutenberg.org/wiki/African_American_Writers_(Bookshelf)
There are also books organized by genre/topic. For example, we could make a pre-trained "recipe/cooking" model:
I'm working on training a W.E.B. Du Bois model. I found a few of his publications on Project Gutenberg: http://www.gutenberg.org/ebooks/search/?query=dubois
I also found some poems by Maya Angelou: https://www.poemhunter.com/maya-angelou/poems/ -- as with the Langston Hughes poems, I'm not sure if they're in the public domain
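One practical note for the Project Gutenberg downloads: the plain-text files wrap the book in license text, which is worth removing so the model doesn't learn to generate it. Below is a minimal sketch, assuming the file uses the usual `*** START OF THIS PROJECT GUTENBERG EBOOK ... ***` and `*** END OF ...` marker lines. The exact wording of those markers varies between files, so treat the patterns as a starting point. The helper name `strip_gutenberg_boilerplate` and the sample text are made up for illustration.

```python
import re

# Assumed marker formats; individual Gutenberg files may word these differently.
START_RE = re.compile(
    r"\*\*\*\s*START OF (?:THIS|THE) PROJECT GUTENBERG EBOOK.*?\*\*\*",
    re.IGNORECASE,
)
END_RE = re.compile(
    r"\*\*\*\s*END OF (?:THIS|THE) PROJECT GUTENBERG EBOOK.*?\*\*\*",
    re.IGNORECASE,
)


def strip_gutenberg_boilerplate(raw):
    """Return the text between the start and end markers, if they are present."""
    start = START_RE.search(raw)
    end = END_RE.search(raw)
    begin = start.end() if start else 0
    # Ignore an end marker that appears before the start marker.
    finish = end.start() if end and end.start() >= begin else len(raw)
    return raw[begin:finish].strip()


# Made-up miniature file standing in for a real download.
sample = (
    "License header text\n"
    "*** START OF THIS PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "Chapter 1\nIt was a start.\n"
    "*** END OF THIS PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "License footer text\n"
)
cleaned = strip_gutenberg_boilerplate(sample)
```

If neither marker is found, the function returns the input unchanged apart from trimmed whitespace, so it is safe to run over text from other sources too.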
Just a note here that I am going to attempt to train a model on Zora Neale Hurston's texts. I will post updates.
@cvalenzuela do you have a Virginia Woolf model already from Selected Stories we can add here? (I am about to make one.)
I do! I also have a Bolaño one! Will update them now.
Complete for now. This is the current list: Roberto Bolaño, Charles Darwin, W.E.B. Du Bois, Ernest Hemingway, William Shakespeare, Mary Shelley, Zora Neale Hurston, and Virginia Woolf.
Going through our materials, I noticed that our example LSTM models are trained on the writing of three deceased white men. I propose we include additional LSTM models that represent more diverse authors (specifically women and people of color).
We could also consider LSTM models built from multiple authors (news data, terms of service documents, etc.). I'm happy to work on training some models to add, and I would love suggestions in this thread!
(There's also an interesting question about copyright. If we are publishing a model trained on text that is not in the public domain, what copyright issues does this raise?)