ml5js / ml5-data-and-models

Data sets and pre-trained models for ml5.js
https://ml5js.org/docs/data-overview
MIT License

Diverse LSTM models #9

Closed by shiffman 6 years ago

shiffman commented 6 years ago

Going through our materials, I noticed that our example LSTM models are all trained on the work of three deceased white men. I propose we include additional LSTM models that represent more diverse authors (specifically women and people of color).

We could also consider LSTM models built from multiple authors (news data, terms of service documents, etc.). I'm happy to work on training some models to add, and I would love suggestions in this thread!

(There's also an interesting question about copyright. If we publish a model trained on text that is not in the public domain, what copyright issues does this raise?)

cvalenzuela commented 6 years ago

Totally!

The training should also be easier since we are now using this: https://github.com/keras-team/keras/blob/master/examples/lstm_text_generation.py
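For anyone preparing a new corpus, here is a minimal sketch of the kind of data preparation a character-level text generation script performs before training: cutting the text into overlapping windows and recording the character that follows each one. The function names and the default window length and step are illustrative assumptions, not taken from ml5 or copied from the linked script.

```python
def make_training_windows(text, maxlen=40, step=3):
    """Cut text into overlapping character windows.

    Returns two parallel lists: the windows, and the single character
    that follows each window (the prediction target).
    maxlen and step are assumed defaults for illustration only.
    """
    windows = []
    next_chars = []
    # Stop early enough that every window has a following character.
    for i in range(0, len(text) - maxlen, step):
        windows.append(text[i:i + maxlen])
        next_chars.append(text[i + maxlen])
    return windows, next_chars


def build_char_index(text):
    """Map each distinct character in the corpus to an integer id."""
    chars = sorted(set(text))
    return {c: i for i, c in enumerate(chars)}


if __name__ == "__main__":
    sample = "the quick brown fox jumps over the lazy dog"
    windows, targets = make_training_windows(sample, maxlen=10, step=5)
    print(len(windows), "windows;", len(build_char_index(sample)), "distinct characters")
```

The windows and targets would then be one-hot encoded using the character index and fed to the model; that part depends on the training script being used.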

shiffman commented 6 years ago

Noting some authors I've found on Project Gutenberg here (choices are arbitrary):

Zora Neale Hurston: http://www.gutenberg.org/ebooks/author/6368
Virginia Woolf: http://www.gutenberg.org/ebooks/author/89
Mary Shelley: http://www.gutenberg.org/ebooks/author/61
Langston Hughes: https://www.poemhunter.com/langston-hughes/poems/ (not sure if public domain)

I'll also note that the "Bookshelf" page is a good place to start:

https://www.gutenberg.org/wiki/Category:Bookshelf
e.g. https://www.gutenberg.org/wiki/African_American_Writers_(Bookshelf)

There are also books organized by genre/topic. For example we could make a pre-trained "recipe/cooking" model:

https://www.gutenberg.org/wiki/Cookery_(Bookshelf)

nikitahuggins commented 6 years ago

I'm working on training a W.E.B. Du Bois model. I found a few of his publications on Project Gutenberg: http://www.gutenberg.org/ebooks/search/?query=dubois

I also found some poems by Maya Angelou: https://www.poemhunter.com/maya-angelou/poems/ (same as above, not sure if they're public domain)

shiffman commented 6 years ago

Just a note here that I am going to attempt to train a model with Zora Neale Hurston text, will post updates.

shiffman commented 6 years ago

@cvalenzuela do you have a Virginia Woolf model already from Selected Stories we can add here? (I am about to make one.)

cvalenzuela commented 6 years ago

I do! I also have a Bolaño one! Will update them now.

cvalenzuela commented 6 years ago

see https://github.com/ml5js/ml5-data-and-training/pull/17

shiffman commented 6 years ago

Complete for now. This is the current list: Roberto Bolaño, Charles Darwin, W.E.B. Du Bois, Ernest Hemingway, William Shakespeare, Mary Shelley, Zora Neale Hurston, and Virginia Woolf.