ml5js / ml5-data-and-models

Data sets and pre-trained models for ml5.js
https://ml5js.org/docs/data-overview
MIT License

Adding JK Rowling model and data for Harry Potter Books #34

Closed msbrown closed 5 years ago

msbrown commented 5 years ago

Adding data with the text of all the Harry Potter books (cleaned up from Project Gutenberg) and a JKRowling model trained on the Harry Potter text. The LSTM model was trained with the following parameters: --rnn_size 512 --num_layers 2 --seq_length 128 --batch_size 64 --dropout 0.25
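
For anyone who wants the flags above as a structured config, here is a minimal Python sketch. It only covers the five flags quoted in this comment. The helper name `parse_lstm_flags`, the use of argparse, and the help strings (a reading of the flag names) are assumptions for illustration, not taken from the actual training script, which may define more flags or different semantics.

```python
import argparse


def parse_lstm_flags(argv):
    """Parse the hyperparameter flags quoted above into a namespace.

    Hypothetical helper: the real training script may accept more flags
    or interpret these values differently.
    """
    parser = argparse.ArgumentParser(description="LSTM hyperparameters (sketch)")
    parser.add_argument("--rnn_size", type=int, help="hidden units per LSTM layer")
    parser.add_argument("--num_layers", type=int, help="number of stacked LSTM layers")
    parser.add_argument("--seq_length", type=int, help="length of each training sequence")
    parser.add_argument("--batch_size", type=int, help="sequences per batch")
    parser.add_argument("--dropout", type=float, help="dropout value as passed on the command line")
    return parser.parse_args(argv)


# The exact flag string from the comment above.
flags = "--rnn_size 512 --num_layers 2 --seq_length 128 --batch_size 64 --dropout 0.25"
config = parse_lstm_flags(flags.split())
print(vars(config))
```

Keeping the values in one namespace like this makes it easier to record them next to a published model so the training run can be reproduced.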

cvalenzuela commented 5 years ago

great, thanks!

shiffman commented 5 years ago

Hey! This is wonderful, but I'm not sure we can include it due to copyright? We discussed this a bit in #9. Or am I mistaken and Harry Potter is fair game? Is it on Project Gutenberg?

cvalenzuela commented 5 years ago

Oops! That's true! We can still keep the model but not the source text, no?

shiffman commented 5 years ago

I think this is a grey area and a super interesting question! Can we publish a model trained on text not in the public domain? I think for the ml5 project we probably should err on the conservative side and not include any models trained on text we don't have the rights to? This isn't a legal opinion by any means of course and doesn't preclude independent projects making use of other models!

cvalenzuela commented 5 years ago

Ok, sounds good. We can do a cleanup with https://github.com/ml5js/ml5-data-and-models/issues/30 and only keep models that were trained on text we have the rights to.

msbrown commented 5 years ago

All of the books are listed on archive.org in their opensource collection (in case that helps): https://archive.org/details/welcometohogwarts & https://archive.org/details/opensource. They were listed with a no-copyright tag: https://creativecommons.org/publicdomain/zero/1.0/

msbrown commented 5 years ago

Another consideration on copyright: Should the policy be across the board? If so, then that may have implications for images used in Styletransfer (and/or sourcing).

shiffman commented 5 years ago

Yes, I was thinking this as well. I am not sure how best to approach this, but yes, I believe that any datasets we use for training (images, text, etc.) should have an appropriate license. This likely affects the pix2pix models in particular. Let's think about this and discuss at our next meeting?

cvalenzuela commented 5 years ago

Sounds good!