openai / requests-for-research

A living collection of deep learning problems
https://openai.com/requests-for-research

Comedian neural network research request #11

Closed iaroslav-ai closed 8 years ago

iaroslav-ai commented 8 years ago

Create a deep neural network capable of generating funny jokes.

ilyasu123 commented 8 years ago

Scraping: add a note in the instructions to obey robots.txt or whatever requests the website makes with respect to scraping. I heard that reddit has exposed a few terabytes of comments for download -- you should advise getting those instead. In fact, I believe that your problem will have a much greater chance of being solved if you preprocess the data yourself!
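For concreteness, here's a minimal sketch of that preprocessing, assuming the dump is bz2-compressed, newline-delimited JSON with `subreddit` and `body` fields (the format of the public comment archives); the filename is illustrative:

```python
# Stream one month of the public reddit comment dump and keep only
# r/Jokes comments. Assumes newline-delimited JSON with "subreddit"
# and "body" fields; the dump filename is illustrative.
import bz2
import json

def extract_jokes(dump_path, out_path, subreddit="Jokes"):
    with bz2.open(dump_path, "rt", encoding="utf-8") as dump, \
         open(out_path, "w", encoding="utf-8") as out:
        for line in dump:
            try:
                comment = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip rare malformed lines
            if comment.get("subreddit") != subreddit:
                continue
            body = comment.get("body", "")
            if body in ("[deleted]", "[removed]") or not body.strip():
                continue
            # one joke per line, internal newlines flattened
            out.write(body.replace("\n", " ").strip() + "\n")

extract_jokes("RC_2015-01.bz2", "jokes.txt")
```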

Step 3 cannot be done with today's technology. The network needs to output text, and text is discrete, so unless you implement a policy gradient algorithm (which is hard but probably doable), this step will just be impossible. I suggest that the request instead be to train a more conventional language model, like the one in https://arxiv.org/abs/1602.02410. I also recommend that you do not include a critic network.
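For reference, a minimal sketch of that conventional route in PyTorch: an LSTM trained with plain maximum likelihood on next-token prediction -- no critic and no policy gradients. The hyperparameters are illustrative, not tuned:

```python
import torch
import torch.nn as nn

class CharLM(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=512, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x, state=None):
        # x: (batch, seq_len) token ids
        h, state = self.lstm(self.embed(x), state)
        return self.head(h), state

def train_step(model, optimizer, batch):
    # batch: (batch, seq_len) token ids; train to predict each next token
    inputs, targets = batch[:, :-1], batch[:, 1:]
    logits, _ = model(inputs)
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

model = CharLM(vocab_size=128)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```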

Likewise, I recommend clarifying step 4. The easiest approach is to use Mechanical Turk, which is not free; one free alternative is to create a Twitter account and gauge reactions there.

For this task, training a language model is a much more promising approach than using the stochastic neural network.

I recommend removing the requirement to keep track of previous jokes. To summarize: the goal should be to train the biggest baddest LSTM on jokes.

To improve the LSTM and make it more interesting, it would be nice if, for each joke, you could extract a few keywords that describe its topic, or better yet its genre, and train the LSTM to maximize the likelihood of the joke given its topic and genre. If you preprocess the data yourself and put it online someplace, the chance that someone will solve this problem will increase substantially.
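One simple way to implement that conditioning (an illustrative choice, not the only one) is to prepend control tokens to each joke, so an ordinary maximum-likelihood LSTM learns p(joke | topic, genre):

```python
def format_example(joke, topic, genre):
    # Control tokens are a plain-text convention, not a library feature.
    return f"<topic:{topic}> <genre:{genre}> {joke}"

print(format_example(
    "Why did the scarecrow win an award? He was outstanding in his field.",
    topic="farming", genre="pun"))
# -> <topic:farming> <genre:pun> Why did the scarecrow win an award? ...
```

At sampling time, you seed the model with the control prefix to steer generation toward a topic and genre.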

iaroslav-ai commented 8 years ago

OK, thanks for the informative feedback :) I will make changes accordingly. I think I even saw some relevant datasets around when I was doing background research.

ilyasu123 commented 8 years ago

Any updates?

iaroslav-ai commented 8 years ago

Sorry for the delay. I will push updates in a day or so.

ilyasu123 commented 8 years ago

Looks decent. Will edit and merge later today.

ilyasu123 commented 8 years ago

Right now the problem is to collect a dataset and then to apply char-rnn to it.
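Since char-rnn trains on a single plain-text file (`data/<name>/input.txt`), the dataset step reduces to concatenating the jokes with a consistent delimiter. A sketch, with the blank-line separator being an arbitrary choice:

```python
import os

def build_corpus(jokes_path, data_dir="data/jokes"):
    # char-rnn reads a single file named input.txt inside its data dir.
    os.makedirs(data_dir, exist_ok=True)
    with open(jokes_path, encoding="utf-8") as f:
        jokes = [line.strip() for line in f if line.strip()]
    with open(os.path.join(data_dir, "input.txt"), "w", encoding="utf-8") as out:
        # Blank line between jokes is an arbitrary delimiter choice.
        out.write("\n\n".join(jokes))

build_corpus("jokes.txt")
```

Training is then roughly `th train.lua -data_dir data/jokes`, per the char-rnn README.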

To make it more interesting, I'd add:

iaroslav-ai commented 8 years ago

Yeah, char-rnn seems like the easiest tool so far for the outlined task.

I will make the changes now according to the above comment. If this could cause conflicts of any sort, let me know.

iaroslav-ai commented 8 years ago

Let me know if there are any additional changes you would like me to make.

ilyasu123 commented 8 years ago

Will merge tomorrow and make a few edits.

ilyasu123 commented 8 years ago

So I've merged it, but then realized that it's not ready yet. To make the question really good, you need to download the appropriate dataset from r/Jokes. It's worth doing because otherwise everyone who attempts to solve the problem will have to redo the grunt work of getting the dataset, and it matters because the small datasets you've linked to are too small for training a reasonable language model. I know there are tools for this (like https://github.com/NSchrading/redditDataExtractor). There is also https://www.reddit.com/r/datasets/comments/3bxlg7/i_have_every_publicly_available_reddit_comment/, which lets you download all of reddit without using a scraper.
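A sketch for the submission dumps specifically: on r/Jokes the setup usually lives in the post title and the punchline in the body, so both should be kept. The field names (`title`, `selftext`, `subreddit`) match the public dumps; the filename is illustrative:

```python
import bz2
import json

with bz2.open("RS_2015-01.bz2", "rt", encoding="utf-8") as dump, \
     open("jokes_full.txt", "w", encoding="utf-8") as out:
    for line in dump:
        try:
            post = json.loads(line)
        except json.JSONDecodeError:
            continue
        if post.get("subreddit") != "Jokes":
            continue
        # Setup is the title, punchline is the selftext body.
        text = (post.get("title", "") + " " + post.get("selftext", "")).strip()
        if text and "[removed]" not in text and "[deleted]" not in text:
            out.write(text.replace("\n", " ") + "\n")
```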

Do that, and the problem will become extremely good.

iaroslav-ai commented 8 years ago

I will take a look at it. Once I have the dataset available, I will create a separate merge where I will either link to the dataset or provide a torrent if it turns out to be big (my guess is it's no more than 10 GB compressed).

ilyasu123 commented 8 years ago

Sounds good, looking forward!
