Use Sarcasm v2 Dataset - Githubissues

mubaris / urban-robot

Reddit bot :computer: which replies to sarcastic comments :trollface: :trollface:


Use Sarcasm v2 Dataset #1

Open mubaris opened 6 years ago

mubaris commented 6 years ago

Sarcasm v2 is a better dataset for this project, since it has both the parent comment and the reply. Apply this dataset to improve the predictions.

cagdasgerede commented 6 years ago

v2 is a single CSV file. I can write a Python function to convert that file into the format sklearn.datasets.load_files expects.

For example, for the following data point:

Corpus,Label,ID,Quote
GEN,sarc,GEN_sarc_0000,First off, That's grade A USDA approved Liberalism in a nutshell.
GEN,notsarc,GEN_notsarc_1136,First

Programmatically:
1) I can create a file GEN_sarc_0000.txt which contains "First off, That's grade A USDA approved Liberalism in a nutshell." and a file GEN_notsarc_1136.txt which contains "First".
2) Then I can put the files into the container/sarc and container/notsarc folders respectively.

This way the current data loading can work as it is.
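
A minimal sketch of that conversion, assuming the CSV really has the Corpus, Label, ID and Quote columns from the example and that any text containing commas is properly quoted in the file. The function name and the `container` directory are only illustrative:

```python
import csv
from pathlib import Path


def csv_to_load_files_layout(csv_path, container_dir):
    """Write one text file per CSV row to container_dir/<Label>/<ID>.txt.

    Assumes the header row is Corpus,Label,ID,Quote and that fields
    containing commas are quoted. Returns the number of files written.
    """
    container = Path(container_dir)
    written = 0
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            # One sub-folder per label, e.g. container/sarc, container/notsarc
            label_dir = container / row["Label"]
            label_dir.mkdir(parents=True, exist_ok=True)
            target = label_dir / f"{row['ID']}.txt"
            target.write_text(row["Quote"], encoding="utf-8")
            written += 1
    return written
```

Calling `csv_to_load_files_layout("v2.csv", "container")` would then leave a folder-per-label tree that the existing loader can read.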

What do you think about this approach?

mubaris commented 6 years ago

The v2 dataset has Quote and Reply columns. That's why it's better than v1. If we have both the parent comment and the reply, I think our bot will be more accurate.

Do not go down the route you proposed.

cagdasgerede commented 6 years ago

It sounds like you are describing a more substantial change. What are the steps to achieve what you propose? Since you labeled this as hacktoberfest, could you provide some more direction?

cromagnonninja commented 6 years ago

Can I work on this issue? What exactly are the problems or concerns regarding this issue at the moment?

mubaris commented 6 years ago

@bhanu1911

Current method - We generate features from a single text field to train the models.

Desired method - The v2 dataset provides 2 text fields: a quote (the parent comment) and the reply to it. We want to build new models based on these 2 inputs.

Hope this helps

cromagnonninja commented 6 years ago

Basically this means we have to start from the ground up - we now have to train a model for the replies too, if I'm not wrong? (I'll study the code and see how you trained the first time around.) Plan of action:

1) Split the CSV file into two parts, quote and reply.
2) Train and test on both parts after the split.
3) Configure the bot to send a reply only when all algorithms flag the comment with reasonably high confidence.

I believe that'll be the way to go?
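
The last step of that plan could be a simple gate like the sketch below. It assumes each trained model can give a probability that the comment is sarcastic; the function name and the 0.8 threshold are hypothetical, not something taken from the repo:

```python
def all_models_agree(sarcasm_probabilities, threshold=0.8):
    """Return True only if every model's sarcasm probability meets the threshold.

    sarcasm_probabilities: one float per model, each in [0, 1].
    An empty list never triggers a reply.
    """
    if not sarcasm_probabilities:
        return False
    return all(p >= threshold for p in sarcasm_probabilities)
```

The bot would post only when `all_models_agree(...)` is true, so a single unsure model is enough to stay silent.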

cromagnonninja commented 6 years ago

Could you guide me as to how you created the dataset?

mubaris commented 6 years ago

@bhanu1911 What I was thinking is a little different.

1) Train the model with 2 inputs: quote and reply.
2) To decide whether a comment on Reddit is sarcastic, we consider the comment (reply) and its parent comment (quote).

This makes sense because sarcasm is context-based. Having the comment and its parent comment will be more accurate than a single comment alone.
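
One way to feed both texts to a single model is to keep their words in separate feature namespaces, so "nice" in the quote and "nice" in the reply count as different features. A rough sketch, not the project's actual code; `pair_features`, the tokenizer, and the `quote=` / `reply=` prefixes are made up for illustration:

```python
import re
from collections import Counter

# Very simple tokenizer: lowercase words, digits and apostrophes only.
TOKEN_PATTERN = re.compile(r"[a-z0-9']+")


def tokenize(text):
    """Split text into lowercase tokens."""
    return TOKEN_PATTERN.findall(text.lower())


def pair_features(quote, reply):
    """Count tokens from both fields, prefixed so a model can tell them apart."""
    features = Counter()
    for token in tokenize(quote):
        features[f"quote={token}"] += 1
    for token in tokenize(reply):
        features[f"reply={token}"] += 1
    return dict(features)
```

Dicts like these could be turned into a matrix with something like scikit-learn's `DictVectorizer` before training, though other ways of combining the two inputs would work too.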

mubaris commented 6 years ago

I think the source gives enough background about how they created the dataset - Sarcasm v2

cromagnonninja commented 6 years ago

I meant how did you partition the dataset?