mubaris / urban-robot

Reddit bot :computer: which replies to sarcastic comments :trollface: :trollface:

Use Sarcasm v2 Dataset #1

Open mubaris opened 6 years ago

mubaris commented 6 years ago

Sarcasm v2 is a better dataset for this project, since it has both the parent comment and the reply. Apply this dataset to make the predictions better.

cagdasgerede commented 6 years ago

v2 is a single CSV file. I could write a Python function to convert that file into the format sklearn.datasets.load_files expects.

For example, for the following data points:

Corpus,Label,ID,Quote
GEN,sarc,GEN_sarc_0000,First off, That's grade A USDA approved Liberalism in a nutshell.
GEN,notsarc,GEN_notsarc_1136,First

Programmatically:

1) I can create a file GEN_sarc_0000.txt which contains "First off, That's grade A USDA approved Liberalism in a nutshell." and a file GEN_notsarc_1136.txt which contains "First".
2) Then, I can put the files into the container/sarc and container/notsarc folders respectively.

This way the current data loading can work as it is.
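A minimal sketch of that conversion, using only the standard library. The column names (Label, ID, Quote) come from the sample above; the function name, the `container` argument, and the assumption that quotes containing commas are properly quoted in the CSV are mine, not something the dataset or the repo confirms.

```python
import csv
import io
import tempfile
from pathlib import Path

# Sample rows from the comment above; the first quote is wrapped in double
# quotes here because it contains a comma (an assumption about the real file).
SAMPLE_CSV = """Corpus,Label,ID,Quote
GEN,sarc,GEN_sarc_0000,"First off, That's grade A USDA approved Liberalism in a nutshell."
GEN,notsarc,GEN_notsarc_1136,First
"""


def csv_to_load_files_layout(csv_file, container, text_column="Quote"):
    """Write one <ID>.txt per row under container/<Label>/.

    This is the folder-per-category layout that a loader such as
    load_files reads. Returns the list of paths written.
    """
    container = Path(container)
    written = []
    for row in csv.DictReader(csv_file):
        label_dir = container / row["Label"]
        label_dir.mkdir(parents=True, exist_ok=True)
        path = label_dir / f"{row['ID']}.txt"
        path.write_text(row[text_column], encoding="utf-8")
        written.append(path)
    return written


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for path in csv_to_load_files_layout(io.StringIO(SAMPLE_CSV), tmp):
            print(path.relative_to(tmp))
```

For a real file on disk, it would be opened with `open(path, newline="", encoding="utf-8")` and passed in place of the `io.StringIO` object.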

What do you think about this approach?

mubaris commented 6 years ago

v2 Dataset has columns Quote and Reply. That's why it's better than v1. If we have both parent comment and reply, I think our bot will have better accuracy.

Let's not go with the method you proposed.

cagdasgerede commented 6 years ago

It sounds like you are describing a more substantial change. What are the steps for achieving what you propose? Since you labeled this as hacktoberfest, could you provide some more direction?

cromagnonninja commented 6 years ago

Can I work on this issue? What exactly are the problems or concerns regarding this issue at the moment?

mubaris commented 6 years ago

@bhanu1911

Current method: we generate features from a single text field to train the models.

Desired method: the v2 dataset provides two text fields, the quote (parent comment) and the reply to it. We want to build new models based on these two inputs.
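One possible way to turn the two fields into a single model input, sketched with the standard library only. This is an illustration, not the repo's feature pipeline: the function names, the tokenizer, and the `quote=` / `reply=` prefixes are all assumptions. Tagging each token with its field keeps "nice" in the parent comment distinct from "nice" in the reply.

```python
import re
from collections import Counter

# Hypothetical tokenizer: lowercase words, digits, and apostrophes.
TOKEN_RE = re.compile(r"[a-z0-9']+")


def tokenize(text):
    """Split text into lowercase tokens."""
    return TOKEN_RE.findall(text.lower())


def pair_features(quote, reply):
    """Bag-of-words counts, with each token tagged by the field it came from."""
    features = Counter()
    for token in tokenize(quote):
        features["quote=" + token] += 1
    for token in tokenize(reply):
        features["reply=" + token] += 1
    return features


if __name__ == "__main__":
    example = pair_features("Nice weather today", "Oh yeah, nice indeed")
    for name, count in sorted(example.items()):
        print(name, count)
```

Feature dicts like these could then be handed to a vectorizer (for example scikit-learn's DictVectorizer) and on to the existing classifiers, though whether that beats the single-field models would need to be measured.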

Hope this helps

cromagnonninja commented 6 years ago

Basically this means we have to start from the ground up - we now have to train a model for the replies too, if I'm not wrong? (I'll study the code and see how you trained the first time around.) Plan of action:

  1. Split the CSV file into two parts, quote and reply.
  2. Train and test on both parts after the split.
  3. Configure the bot to send only those replies which get a reasonably high accuracy from all algorithms. I believe that'll be the way to go?
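Step 3 could be sketched roughly as below. It reads "reasonably high accuracy from all algorithms" as "every model's predicted sarcasm probability clears a threshold", which is my interpretation; the function name, the model names, and the 0.8 default are hypothetical.

```python
def should_reply(sarcasm_probabilities, threshold=0.8):
    """Return True only if every model's sarcasm probability meets the threshold.

    sarcasm_probabilities maps a model name to its predicted probability
    that the comment is sarcastic. An empty mapping never triggers a reply.
    """
    if not sarcasm_probabilities:
        return False
    return all(p >= threshold for p in sarcasm_probabilities.values())


if __name__ == "__main__":
    # Hypothetical scores from three classifiers for one comment.
    scores = {"naive_bayes": 0.91, "logistic_regression": 0.86, "svm": 0.55}
    print(should_reply(scores))
```

Requiring agreement from all models trades recall for precision: the bot stays silent more often, but is less likely to reply to a comment that was not sarcastic.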

cromagnonninja commented 6 years ago

Could you guide me as to how you created the dataset?

mubaris commented 6 years ago

@bhanu1911 What I was thinking is a little different.

This makes sense because sarcasm is context-based. Having a comment together with its parent comment will be more accurate than a single comment.

mubaris commented 6 years ago

I think the source gives enough background about how they created the dataset - Sarcasm v2

cromagnonninja commented 6 years ago

I meant how did you partition the dataset?