renan-campos / MSA

Movie Sentiment Analyzer

generates empty list error #4

Open vinayakumarr opened 8 years ago

vinayakumarr commented 8 years ago

Due to a memory problem, I used only 1000 files each in train/pos and train/neg. When I run python src/train.py, after some time it generates the following error. How can I solve this? Please see the attached file.

renan-campos commented 8 years ago

In the original code, we performed 40 batches with 600 files in each (300 pos/300 neg). So your code fails on batch 2 because it runs out of files to train on. I just updated the code to have variables to tune the development set size, number of batches, and batch size on line 46. Change these variables so that DEV_SIZE + BATCH_SIZE*BATCHES = len(train)
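The constraint above can be checked before starting a run. The following is a minimal sketch, not code from train.py; the function names are hypothetical and the numbers are illustrative only.

```python
def files_required(dev_size, batches, batch_size):
    """Number of training files the configuration will consume."""
    return dev_size + batches * batch_size


def check_sizes(n_train, dev_size, batches, batch_size):
    """Return files required minus files available.

    0 means the settings match exactly; a positive value means training
    would run out of files before the last batch finishes.
    """
    return files_required(dev_size, batches, batch_size) - n_train


# Illustrative numbers only: 2000 training files in total.
print(check_sizes(2000, 1000, 20, 50))   # settings that fit the data
print(check_sizes(2000, 1000, 40, 600))  # many more files requested than exist
```

A non-zero result means one of DEV_SIZE, BATCHES, or BATCH_SIZE should be adjusted before training.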

vinayakumarr commented 8 years ago

500+(50*10)=len(train). It works successfully. The test data set has 250 files, and every one of them is classified as 0. How can I see the F-score, precision, and accuracy? I learned that batch files are created in the tmp folder and that they contain this information, but every file shows accuracy, precision, and recall as 1. Do you know where the problem occurred, or whether the results are correct?

renan-campos commented 8 years ago

It is odd that they all show up as 1. What does it say for TP, TN, FP, and FN? predict.py has the code for the metrics.

vinayakumarr commented 8 years ago

The FP.txt and FN.txt files don't contain anything, but TP.txt and TN.txt contain 1001 lines (including the file names). You can see part of TP.txt and TN.txt in the images attached below.

TP: (attached image, named "tn")

TN: (attached image, named "tp")
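For reference, the standard definitions of these metrics show why empty FP and FN lists give 1.0 everywhere. This is a sketch of the textbook formulas, not the code in predict.py, and the counts used are illustrative.

```python
def metrics(tp, tn, fp, fn):
    """Compute precision, recall, accuracy and F1 from confusion counts."""

    def ratio(numerator, denominator):
        # Return 0.0 instead of raising when a denominator is zero.
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    accuracy = ratio(tp + tn, tp + tn + fp + fn)
    f1 = ratio(2 * precision * recall, precision + recall)
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "f1": f1,
    }


# With no false positives and no false negatives every metric is 1.0.
print(metrics(tp=1000, tn=1000, fp=0, fn=0))
```

A perfect score on every batch usually deserves a look at the input data before it is trusted.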

vinayakumarr commented 8 years ago

predict.py shows the result below (attached image).

pos (train) contains 1002 files, neg (train) contains 1002 files, and test contains 252 files.

renan-campos commented 8 years ago

Are the TN and TP pictures right? TP should display positive, and TN should show negative. It's also odd that it lists 2002 files instead of 2004 if there are 1002 files in each directory. Are there any batch_#.txt files in your tmp directory (such as batch_0.txt)?

vinayakumarr commented 8 years ago

I made a mistake in my previous post: pos (train) contains 1001 files, neg (train) contains 1001 files, and test contains 251 files.

Inside the src/tmp folder there are 50 batch_*.txt files.

Each file has precision, recall, F1, and accuracy. It shows the total number of test files as 1000, but test has only 251 files.

What are these batch files?

How can I correct this?

renan-campos commented 8 years ago

So the batch files are calculating the metrics on the development set during each batch iteration. The reason it says 1000 files is because your dev set was specified as 500, and there was a bug that took 500 from positive and negative instead of 500 total.
I just updated the code to fix this, so now if you specify a dev size of 500, it will actually be 500.

Running predict by itself will do the calculations on the entire training set. To classify the test set, run test.py. This will generate a csv file of ID,LABEL.
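The dev-set bug described above can be illustrated as follows. This is a hypothetical sketch of the two behaviours, not the actual code in train.py; the function name and the file names are placeholders.

```python
def split_dev(pos, neg, dev_size, per_class=False):
    """Split labelled file lists into (dev, remaining).

    per_class=True mimics the behaviour described as a bug: dev_size files
    are taken from each class. per_class=False takes dev_size files in
    total, half from each class.
    """
    n = dev_size if per_class else dev_size // 2
    dev = pos[:n] + neg[:n]
    remaining = pos[n:] + neg[n:]
    return dev, remaining


pos = [f"pos_{i}" for i in range(1000)]  # placeholder names
neg = [f"neg_{i}" for i in range(1000)]

buggy_dev, _ = split_dev(pos, neg, 500, per_class=True)
fixed_dev, _ = split_dev(pos, neg, 500)
print(len(buggy_dev), len(fixed_dev))  # 1000 500
```

This is why the batch files reported 1000 files for a dev size of 500.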

renan-campos commented 8 years ago

Oops, just completed the push now. Try pulling, changing the SIZE variables and running again.

vinayakumarr commented 8 years ago

I have 1000 files in train/pos and 1000 in train/neg, so I used

DEV_SIZE = 1000 # Split evenly between pos/neg
BATCHES = 20
BATCH_SIZE = 50

Is it correct?

renan-campos commented 8 years ago

Yes.

vinayakumarr commented 8 years ago

Now I get the previously solved error again (see the attached image).

renan-campos commented 8 years ago

That's odd. On line 161 of train.py, add the following just to make sure the script sees 2000 files: print len(training_set) ; exit()

Could you show batch_0.txt?

vinayakumarr commented 8 years ago

It shows 1998.

renan-campos commented 8 years ago

That would explain the pop from empty list error, and that means train.py only sees 999 files in each directory (pos/neg).
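A small sketch shows how a two-file shortfall produces this error. It assumes the training loop removes files from a list with pop(), which is a guess at how train.py consumes its data; the function name is hypothetical.

```python
def consume(files, dev_size, batches, batch_size):
    """Pop a dev set and then each batch; return how many files are left."""
    remaining = list(files)
    dev = [remaining.pop() for _ in range(dev_size)]
    for _ in range(batches):
        batch = [remaining.pop() for _ in range(batch_size)]
    return len(remaining)


files = [f"file_{i}" for i in range(1998)]  # 999 pos + 999 neg, as reported

try:
    consume(files, dev_size=1000, batches=20, batch_size=50)
except IndexError as err:
    # The last batch finds only 48 files and fails on the 49th pop.
    print("failed:", err)

# Settings that add up to exactly 1998 leave nothing over and do not fail.
print(consume(files, dev_size=998, batches=20, batch_size=50))
```

Either the sizes must match the number of files the script actually sees, or the missing files need to be found.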

vinayakumarr commented 8 years ago

When I change it to

DEV_SIZE = 998 # Split evenly between pos/neg
BATCHES = 20
BATCH_SIZE = 50

it works fine, but there is no improvement in the results. The result.csv file contains 0 for all values. Also, the batch files show accuracy, precision, and recall all as 1.

Is this because the training data set is too small?

renan-campos commented 8 years ago

Could be. Having accuracy, precision, and recall as 1.0 for all of the files is really peculiar. Is the training data you're using publicly available? I'd like to give it a try.

vinayakumarr commented 8 years ago

I am using only your data. I have taken a subset of the training and testing data and am trying it. I am not using my own dataset.

vinayakumarr commented 8 years ago

Problem solved. I am able to get the exact results. I have one more question: I am using a data set in another language. Could you please tell me where we need to change the code to make it work, and also how to increase the accuracy?

renan-campos commented 8 years ago

That expression captures words that contain apostrophes and dashes. How did you fix the issue?

vinayakumarr commented 8 years ago

I didn't make any changes to the code, but I had not copied your data set to the data folder properly. The train folder had files, but when I checked, they did not contain any text. Your code works fine. I have another question: I want to use it for another language, so could you please tell me where I need to change the code? I think the only thing I need to change is the following regular expression. Am I correct? (?:[a-z][a-z'-_]+[a-z])
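Empty training files like the ones described above can be detected before training. The helper below is a hypothetical sketch and not part of the repository; the directory is passed in by the caller, and the demonstration uses a throwaway directory with made-up file names.

```python
import tempfile
from pathlib import Path


def summarize_text_files(directory):
    """Return (total, empty) counts for the .txt files directly in directory."""
    paths = sorted(Path(directory).glob("*.txt"))
    empty = [p for p in paths if p.stat().st_size == 0]  # zero-byte files
    return len(paths), len(empty)


# Self-contained demonstration on a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "1.txt").write_text("a great movie")
    (Path(tmp) / "2.txt").write_text("")  # copied without its content
    total, empty = summarize_text_files(tmp)

print(total, empty)  # 2 1
```

Running a check like this on each training directory would reveal both a wrong file count and files that lost their text during copying.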

renan-campos commented 8 years ago

Yes, that is what you would need to change.
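One detail of that expression is worth knowing before adapting it: inside the character class, '-_ is read as a range from the apostrophe to the underscore, so it also admits digits, uppercase letters, and punctuation such as the period. The sketch below shows this and one possible Unicode-aware alternative for Python 3. The names OLD_TOKEN, LETTER, and TOKEN are hypothetical, and the alternative is an assumption about what a replacement could look like, not the repository's code.

```python
import re

# The expression quoted in the thread.
OLD_TOKEN = re.compile(r"(?:[a-z][a-z'-_]+[a-z])")

# The range '-_ includes ".", so two words joined by a period form one token.
print(OLD_TOKEN.findall("good.bad"))

# In Python 3, [^\W\d_] matches any Unicode letter: a word character that is
# neither a digit nor an underscore.
LETTER = r"[^\W\d_]"

# A letter, then one or more letters, apostrophes or hyphens, then a letter.
TOKEN = re.compile(rf"{LETTER}(?:{LETTER}|['\-])+{LETTER}")

print(TOKEN.findall("c'est tr\u00e8s-bien. ok"))
```

Like the original, this pattern requires at least three characters per token. Scripts that rely on combining marks may need further handling, because those marks are not matched by the letter class used here.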

vinayakumarr commented 8 years ago

I think your code will not work for the above case. Am I correct?

renan-campos commented 8 years ago

I'm not sure, I haven't dealt with different non-ASCII alphabets before.

The data error shown has to do with the file names: they have to be of the form _.txt or at least .txt with no non-numeric characters.

vinayakumarr commented 8 years ago

OK, thanks. I will try, and once I am able to run the code successfully I will share it with you.