Open vinayakumarr opened 8 years ago
In the original code, we performed 40 batches with 600 files in each (300 pos/300 neg). So your code fails on batch 2 because it runs out of files to train on. I just updated the code to have variables to tune the development set size, number of batches, and batch size on line 46. Change these variables so that DEV_SIZE + BATCH_SIZE*BATCHES = len(train)
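The constraint above can be checked before training starts. This is a minimal sketch, not code from the repository: `check_sizes` is a hypothetical helper, the numbers are illustrative, and it is written for Python 3.

```python
def check_sizes(n_train, dev_size, batches, batch_size):
    """Return how many training files are left over.

    A negative value means the batches will run out of files.
    """
    needed = dev_size + batches * batch_size
    return n_train - needed


# Illustrative numbers: 1000 training files, dev set of 500, 10 batches of 50.
leftover = check_sizes(n_train=1000, dev_size=500, batches=10, batch_size=50)
if leftover < 0:
    print("Not enough files: short by", -leftover)
elif leftover > 0:
    print("Unused training files:", leftover)
else:
    print("Sizes match the training set exactly.")
```

A result of zero means every training file is used exactly once.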
500+(50*10)=len(train). It works successfully now. The test data set has 250 files, and every one of them is classified as 0. How can I see the F-score, precision, and accuracy? I learned that batch files are created in the tmp folder and contain this information, but every file shows accuracy, precision, and recall as 1. Do you know where the problem is, or whether the results are correct?
It is odd that they all show up as 1. What does it say for TP, TN, FP, and FN? predict.py has the code for the metrics.
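For reference, here is how the four counts relate to the reported metrics. This is a sketch using the standard definitions, written for Python 3; it is not the code in predict.py, and `metrics` is a hypothetical name.

```python
def metrics(tp, tn, fp, fn):
    """Compute precision, recall, F1 and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    return precision, recall, f1, accuracy


# With no false positives and no false negatives, every metric is 1.0.
print(metrics(tp=10, tn=10, fp=0, fn=0))
```

So if FP and FN are both zero, all four values come out as exactly 1.0, whatever the sizes of TP and TN.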
The FP.txt and FN.txt files don't contain anything, but TP.txt and TN.txt contain 1001 lines (the names of the files). You can see part of TP.txt and TN.txt in the images attached below.
[Attached images: TP, TN]
predict.py shows the result below.
pos (train) contains 1002 files, neg (train) contains 1002, and test contains 252 files.
Are the TN and TP pictures right? TP should display positive, and TN should show negative. It's also odd that it lists 2002 files instead of 2004 if there are 1002 files in each directory. Are there any batch_#.txt files in your tmp directory (such as batch_0.txt)?
I made a mistake in my previous post: pos (train) contains 1001 files, neg (train) contains 1001, and test contains 251 files.
Inside the src/tmp folder there are 50 batch_*.txt files.
Each file has precision, recall, F1, and accuracy. It shows the total number of test files as 1000, but test has only 251 files.
What are these batch files?
How can I correct this?
So the batch files are calculating the metrics on the development set during each batch iteration.
The reason it says 1000 files is because your dev set was specified as 500, and there was a bug that took 500 from positive and negative instead of 500 total.
I just updated the code to fix this, so now if you specify a dev size of 500, it will actually be 500.
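To illustrate the intended behaviour, here is a minimal Python 3 sketch of a development split that holds out `dev_size` files in total, half from each class. It is an assumption about the general approach, not the actual code in train.py; `split_dev` and its parameters are hypothetical names.

```python
import random


def split_dev(pos_files, neg_files, dev_size, seed=0):
    """Hold out dev_size files in total, half from each class.

    If dev_size is odd, one file fewer is held out.
    """
    rng = random.Random(seed)
    pos, neg = list(pos_files), list(neg_files)
    rng.shuffle(pos)
    rng.shuffle(neg)
    half = dev_size // 2
    dev = pos[:half] + neg[:half]
    train = pos[half:] + neg[half:]
    return dev, train


# Illustrative file IDs: 1000 positive and 1000 negative.
dev, train = split_dev(range(1000), range(1000, 2000), dev_size=500)
print(len(dev), len(train))
```

Taking `dev_size` from each class instead of `dev_size // 2` would double the dev set, which matches the 1000-file count described above.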
Running predict by itself will do the calculations on the entire training set. To classify the test set, run test.py. This will generate a csv file of ID,LABEL.
Oops, just completed the push now. Try pulling, changing the SIZE variables and running again.
I have 1000 files in train/pos and 1000 in train/neg, so I used
DEV_SIZE = 1000 # Split evenly between pos/neg
BATCHES = 20
BATCH_SIZE = 50
Is it correct?
Yes.
Now I get the previously solved error again.
That's odd. On line 161 of train.py, add the following just to make sure the script sees 2000 files:
print len(training_set) ; exit()
Could you show batch_0.txt?
shows 1998
That would explain the pop from empty list error, and that means train.py only sees 999 files in each directory (pos/neg).
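One way to confirm what the script actually sees is to count the files in each directory independently of train.py. The sketch below is a Python 3 illustration; `count_files` is a hypothetical helper and the directory paths in the usage comment are assumptions about the layout. It also flags empty files, which is an extra check of my own rather than something train.py is known to do.

```python
import os


def count_files(directory):
    """Return (number of regular files, names of empty or whitespace-only files)."""
    names = sorted(
        name for name in os.listdir(directory)
        if os.path.isfile(os.path.join(directory, name))
    )
    empty = []
    for name in names:
        with open(os.path.join(directory, name), "rb") as handle:
            if not handle.read().strip():
                empty.append(name)
    return len(names), empty


# Hypothetical usage; adjust the paths to your own data layout:
# for path in ("data/train/pos", "data/train/neg"):
#     total, empty = count_files(path)
#     print(path, total, "files,", len(empty), "empty")
```

If the totals come out lower than expected, the size variables need to be lowered to match, or the missing files restored.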
When I change it to
DEV_SIZE = 998 # Split evenly between pos/neg
BATCHES = 20
BATCH_SIZE = 50
it works fine, but there is no improvement in the results. The result.csv file contains 0 for all values. Also, the batch files show accuracy, precision, and recall all as 1.
Is this because the training dataset is too small?
Could be. Having accuracy, precision and recall as 1.0 for all of the files is really peculiar. Is the training data you're using publicly available? I'd like to give it a try.
I am using your data only. I took a subset of the training and testing data and am trying it out. I am not using my own dataset.
Problem solved. I am able to get the exact results. I have one more question: I am using a data set in another language. Could you please tell me where we need to change the code to make it work and to increase the accuracy?
That expression captures words that contain apostrophes and dashes. How did you fix the issue?
I didn't make any change to the code, but I had not copied your data set to the data folder properly. The train folder had files, but when I checked, they did not contain any text. Your code works fine. I have another question: I want to use it for another language, so could you please tell me where I need to change the code? I think the only thing I need to change is the following regex. Am I correct? (?:[a-z][a-z'-_]+[a-z])
Yes that is what you would need to change.
I think your code will not work for the above case. Am I correct?
I'm not sure, I haven't dealt with different non-ASCII alphabets before.
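As a starting point for other alphabets, here is a hedged sketch of a Unicode-aware variant of that expression in Python 3, where `str` patterns match Unicode by default. It is untested against the repository and the names `LETTER`, `WORD_RE` and `tokenize` are hypothetical. Note that inside a character class `'-_` is read as a range from `'` to `_`, which also admits digits and uppercase letters, so the sketch lists the apostrophe and hyphen explicitly.

```python
import re

# [^\W\d_] matches whatever \w matches except digits and underscore,
# which for a str pattern in Python 3 means Unicode letters.
LETTER = r"[^\W\d_]"

# A letter, then one or more letters, apostrophes or hyphens, then a letter.
WORD_RE = re.compile(r"{0}(?:{0}|['-])+{0}".format(LETTER))


def tokenize(text):
    """Return lowercase words of three or more characters, allowing inner ' and -."""
    return WORD_RE.findall(text.lower())


print(tokenize("Don't stop über-cool 42"))
```

Scripts that rely on combining marks may not be fully matched by `\w` in the standard `re` module, so this would need checking against the target language before relying on it.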
The data error shown has to do with the file name, they have to be of the form
OK, thanks. I will try, and once I am able to run the code successfully I will share it with you.
Due to a memory problem, I used only 1000 files each in train pos and neg. When I execute python src/train.py, it generates the following error after some time. How can I solve this? Please see the attached file.