sancha / jrae

I re-implemented a semi-supervised recursive autoencoder in Java. I think it is a pretty nice technique. Check it out! Or fork it.
http://www.socher.org/index.php/Main/Semi-SupervisedRecursiveAutoencodersForPredictingSentimentDistributions

Question about an exception #3

Open ahaowei opened 11 years ago

ahaowei commented 11 years ago

Hi, nice job. I'm also working on sentiment analysis and wanted to see how your model performs on our dataset. I tried to run it with the default parameters, but I got the exception "LAPACK Java.raxpy: Parameters for x aren't valid! (n = 50, dx.length = 207050, dxIdx = 207050, incx = 1)". I've searched online but don't have a clue. Do you know how to fix it? I'm running it on Win 7 64-bit. Thanks! Z

sancha commented 11 years ago

Please make sure you are using the rc3 tag [https://github.com/sancha/jrae/tree/rc3]; the latest version of the code may have some unresolved issues. Switching to the tag should resolve your problem.

ahaowei commented 11 years ago

Thanks, I've tried that, but it still throws the exception at line 26 of RAEFeature.java (DoubleMatrix L = Theta.We.getColumns(WordIndices);). It also gets the value of L fine the first time; only after that does it throw the exception. Do you have any suggestions? Thanks.

sancha commented 11 years ago

This means that your data is changing between iterations, which should never happen. Are you sure you are using the same options as in run.sh?
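If the options do match, one way to narrow it down is to check the word indices against the vocabulary size right before the failing call. This is only a minimal sketch, not code from the repository; it reuses the Theta.We and WordIndices names from the line you quoted, and the out-of-range hypothesis is just one possible explanation of the dxIdx = dx.length values in your error message:

```java
import org.jblas.DoubleMatrix;

// Hypothetical sanity check before the failing line in RAEFeature.java.
// A word index equal to Theta.We.columns would make jblas slice past the end
// of the underlying data array, which would match the reported
// "dxIdx = 207050, dx.length = 207050" failure.
for (int idx : WordIndices) {
    if (idx < 0 || idx >= Theta.We.columns) {
        throw new IllegalStateException(
            "Word index " + idx + " out of range (vocabulary size " + Theta.We.columns + ")");
    }
}
DoubleMatrix L = Theta.We.getColumns(WordIndices);
```

If that check ever fires on the second pass, it would point to the vocabulary or the cached word indices being rebuilt inconsistently between iterations.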

ahaowei commented 11 years ago

Yes. That's weird. Here is what I used:

-DataDir data/mov -MaxIterations 20 -ModelFile data/mov/tunedTheta.rae -ClassifierFile data/mov/Softmax.clf -NumCores 3 -TrainModel True -ProbabilitiesOutputFile data/mov/prob.out -TreeDumpDir data/mov/trees

I guess it's probably because I'm running it on Windows. The good news is that I can run the Matlab code 'codeDataMoviesEMNLP'. I think it can be used for a different dataset, right? I will try to run it on our dataset. Thanks for your help!

sancha commented 11 years ago

Ah, so you are seeing this issue even on the dataset released with this tag? In that case, I will try to reproduce it on my machine and fix it. I think you might need to do some massaging of the data to run the Matlab scripts on it.

ahaowei commented 11 years ago

Thanks. I had done something wrong with the RC3 tag; it works now. One question about the training accuracy. It outputs this:

[2761.0, 1989.0; 2037.0, 2808.0]
Train Accuracy : { Precision : 0.5804069785174717 Recall : 0.5804148606811146 Accuracy : 0.5804064616988015 F1 Score : 0.5804109195725325 }
Classifier trained. The model file is saved in data/mov/Softmax.clf
Dumping complete
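For what it's worth, the reported accuracy is at least consistent with the confusion matrix above. A quick check (not part of jrae; it assumes the 2x2 block is a confusion matrix with true labels on the rows and predictions on the columns):

```java
// Quick consistency check of the reported numbers.
public class AccuracyCheck {
    public static void main(String[] args) {
        double[][] cm = { {2761.0, 1989.0}, {2037.0, 2808.0} };
        double correct = cm[0][0] + cm[1][1];         // 5569 examples on the diagonal
        double total = correct + cm[0][1] + cm[1][0]; // 9595 examples in total
        System.out.println(correct / total);          // ~0.5804, matching the output above
    }
}
```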

I'm not sure whether this is right. The Matlab version outputs acc_train = 0.9024 and acc_test = 0.7852. Do they use the same dataset?

Thanks Z

sancha commented 11 years ago

That certainly looks wrong. On the movies dataset, the java version gets about the same scores as the original version. I will look into it.

ahaowei commented 11 years ago

I see. Thanks.

sancha commented 11 years ago

I get the following performance from my run of rc3:

Train Accuracy : { Precision : 0.7665694039182993 Recall : 0.766581261853936 Accuracy : 0.7665694039182993 F1 Score : 0.7665753328402609 }

Test Accuracy : { Precision : 0.7420262664165103 Recall : 0.7420680185889312 Accuracy : 0.7420262664165104 F1 Score : 0.7420471419154117 }

Did you run run.sh with no changes to it? The testing command is buggy and I need to fix it, but the train part works fine.

jiangfeng1124 commented 11 years ago

I encountered a "FileNotFoundException" indicating "Too many open files" when I set NumCores to 6, and I found that some files were missing when dumping trees. I suspect this problem is caused by the file-closing logic in the "TreeDumpThread" function (in RAEFeatureExtractor.java). Maybe we need a "finally" block for closing files, to ensure all files are properly closed... Hope it is helpful :) Something along these lines, as sketched below.
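A minimal sketch of the suggestion, not the actual TreeDumpThread code; the dumpTree method and its arguments are made up for illustration:

```java
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;

public class TreeDumpSketch {
    // Hypothetical dump method: try-with-resources guarantees the writer is
    // closed even if write() throws, which avoids leaking file descriptors
    // and the "Too many open files" FileNotFoundException.
    static void dumpTree(String path, String treeString) throws IOException {
        try (BufferedWriter out = new BufferedWriter(new FileWriter(path))) {
            out.write(treeString);
        }
    }
}
```

An explicit try/finally with a null check on the writer would achieve the same thing; try-with-resources is just the more compact Java 7+ form.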

jiangfeng1124 commented 11 years ago

I am not sure whether the problem above really affects the resulting performance. When I train for 50 iterations, I get the following performance:

Train Accuracy : { Precision : 0.7271779908295123 Recall : 0.7275199110701512 Accuracy : 0.7271779908295123 F1 Score : 0.7273489107664195 }

Test Accuracy : { Precision : 0.7026266416510318 Recall : 0.7034063604240283 Accuracy : 0.702626641651032 F1 Score : 0.7030162848401279 }

I wonder whether the performance would be better with more iterations (such as the 70 used in the Matlab version). I will test it...

jiangfeng1124 commented 11 years ago

When MaxIterations is set to 70, I get the following performance:

Train Accuracy : { Precision : 0.7643809920800333 Recall : 0.7643835761037983 Accuracy : 0.7643809920800333 F1 Score : 0.7643822840897319 }

Test Accuracy : { Precision : 0.7326454033771107 Recall : 0.732711754598462 Accuracy : 0.7326454033771107 F1 Score : 0.7326785774855982 }

This is about 3 points higher than the previous run.

I also ran the Matlab version and got a very high accuracy on the training set, 0.9024, with 0.7852 on the test set. By comparison, the accuracy on the training set here looks a little confusing. Do you have any suggestions?

sancha commented 11 years ago

That is a very useful comparison. Did the Matlab version also run for 70 iterations? The train-validation split may not be identical, and the initialization of the features could also account for some noise in the final metrics. But that's a large gap that I don't have an explanation for. I had unit-tested intermediate steps to match against the Matlab code, so if the gap is reproducible, it must be due to the parallelization.

Thanks for fixing the file dumping issue! Could you please make the change and send me a pull request?

pnvphuong commented 10 years ago

Hi, I got the same ~50% accuracy as ahaowei when using RC3. The reason is that I had set -MaxIterations 20; if you raise it to 70, accuracy on the train set will be around 70%. Cheers

fera0013 commented 9 years ago

I ran into the exception initially mentioned in this issue (Java.raxpy: Parameters for x aren't valid!) right after changing my workspace settings for text file encoding from cp1252 to UTF-8.