Closed: thirumalrao closed this issue 8 years ago.
You first need to parse all doc strings into tokens and use the vectorize function to convert the tokens into a csr_matrix and a vocabulary. Run cross-validation on the csr_matrix and pick the combination with the highest accuracy (among all cross-validations). Finally, fit the final classifier with that combination.
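For reference, a minimal sketch of the cross-validation step that produces the per-combination accuracy, assuming `X` is the csr_matrix returned by the assignment's vectorize function and `labels` is a NumPy array (the fold logic here uses scikit-learn's `KFold`; the actual assignment code may differ):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

def cross_validation_accuracy(clf, X, labels, k=5):
    """Average test-fold accuracy of clf over k folds of the
    already-vectorized training data."""
    accuracies = []
    for train_idx, test_idx in KFold(n_splits=k).split(X):
        clf.fit(X[train_idx], labels[train_idx])
        predictions = clf.predict(X[test_idx])
        accuracies.append(np.mean(predictions == labels[test_idx]))
    return np.mean(accuracies)

# eval_all_combinations would call something like this once per
# (tokenizer setting, feature functions, min_freq) combination and
# keep the combination with the highest average accuracy.
```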
No cross-validation should be done in this function, as mentioned in the comments, so there is no need for KFold:

"""
Using the best setting from eval_all_combinations, re-vectorize all the training data and fit a LogisticRegression classifier to all training data. (i.e., no cross-validation done here)
"""

This method takes the parameters:
docs..........List of training document strings.
labels........The true labels for each training document (0 or 1)
best_result...Element of eval_all_combinations with highest accuracy
So your goal is to use your best result and fit a logistic regression classifier on it (this classifier will then be passed to one of the next functions, where it will be used to predict on the test data). Since you have raw docs, you will have to tokenize again based on your best result, then call vectorize, and fit a logistic regression model on the csr_matrix that vectorize returns.
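In code, that answer looks roughly like the sketch below. The `tokenize` and `vectorize` calls stand in for the assignment's own helpers, and the `best_result` keys (`'punct'`, `'features'`, `'min_freq'`) are placeholders for whatever eval_all_combinations actually stores, so treat this as the shape of the solution, not a drop-in implementation:

```python
from sklearn.linear_model import LogisticRegression

def fit_best_classifier(docs, labels, best_result):
    """Re-vectorize all training docs with the best settings and fit a
    single LogisticRegression on all of them (no cross-validation here)."""
    # tokenize/vectorize are the assignment's helper functions; the dict
    # keys below are placeholders for whatever best_result contains.
    tokens_list = [tokenize(doc, best_result['punct']) for doc in docs]
    X, vocab = vectorize(tokens_list,
                         best_result['features'],
                         best_result['min_freq'])
    clf = LogisticRegression()
    clf.fit(X, labels)
    # Return the vocabulary too, since the test data must later be
    # vectorized with this same vocab (see the comments further down).
    return clf, vocab
```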
@saiarjuntanguturi You are right.
@AndrewLu1992 @saiarjuntanguturi - thanks. I believe I have done the same thing. My best result matches the logText. However, my top misclassified docs are way off: the probabilities range from 0.95 to 0.73, so I am wondering what else could be wrong. Thanks for confirming my understanding.
Check whether you followed the ranking logic described in #311 and #309.
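I can't restate #311/#309 here, but a common ranking for this step (which may or may not match those threads exactly) is: take the test documents the classifier got wrong and sort them by the probability it assigned to its incorrect prediction, highest first. A sketch, assuming `test_labels` is a NumPy array of 0/1 labels and `X_test` was vectorized with the training vocabulary:

```python
import numpy as np

def print_top_misclassified(test_docs, test_labels, X_test, clf, n=5):
    """Print the n misclassified test docs the classifier was most
    confident about (highest probability for the wrong predicted label)."""
    predictions = clf.predict(X_test)
    probabilities = clf.predict_proba(X_test)
    wrong = np.where(predictions != test_labels)[0]
    # Probability assigned to the (incorrect) predicted class; with 0/1
    # labels the label value doubles as the predict_proba column index.
    confidence = probabilities[wrong, predictions[wrong]]
    for idx in wrong[np.argsort(-confidence)][:n]:
        print('truth=%d predicted=%d proba=%.6f' %
              (test_labels[idx], predictions[idx],
               probabilities[idx, predictions[idx]]))
        print(test_docs[idx])
        print()
```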
The culprit was in the vectorize method: I had missed passing the vocab when vectorizing the test data. Thanks.
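For anyone hitting the same thing: the vocabulary has to be built from the training data and then reused when vectorizing the test data, so both matrices have the same columns in the same order. The assignment's vectorize apparently takes the vocab as a parameter; the same idea with scikit-learn's CountVectorizer, just for illustration, looks like this:

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["good movie", "bad movie"]
test_docs = ["good acting", "really bad plot"]

# Build the vocabulary from the training documents only.
train_vectorizer = CountVectorizer()
X_train = train_vectorizer.fit_transform(train_docs)

# Reuse that vocabulary for the test documents so the test matrix has
# the same columns in the same order as the training matrix.
test_vectorizer = CountVectorizer(vocabulary=train_vectorizer.vocabulary_)
X_test = test_vectorizer.transform(test_docs)

print(X_train.shape[1] == X_test.shape[1])  # True
```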
In this method, we use the features of the result that had the highest accuracy and fit a classifier to the training data. The training data is given as doc strings. Can we use KFold? If not, how do we convert the docs to a format such that clf.fit(X[training_data], labels[training_data]) works?