Closed: thirumalrao closed this issue 8 years ago.
You first need to parse all doc strings into tokens and use the vectorize function to convert the tokens into a csr_matrix and a vocabulary. Run cross-validation on the csr_matrix and pick the combination with the highest accuracy (among all cross-validations). Finally, fit the final classifier with that combination.
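For reference, a minimal sketch of the cross-validation step that produces the per-combination accuracy, assuming `X` is the csr_matrix returned by the assignment's vectorize function and `labels` is a NumPy array (the fold logic here uses scikit-learn's `KFold`; the actual assignment code may differ):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

def cross_validation_accuracy(clf, X, labels, k=5):
    """Average test-fold accuracy of clf over k folds of the
    already-vectorized training data."""
    accuracies = []
    for train_idx, test_idx in KFold(n_splits=k).split(X):
        clf.fit(X[train_idx], labels[train_idx])
        predictions = clf.predict(X[test_idx])
        accuracies.append(np.mean(predictions == labels[test_idx]))
    return np.mean(accuracies)

# eval_all_combinations would call something like this once per
# (tokenizer setting, feature functions, min_freq) combination and
# keep the combination with the highest average accuracy.
```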
No cross-validation should be done in this function, as mentioned in the comments, so there is no need for KFold:

"""
Using the best setting from eval_all_combinations, re-vectorize all the training data and fit a LogisticRegression classifier to all training data. (i.e., no cross-validation done here)
"""

This method takes the parameters:
docs..........List of training document strings.
labels........The true labels for each training document (0 or 1)
best_result...Element of eval_all_combinations with highest accuracy
So your goal is to use your best result and fit a logistic regression classifier on it (this classifier will then be passed to one of the next functions, where it will be used to predict on the test data). Since you have raw docs, you will have to tokenize again based on your best result, then call vectorize, and fit a logistic regression model on the csr_matrix that vectorize returns.
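In code, that answer looks roughly like the sketch below. The `tokenize` and `vectorize` calls stand in for the assignment's own helpers, and the `best_result` keys (`'punct'`, `'features'`, `'min_freq'`) are placeholders for whatever eval_all_combinations actually stores, so treat this as the shape of the solution, not a drop-in implementation:

```python
from sklearn.linear_model import LogisticRegression

def fit_best_classifier(docs, labels, best_result):
    """Re-vectorize all training docs with the best settings and fit a
    single LogisticRegression on all of them (no cross-validation here)."""
    # tokenize/vectorize are the assignment's helper functions; the dict
    # keys below are placeholders for whatever best_result contains.
    tokens_list = [tokenize(doc, best_result['punct']) for doc in docs]
    X, vocab = vectorize(tokens_list,
                         best_result['features'],
                         best_result['min_freq'])
    clf = LogisticRegression()
    clf.fit(X, labels)
    # Return the vocabulary too, since the test data must later be
    # vectorized with this same vocab (see the comments further down).
    return clf, vocab
```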
@saiarjuntanguturi You are right.
@AndrewLu1992 @saiarjuntanguturi - thanks. I believe I have done the same thing. My best result matches the logText. However, my top misclassified docs are way off: the probabilities range from 0.95 to 0.73, so I am wondering what else could be wrong. Thanks for confirming my understanding.
Check whether you followed the ranking logic described in #311 and #309.
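I can't restate #311/#309 here, but a common ranking for this step (which may or may not match those threads exactly) is: take the test documents the classifier got wrong and sort them by the probability it assigned to its incorrect prediction, highest first. A sketch, assuming `test_labels` is a NumPy array of 0/1 labels and `X_test` was vectorized with the training vocabulary:

```python
import numpy as np

def print_top_misclassified(test_docs, test_labels, X_test, clf, n=5):
    """Print the n misclassified test docs the classifier was most
    confident about (highest probability for the wrong predicted label)."""
    predictions = clf.predict(X_test)
    probabilities = clf.predict_proba(X_test)
    wrong = np.where(predictions != test_labels)[0]
    # Probability assigned to the (incorrect) predicted class; with 0/1
    # labels the label value doubles as the predict_proba column index.
    confidence = probabilities[wrong, predictions[wrong]]
    for idx in wrong[np.argsort(-confidence)][:n]:
        print('truth=%d predicted=%d proba=%.6f' %
              (test_labels[idx], predictions[idx],
               probabilities[idx, predictions[idx]]))
        print(test_docs[idx])
        print()
```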
The culprit was in the vectorize method: I had missed passing the vocab when vectorizing the test data. Thanks.
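For anyone hitting the same thing: the vocabulary has to be built from the training data and then reused when vectorizing the test data, so both matrices have the same columns in the same order. The assignment's vectorize apparently takes the vocab as a parameter; the same idea with scikit-learn's CountVectorizer, just for illustration, looks like this:

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["good movie", "bad movie"]
test_docs = ["good acting", "really bad plot"]

# Build the vocabulary from the training documents only.
train_vectorizer = CountVectorizer()
X_train = train_vectorizer.fit_transform(train_docs)

# Reuse that vocabulary for the test documents so the test matrix has
# the same columns in the same order as the training matrix.
test_vectorizer = CountVectorizer(vocabulary=train_vectorizer.vocabulary_)
X_test = test_vectorizer.transform(test_docs)

print(X_train.shape[1] == X_test.shape[1])  # True
```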
In this method, we use the features of the result that had the highest accuracy and fit a classifier to the training data. The training data is given as doc strings. Can we use KFold? If not, how do we convert the docs to a format such that clf.fit(X[training_data], labels[training_data]) works?