trinhan / crc-ihc-classification


Questions about model evaluation parts #4

Open · SeYoungKim opened this issue 6 years ago

SeYoungKim commented 6 years ago

Hello

I'm trying to understand the code, and I have some questions about the model evaluation parts.

About evaluating the RF classifier with the training data:

As I understand it, the Random Forest classifier is trained on the whole AMC dataset, so the AMC dataset is used as the training set and the other datasets (LUMC, CAIRO, CAIRO2) are used as validation sets.

So when the classifier was trained on the AMC dataset, its OOB error rate was 19.83%.

And in the paper, Fig. 1C says its accuracy is 87.14%.

[screenshot of Fig. 1C]

So I looked for that part in [4.1.1 Training the classifier, Supplementary Info...]. There was a code snippet for drawing the table (but I couldn't find it in the current repo... is there a different source file that matches the [Supplementary Info...]?)

I guess "the result of the gold-standard transcriptome-based classifier" means the GE result... isn't that the Class column in AMCclinical.csv?

[screenshot]

It seems like the labels used for training were also used for evaluation...

Is it correct that 87.14% is the result of evaluating on the training data?

If so, isn't the OOB value the only one that is really meaningful?
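
To make my question concrete, here is a small R sketch of the two kinds of tables I have in mind (this is not your actual code; I'm assuming a training data frame `amc` that holds the IHC features plus the gold-standard `Class` column):

```r
library(randomForest)

# Hypothetical sketch, not the repo's code. "amc" is assumed to be the AMC
# training data frame: IHC features plus a factor column "Class" with the
# gold-standard (transcriptome-based) labels.
set.seed(1)
rf <- randomForest(Class ~ ., data = amc, ntree = 500)

# (1) OOB evaluation: each sample is predicted only by trees that never saw it.
oob_pred <- predict(rf)                      # no newdata -> OOB predictions
table(Predicted = oob_pred, GoldStandard = amc$Class)
mean(oob_pred == amc$Class)                  # should be about 1 - OOB error (~80%)

# (2) Resubstitution: the training samples pushed back through the full forest.
res_pred <- predict(rf, newdata = amc)
table(Predicted = res_pred, GoldStandard = amc$Class)
mean(res_pred == amc$Class)                  # typically close to 100% for RF
```

Is the 87.14% table built the way (2) is, i.e. by re-predicting the training samples?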

Thank you

SeYoungKim commented 6 years ago

Isn't the accuracy of the RF on the training set (per sample, not per patient) almost 100%?

I tried this again in a different environment with the same data and got similar results...

About 20% OOB error and almost 100% accuracy on the training data!

[screenshot]
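
And to be explicit about the per-sample vs per-patient part, this toy R sketch (made-up data, nothing from the repo) is how I would collapse core-level predictions into patient-level calls by majority vote:

```r
# Toy, self-contained sketch: per-sample vs per-patient accuracy when every
# patient contributes 3 cores and the patient-level call is the majority vote
# over that patient's per-sample predictions. Class names "A"/"B"/"C" and all
# counts are made up.
set.seed(1)
patient <- rep(1:20, each = 3)                               # 20 patients x 3 cores
truth   <- rep(sample(c("A", "B", "C"), 20, replace = TRUE), each = 3)
pred    <- ifelse(runif(60) < 0.8, truth, "B")               # ~80% per-sample agreement

majority <- function(x) names(sort(table(x), decreasing = TRUE))[1]

pred_patient  <- tapply(pred,  patient, majority)
truth_patient <- tapply(truth, patient, majority)

mean(pred == truth)                          # per-sample accuracy
mean(pred_patient == truth_patient)          # per-patient accuracy (majority vote)
```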

SeYoungKim commented 6 years ago

I have one more question about evaluation...

There is a histogram like the one below in [Supplementary Information...], and I wonder if this graph shows the test-set accuracies from the k-fold cross-validation...

[screenshot]

I'm reading this...

[screenshots]

As I understand it, you're performing k-fold cross-validation...

and this table (rfpred) is the result from a single evaluation (70% accuracy).

[screenshot]
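
Just to check that I'm reading the 70% correctly: I take it to be the usual correct/total computed from that single held-out table, along these lines (the counts below are made up, chosen only so the number comes out to 0.7; they are not the rfpred values):

```r
# Made-up confusion-table counts, chosen so the overall accuracy equals 0.7.
tab <- matrix(c(12,  3,  2,
                 2, 14,  4,
                 1,  3,  9),
              nrow = 3, byrow = TRUE,
              dimnames = list(Predicted = c("A", "B", "C"),
                              Truth     = c("A", "B", "C")))
sum(diag(tab)) / sum(tab)   # (12 + 14 + 9) / 50 = 0.7
```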

And there is a picture of a distribution like the one below...

[screenshot]

My question is: is this the OOB distribution, or the distribution from the k-fold cross-validation (the 200 repetitions)?

The code that draws the histogram (Core classification accuracy) uses the result from ClassifyPrediction, and I couldn't find that in the current repo...

There are two lines: "The accuracy for the above is 0.7. This is repeated 200 times" and "Using this method, the OOB distribution and prediction accuracy is as follows". Here I can't make out which one the histogram is for...

[screenshot]
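
In case it helps to show what I mean by "which one", this is how I imagine the loop behind that section (a self-contained R sketch with iris as stand-in data; since I can't find ClassifyPrediction in the repo, this is only my guess at its structure, not the actual code):

```r
library(randomForest)

# Sketch of my reading: each of the 200 repeats yields BOTH an OOB error from
# the 70% training part AND a prediction accuracy on the held-out 30%, so the
# two quoted sentences could describe two different histograms.
set.seed(1)
oob_err <- test_acc <- numeric(200)
for (i in 1:200) {
  idx <- sample(nrow(iris), round(0.7 * nrow(iris)))        # random 70/30 split
  rf  <- randomForest(Species ~ ., data = iris[idx, ])
  oob_err[i]  <- rf$err.rate[rf$ntree, "OOB"]               # OOB error on the 70%
  test_acc[i] <- mean(predict(rf, iris[-idx, ]) == iris$Species[-idx])
}

hist(1 - oob_err, main = "OOB accuracy over 200 repeats")
hist(test_acc,    main = "Held-out prediction accuracy over 200 repeats")
```

Which of these two distributions is the one shown in the "Core classification accuracy" histogram?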