thuhoainguyen / kits23

The official repository of the 2023 Kidney Tumor Segmentation Challenge (KiTS23)
MIT License

Discussion with Jacopo #16

Open thuhoainguyen opened 3 months ago

thuhoainguyen commented 3 months ago

@anhtuduong Below is the recent discussion between me and Jacopo; I'll keep it updated!

Thu: Hi! This is the fold 0 training on dataset_2 (with histology), but the Dice score is very low, ~0.35: [image] I think this is because of the class imbalance: [image]

I have an idea: train only on the classes that have more than 15 cases, so there will be less bias and the training should reach a higher Dice.
After the reorganization, the classes will look like this:

[image] What do you think?
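For reference, a minimal sketch of the "keep only classes with enough cases" regrouping Thu proposes. It assumes the histology labels live in a JSON metadata file as a list of case records with `case_id` and `tumor_histologic_subtype` fields; the file name and field names are assumptions based on the usual KiTS metadata layout, not taken from this thread.

```python
# Sketch: keep only histology classes with enough cases, map the rest to "other".
# File name and field names are assumptions; adjust to the actual dataset metadata.
import json
from collections import Counter

MIN_CASES = 15  # threshold discussed in the thread

with open("kits23.json") as f:          # path is an assumption
    cases = json.load(f)

histology = {c["case_id"]: c["tumor_histologic_subtype"] for c in cases}
counts = Counter(histology.values())
kept = {h for h, n in counts.items() if n >= MIN_CASES}

# Remap every case: rare subtypes collapse into a single "other" class.
relabelled = {cid: (h if h in kept else "other") for cid, h in histology.items()}

print("kept classes:", sorted(kept))
print("cases per class:", Counter(relabelled.values()))
```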

Jacopo: The short version is: go for it, it makes sense to only keep classes with at least 15 cases.

The long version is: it might not make too much sense to look at the final Dice score, as it is generally computed (I think in this case as well, if I remember correctly the code I read on Friday) as the average of the Dice scores for each class. This means that it is entirely possible that the model learned to segment the easier classes correctly (or close to it) and is doing an awful job on the harder ones (read: the less represented tumor forms), leading to a very low final Dice score. It is entirely possible that the model is already perfectly capable of discriminating between, let's say, clear cell RCC and oncocytomas even with the low Dice score we're seeing.

I think the relevant step, now, would be to verify whether that's actually the case or not. The most straightforward way of doing that is probably to build a confusion matrix, i.e. compute the percentage of voxels predicted in each of the tumor classes, something like the one below (the numbers are completely random and I only used five labels):

![image](https://github.com/thuhoainguyen/kits23/assets/165920750/c26b33eb-6417-4bb0-85aa-61070c9a57d1)

A slightly expanded explanation: for each case, count the voxels predicted in each lesion class (I think we can ignore the background, healthy tissue and cysts here) and divide them by their total number. Then average over the TRUE labels (i.e. average together all cases with the same histology class). You should end up with a table such as the one above, so we can learn whether there are specific types of lesions the model gets confused over, whether it just has an issue with the rarest ones, or whether it is not working at all.

Let me know if something is unclear, I'll try to get back to you as soon as possible.
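For reference, a minimal sketch of the per-case voxel-fraction confusion matrix Jacopo describes, assuming the predictions are saved as integer label volumes readable with nibabel. The label ids, class names and the shape of the `cases` input are placeholders, not the repository's actual conventions.

```python
# Sketch: fraction of predicted voxels per lesion class for each case,
# averaged over all cases sharing the same TRUE histology label.
import numpy as np
import nibabel as nib

# Placeholder label ids for the lesion classes; adjust to the real label map.
LESION_LABELS = {3: "clear_cell_rcc", 4: "chromophobe", 5: "oncocytoma",
                 6: "papillary", 7: "other"}

def voxel_fractions(pred_path):
    """Fraction of predicted lesion voxels falling in each lesion class."""
    pred = np.asarray(nib.load(pred_path).dataobj)
    counts = np.array([(pred == lab).sum() for lab in LESION_LABELS])
    total = counts.sum()
    return counts / total if total > 0 else counts.astype(float)

def confusion_matrix(cases):
    """cases: list of (prediction_path, true_histology_name) tuples."""
    rows = {name: [] for name in LESION_LABELS.values()}
    for path, true_label in cases:
        rows[true_label].append(voxel_fractions(path))
    # One row per true histology class, averaged over its cases.
    return {name: np.mean(fracs, axis=0) for name, fracs in rows.items() if fracs}
```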

thuhoainguyen commented 3 months ago

@anhtuduong Update about the discussion with J:

Thu: I've trained on dataset_3 (preprocessed with the SELECTED histology classes clear_cell_rcc, chromophobe, oncocytoma, papillary; the rest are grouped into "other"). Fold 0, 3d_lowres. Here are the results on the validation set: https://drive.google.com/file/d/1r0DUAsFcmBjNz3JHiD89H-biCe2Ie-th/view?usp=sharing

We can clearly see that the Dice for kidney, cyst and tumor is as high as when we train on the original dataset, which means that the model learned the first 3 classes perfectly. The result file even shows the predictions for every case in the validation set. The predictions for the histology classes are not very good. I wrote a script that extracts the info and generates a confusion matrix table:

[Image]

[Image]

Jacopo: Hi Thu! A few comments:

- Good that the Dice scores for the other classes are as before.
- I'm not sure what I'm seeing in those confusion matrices: I'm guessing you selected the most commonly predicted label and used that to label the entire case, in the testing dataset alone, is that correct?
- If that's the case, the resulting classification does indeed look rather bad, which is rather puzzling considering that the "other" label is not even the most common. My guess is that the "other" label has the highest variance, so it tends to become the default class for any lesion that the model doesn't learn to predict with high confidence.
- I can think of two quick fixes to evaluate whether the resulting model is still somewhat useful:
  1. Re-compute the confusion matrices while ignoring the "other" class completely (i.e. restrict the matrix to the 4 named labels and, if the most predicted class for a case is "other", select the second most predicted one instead).
  2. Compute the AUC (area under the curve) for all class pairs. This should be done, for each case, on the sum of the raw outputs of the network (i.e. the logits), not on the count of voxels, if possible. I'm rather sure there are already Python packages to do that; sorry I can't explain this further myself as I'm somewhat short on time right now.
- Following from the point above, I'm afraid I can't check your draft for the moment. I'm undergoing surgery tomorrow, so I won't be able to do that until Friday (unlikely) or Saturday (much more probable). I should have enough time over the weekend to go through the whole thing. Sorry about that.

If you have any other questions or doubts, please do ask, I'll try to answer to the best of my abilities as soon as I can (again, definitely not before Friday).
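For reference, a minimal sketch of the two quick fixes above. It assumes per-case predicted voxel counts and per-class summed logits are already available in simple Python data structures; those inputs, the class names and the scoring choice (difference of summed logits) are assumptions, not the repository's actual outputs.

```python
# Sketch of the two quick fixes:
#   (1) case-level label restricted to the four named histologies (ignore "other"),
#   (2) pairwise AUCs computed on per-case summed logits.
from itertools import combinations
from sklearn.metrics import roc_auc_score

NAMED = ["clear_cell_rcc", "chromophobe", "oncocytoma", "papillary"]

def case_label_ignoring_other(voxel_counts):
    """voxel_counts: dict class_name -> predicted voxel count for one case."""
    return max(NAMED, key=lambda c: voxel_counts.get(c, 0))

def pairwise_auc(cases):
    """cases: list of (true_label, logit_sums) where logit_sums is a dict
    class_name -> summed raw network output for that class over the case."""
    aucs = {}
    for a, b in combinations(NAMED, 2):
        subset = [(t, s) for t, s in cases if t in (a, b)]
        y_true = [1 if t == a else 0 for t, _ in subset]
        # Difference of summed logits as the discriminant score for class a vs b.
        scores = [s[a] - s[b] for _, s in subset]
        if len(set(y_true)) == 2:          # need both classes present
            aucs[(a, b)] = roc_auc_score(y_true, scores)
    return aucs
```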

thuhoainguyen commented 3 months ago

@anhtuduong Jacopo just checked my thesis report and gave me some comments below:

I read the whole thesis; here are some comments:

Besides the points above (which I would try to address, if you manage to), in general the thesis is well structured and well written, congrats!

If you want to save a few pages, as a very first thing I'd reformat a couple of lists that are currently taking up a lot of space: one in the intensity normalization section and one in "Extension of the dataset with Histology-Specific data". I think you can use a horizontal table for the first one and remove the second entirely, as it is difficult to understand and doesn't add much information.

The only other section that might need to be revised (if you find the time) is the one on related works: while the chosen works are relevant, it is unclear why you chose specifically those, as they reached mid-table positions in the KiTS21 challenge. Is there a reason to mention those and not, for instance, the winners?