Radiologist scores for annotated lesions

rcuocolo / PROSTATEx_masks

Lesion and prostate masks for the PROSTATEx training dataset, after a lesion-by-lesion quality check.

https://rcuocolo.github.io/PROSTATEx_masks/

Creative Commons Attribution 4.0 International


Radiologist scores for annotated lesions #5

Closed pritesh-mehta closed 3 years ago

pritesh-mehta commented 3 years ago

Hi again Renato,

Were PI-RADS v2 scores, or any other scores (blinded to Gleason score), assigned to the lesions during the contouring exercise? If so, it would be very helpful to have them released.

Many thanks, Pritesh

rcuocolo commented 3 years ago

We did not reassign PI-RADS scores during the annotation process. As PI-RADS is subject to inter-reader variability and the database was already collected retrospectively, doing so would make little sense: there would be no way to obtain a ground truth (i.e., biopsy) for a "new" PI-RADS 3+ lesion. It could also cause confusion for less experienced users or non-radiologists/urologists. Dichotomized PI-RADS v2 scores are, however, implicitly available from the original dataset. The original readers are from an experienced center, and all lesions with PI-RADS 3 or higher underwent biopsy (i.e., are in the PROSTATEx2 subset). The remaining lesions (the difference between the PROSTATEx and PROSTATEx2 datasets) are all PI-RADS = 2. This is certainly a limitation, as we know how many false positives are present but cannot be sure of the false negatives (the original authors state 5% based on previous experience at their center). Feel free to use this PI-RADS information in your data analysis or not; depending on your aim, it can be useful or detrimental.

pritesh-mehta commented 3 years ago

Thanks Renato. Understood re not reassigning PI-RADS scores.

My understanding of the difference between the PROSTATEx and PROSTATEx2 datasets is different from yours. I agree that PI-RADS = 2 lesions were not biopsied. The goal of the PROSTATEx challenge was to differentiate between CSPCa (GS 7+) lesions and non-CSPCa lesions (benign and GS 6). Since the goal of the PROSTATEx2 challenge is classification of lesions into Gleason grade groups (1,2,3,4,5), where 1=3+3, 2=3+4, 3=4+3, 4=Gleason 8, 5=Gleason 9-10, PROSTATEx2 is the same as PROSTATEx, but with benign lesions removed. I do not believe PI-RADS = 2 lesions feature in either the PROSTATEx or PROSTATEx2 datasets.
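For reference, the grade group mapping and the CSPCa threshold quoted above can be written out as a minimal Python sketch. The function names are hypothetical and do not come from the challenge materials; the sketch assumes a Gleason score of at least 6 (3+3).

```python
def grade_group(primary: int, secondary: int) -> int:
    """Map a Gleason pattern pair to grade groups 1-5 as listed above.

    Assumes primary + secondary >= 6 (i.e., at least GS 3+3).
    """
    total = primary + secondary
    if total == 6:
        return 1                          # 3+3
    if total == 7:
        return 2 if primary == 3 else 3   # 3+4 vs 4+3
    if total == 8:
        return 4                          # Gleason 8
    return 5                              # Gleason 9-10


def is_clinically_significant(primary: int, secondary: int) -> bool:
    """CSPCa as used in this thread: Gleason score 7 or higher."""
    return primary + secondary >= 7
```

Under this mapping, grade group 1 is the only biopsy-proven cancer group that is not clinically significant.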

rcuocolo commented 3 years ago

Unfortunately, I believe you misinterpreted the PROSTATEx dataset description. Not all lesions in the PROSTATEx dataset underwent biopsy, and the PROSTATEx2 dataset includes both clinically significant and non-significant lesions. In detail:

1) Regarding the first point, the "detailed description" on the PROSTATEx dataset wiki page specifies: "ClinSig – Identifier available in training set that identifies whether this is a clinically significant finding. Either the biopsy GleasonScore was 7 or higher. Findings with a PIRADS score 2 were not biopsied and are not considered clinically significant. In our center the occurrence of clinically significant cancer in PIRADS 2 lesions is less than 5%."

2) In the class csv file for the PROSTATEx2 dataset (and in the class csv file in our repository for both datasets), one can clearly see that GG 1 (i.e., GS 3+3, not clinically significant) lesions are present.

From these points and from checking lesion IDs across the two datasets, it can be seen that all biopsied lesions were included in the PROSTATEx2 subset, whether clinically significant or not. The remaining lesions, all not significant and not biopsied, must be PI-RADS = 2. Please let me know if you have evidence of a different distribution of the cases across the datasets.

pritesh-mehta commented 3 years ago

Thanks again Renato. I've looked over all the information and believe you are right in all you say. I've been working with this dataset for a long time, so you have cleared a longstanding misconception.

I do find it slightly weird that none of the lesions in PROSTATEx2 (PI-RADS 3+ as you say) were identified to be benign (BPH, prostatitis, etc.) following biopsy, but since Jelle Barentsz is highly experienced, those benign lesions may have all been correctly scored PI-RADS = 2. Would you agree? Just to be clear, when I say benign, I do not include GS 3+3 in that.

rcuocolo commented 3 years ago

After reviewing the images, all issues we had were exclusively with PI-RADS 2 lesions, and essentially due to misplaced markers or unclearly defined lesions. I do believe that the PI-RADS 2 scores were assigned correctly, even though a focal lesion is not always clearly identifiable at the original coordinates (e.g., diffuse signs of previous prostatitis in the PZ without a focal alteration). On the other hand, a certain number of false positives among PI-RADS 3+ lesions is also expected, and their number is not strange in relation to expected reader experience and mpMRI performance. So overall I agree that the PROSTATEx PI-RADS scores are in line with clinical practice and could be used as a variable, with the limitation of having only dichotomous information (i.e., PI-RADS 2 or 3+) rather than precise scores. I am happy to have cleared up your doubts; favoring open discussion on the dataset was one of the reasons for building this repository. If you agree, you can close the issue discussion, as I think it can be considered resolved.

pritesh-mehta commented 3 years ago

Thanks again Renato. Just a couple more things. I have calculated the lesion sensitivity and specificity for the PI-RADS 3+ threshold, for CS vs not CS classification, and find the specificity to be higher than expected. I calculate:

TP = 76 (any finding marked as 2,3,4 or 5 in PROSTATEx2)
TN = 218 (any PI-RADS = 2 lesion not included in PROSTATEx2)
FP = 36 (any finding marked as 1 in PROSTATEx2)
FN = 0 (since PI-RADS = 2 lesions were not biopsied, but this could be a non-zero low number had PI-RADS = 2 lesions been biopsied)

TP + TN + FP + FN = 330 lesions. Using these numbers, sensitivity = 100% and specificity = 86%. A specificity of 86% seems very high for the level of sensitivity achieved, compared with other papers that report sensitivity and specificity at the PI-RADS 3+ threshold for the CS vs not CS task. For example, the Litjens 2015 paper entitled "Clinical evaluation of a computer-aided diagnosis system for determining cancer aggressiveness in prostate MRI" reports a sensitivity of 99% and a specificity of 25.9%.

However, maybe we must assume the 5% occurrence of CS in PI-RADS = 2 lesions in the calculation. If we do, FN = 11 (218 * 5% = 10.9, rounded) and TN = 207 (218 - 11). This gives sensitivity = 87% and specificity = 85%. This feels more realistic to me than sensitivity = 100% and specificity = 86%.
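As a quick check of the arithmetic, the two scenarios can be reproduced with a short Python sketch. The counts are the ones given in this comment; the helper name `sens_spec` is only illustrative.

```python
def sens_spec(tp: int, tn: int, fp: int, fn: int) -> tuple[float, float]:
    """Return (sensitivity, specificity) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity


# Scenario A: no false negatives among the 218 lesions outside PROSTATEx2.
sens_a, spec_a = sens_spec(tp=76, tn=218, fp=36, fn=0)

# Scenario B: assume 5% of those 218 lesions are clinically significant.
outside_prostatex2 = 218
fn_b = round(outside_prostatex2 * 0.05)   # 10.9 rounds to 11
tn_b = outside_prostatex2 - fn_b          # 207
sens_b, spec_b = sens_spec(tp=76, tn=tn_b, fp=36, fn=fn_b)

print(f"A: sensitivity={sens_a:.1%}, specificity={spec_a:.1%}")
print(f"B: sensitivity={sens_b:.1%}, specificity={spec_b:.1%}")
# A: sensitivity=100.0%, specificity=85.8%
# B: sensitivity=87.4%, specificity=85.2%
```

Both scenarios rest on the assumption that every lesion outside PROSTATEx2 was scored PI-RADS = 2.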

Agree with these calculations?

The other thing that confuses me is the PI-RADS performance shown in the Litjens 2014 paper entitled "Computer-Aided Detection of Prostate Cancer in MRI". Here, as I'm sure you know, they use 346 patients (PROSTATEx train + test) for evaluation. For the task "High-grade cancer vs normal/benign", at the patient level, they show a PI-RADS 3+ sensitivity of approx. 100% and specificity of approx. 52%:

[Image: litjens_2014_roc]

They took the max lesion PI-RADS score as the patient score. For the HG vs normal/benign task, they do not consider a hit on low-grade tumor (GS 3+3) to be a false positive:

[Image: litjens_2014_low_grade_not_fp]

For a patient to be a false positive, they must have had one or more PI-RADS 3+ lesions, which on biopsy, were all found to be benign or GS 3+3. However, from our discussions above, we know that none of the biopsied lesions were benign (BPH, prostatitis etc., restating for clarity) and if hits on GS 3+3 are not considered as FP, then the specificity at the patient-level for PI-RADS 3+ should be 100% rather than 52% for the radiologist? I must be mistaken here as it is unlikely for such a large error to have been made, but I cannot see how.
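To make the patient-level rule concrete, here is a minimal Python sketch of the aggregation as I read it: the patient score is the maximum lesion PI-RADS score, and a patient counts as a false positive when that score is 3 or higher but no lesion is high-grade. The lesion records are entirely hypothetical (per-lesion PI-RADS scores are not released with the dataset), and the sketch may not match how the paper handled low-grade patients.

```python
from collections import defaultdict

# Hypothetical lesion records: (patient_id, pirads_score, is_high_grade).
lesions = [
    ("P1", 2, False),
    ("P1", 4, True),
    ("P2", 3, False),
    ("P3", 2, False),
    ("P4", 2, False),
]


def patient_level(records, threshold=3):
    """Return {patient_id: (called_positive, has_high_grade)}."""
    max_score = defaultdict(int)
    high_grade = defaultdict(bool)
    for pid, pirads, hg in records:
        max_score[pid] = max(max_score[pid], pirads)   # max lesion score
        high_grade[pid] = high_grade[pid] or hg
    return {pid: (max_score[pid] >= threshold, high_grade[pid])
            for pid in max_score}


calls = patient_level(lesions)
fp = sum(1 for called, hg in calls.values() if called and not hg)
tn = sum(1 for called, hg in calls.values() if not called and not hg)
specificity = tn / (tn + fp)
```

With these made-up records, P2 is the only false positive and P3 and P4 are true negatives, so a patient-level specificity below 100% requires at least one patient like P2.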

Appreciate all the help and time. Understand that the publications mentioned are not yours, so it may be difficult to comment on them, but it will be helpful to understand if you share my confusion or as before, can remove it.

rcuocolo commented 3 years ago

I agree with your calculations, even though the 5% FN rate is reported as the upper limit of the expected errors. It might be lower in reality. Another issue with this analysis is that we do not have any information on the test set cases. Hypothetically, the test set could have different characteristics, even though this would not be logical, as the train and test sets should share similar features given that they originate from the same source. I checked the original PROSTATEx2 class csv to confirm that no cases are reported without a GG score (i.e., all lesions are presented as cancers with at least a 3+3 GS). Unfortunately, without the data for the test set lesions it is not possible to draw any certain conclusions, even though I admit the situation is puzzling. Part of the reason we focused exclusively on the training data was exactly the availability of more information (i.e., the class), even though the ground truth for non-biopsied lesions is somewhat weaker.

pritesh-mehta commented 3 years ago

Thanks Renato. Agreed, we cannot solve the issue relating to figure 4b described above without information on the PROSTATEx test set. I'll send Geert Litjens an email, and update here if I get an answer. Closing this for now.

pritesh-mehta commented 3 years ago

On re-reading the document entitled ProstateX2-DataInfo-Train.docx, I have come across a sentence that may add some clarity to the structure of the PROSTATEx findings. Regarding the findings included in PROSTATEx2, the sentence reads: "The findings are a subset of the prior PROSTATEx findings of cancer lesions with biopsy information." I think the key here is that, of all the lesions that were biopsied, only the cancer lesions (GG 1,2,3,4,5) were included in PROSTATEx2. This may indicate that there exist benign lesions (prostatitis, BPH, etc.) that were PI-RADS 3+ and biopsied, but not included in PROSTATEx2. This feels more realistic to me (though I'm not a clinician) than assuming that ALL benign lesions were prospectively marked PI-RADS = 2. Based on this, I believe the following regarding the categorization of clinically significant and clinically insignificant PROSTATEx findings:

Clinically significant = True: PI-RADS = 3+ lesions, underwent biopsy, found to be Gleason score 7+
Clinically significant = False: PI-RADS = 2 lesions AND PI-RADS 3+ lesions belonging to GG 1 that underwent biopsy AND PI-RADS 3+ benign lesions that underwent biopsy

This would make it impossible to infer the sensitivity/specificity for the PI-RADS = 3+ threshold, as we cannot infer whether a lesion not included in PROSTATEx2 is PI-RADS = 2 or PI-RADS 3+ benign. This would make my calculations above void. The existence of PI-RADS 3+ benign lesions could also explain why the specificity for the radiologist in graph 4b in the original publication is approx. 52%.
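The ambiguity can be made explicit with a small Python sketch. It encodes only the interpretation proposed in this comment, not an official rule of the dataset, and the function and label names are hypothetical.

```python
from typing import FrozenSet, Optional, Tuple


def interpret_finding(in_prostatex2: bool,
                      grade_group: Optional[int]) -> Tuple[bool, FrozenSet[str]]:
    """Return (clinically_significant, possible PI-RADS categories).

    Assumption: PROSTATEx2 holds only biopsied cancers (GG 1-5), while
    findings outside it are either PI-RADS = 2 or PI-RADS 3+ benign.
    """
    if in_prostatex2:
        if grade_group is None:
            raise ValueError("PROSTATEx2 findings are expected to have a grade group")
        # GG 2 and above corresponds to Gleason score 7+.
        return grade_group >= 2, frozenset({"3+"})
    # Outside PROSTATEx2: never clinically significant by label,
    # but the PI-RADS category cannot be recovered.
    return False, frozenset({"2", "3+"})
```

Because the second branch returns both categories, a true negative count for the PI-RADS 3+ threshold cannot be derived from dataset membership alone.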

How does that all sound?

rcuocolo commented 3 years ago

This may be the correct interpretation. In that case, all PROSTATEx2 lesions were surely PI-RADS 3+, but not all lesions in PROSTATEx and outside PROSTATEx2 were PI-RADS 2. From a practical point of view, this does not allow PI-RADS scores to be deduced with sufficient certainty. In general, I think there is little sense in trying to add PI-RADS scores to this dataset. As stated in my previous posts, there is no way to check a non-biopsied lesion retrospectively. Only the PROSTATEx2 set could be used, with blinded readers, but the data is heavily unbalanced towards clinically significant lesions compared to actual clinical practice (which may in and of itself bias a blinded reader). My suggestion would be to avoid these situations altogether, unless the original PROSTATEx maintainers were willing to provide additional clinical data. More than PI-RADS, PSA at the time of the exam would be much more interesting. It could also be worthwhile to contact them in the future to see if, given the amount of time passed since the original challenge, the test set classes could be publicly released. This would allow us to expand our efforts to those cases as well.

pritesh-mehta commented 3 years ago

Thanks Renato. Agree that we cannot deduce PI-RADS. I emailed Henkjan Huisman a few weeks ago to see if they could release PI-RADS scores, but the reply was that they do not currently have time to update for PI-RADS v2. I also asked if clinical variables like age, PSA, and PSA density could be released, but Henkjan does not want to, as they do not currently have the capacity to make the PROSTATEx challenges more clinical. I have not asked for labels for the test set. I think this can be closed now. Happy we were able to reach a greater understanding. I think you will need to update the file PROSTATEx_Classes.csv as a follow-up action, changing the "No biopsy" label to "No biopsy (PI-RADS = 2) / benign" or whatever you feel describes the situation best.

rcuocolo commented 3 years ago

Thank you for helping clear up this issue. I will shortly rename the label in the class csv file. "No biopsy information" is probably more accurate, as we cannot know which lesions underwent biopsy without finding cancer and which were not biopsied at all. Presenting "double information" (as in your example) could be confusing for someone who has not read this thread or does not know the dataset well.