CSV file - Githubissues

jnirschl commented 1 year ago

I have a question regarding the CSV files in this repository and CSV files in the Tiles.zip file (Zenodo).

What is the significance of the numeric value for each of the classes (cored, diffuse, CAA, negative)? They are not one-hot encoded, and the rows do not sum to 1, as they would for a probability distribution. For example, the second row in the screenshot below has CAA=2, Negative=0.1233, Flag=2, and Not sure=0.1233. It is easy to take the argmax and assume that is the ground truth, but I would like to better understand what these numbers mean for each column. Another image has Diffuse=1.9404 and Not sure=1. What do these numbers represent? Also, what is the significance of the numeric value for the "flag" column, if any?

[Screenshot from a CSV file in Tiles.zip (Zenodo)]

mjke commented 1 year ago

hi @jnirschl ,

> Also, what is the significance of the numeric value for the "flag" column, if any?

per the paper,

> Additional categories such as not sure or flag denoted uncertainty, image segmentation failures, or other special cases (Supplementary Fig. 3).

wrt your other questions, @ZiqiTang919 could you clarify? (cc @lise-minaud @sghandian @wongdaniel8 )

ZiqiTang919 commented 1 year ago

Hi @jnirschl, basically the number in each column is the count of the corresponding category in the image. I think understanding the entire process may help clarify the confusion.

During the image preprocessing step, a bounding box was automatically drawn for each candidate plaque. Then an image centered on each candidate was generated for labeling. The neuropathologist gave each candidate a label of 0 or 1 for each category. Finally, we combined all the labeled images to construct the training dataset. Note that the images for labeling (center-cropped on the bounding box) are different from the images for model training and validation (uniformly segmented from WSIs). The final label for a training image is the aggregation of all the original labels it contains. When a training image contains more than one bounding box, the number for that image can be greater than one. When only part of a bounding box is included in a training image, the original label is multiplied by the percentage of the area of intersection. That's why the number may be a decimal.
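To make the aggregation concrete, here is a minimal sketch of how such a value could arise. It is not the repository's code: the `Box` class, the function names, and the coordinates are hypothetical, and it assumes "percentage of the area of intersection" means the fraction of the bounding box's area that falls inside the training image.

```python
from dataclasses import dataclass, field

CATEGORIES = ("cored", "diffuse", "CAA")


@dataclass
class Box:
    """Hypothetical bounding box with per-category 0/1 labels."""
    x0: float
    y0: float
    x1: float
    y1: float
    labels: dict = field(default_factory=dict)  # category -> 0 or 1


def overlap_fraction(box, tile):
    """Fraction of the box's area that lies inside tile = (x0, y0, x1, y1).

    Assumption: this is what "percentage of the area of intersection" means.
    """
    tx0, ty0, tx1, ty1 = tile
    width = max(0.0, min(box.x1, tx1) - max(box.x0, tx0))
    height = max(0.0, min(box.y1, ty1) - max(box.y0, ty0))
    area = (box.x1 - box.x0) * (box.y1 - box.y0)
    return (width * height) / area if area > 0 else 0.0


def aggregate_labels(boxes, tile):
    """Sum each category's labels, weighted by how much of each box is in the tile."""
    totals = {category: 0.0 for category in CATEGORIES}
    for box in boxes:
        fraction = overlap_fraction(box, tile)
        for category in CATEGORIES:
            totals[category] += box.labels.get(category, 0) * fraction
    return totals


# Made-up example: one diffuse plaque fully inside the tile,
# and a second one that is cut in half by the tile's right edge.
tile = (0, 0, 256, 256)
boxes = [
    Box(10, 10, 50, 50, {"diffuse": 1}),
    Box(236, 100, 276, 140, {"diffuse": 1}),
]
result = aggregate_labels(boxes, tile)
```

In this made-up case the diffuse value is 1.0 + 0.5 = 1.5, which is the same kind of number as the Diffuse=1.9404 in the question: more than one box, with one of them only partly inside the image.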

mjke commented 1 year ago

great, thanks @ZiqiTang919 . am I correct in remembering that you discretize the labels for model training/etc? e.g., https://github.com/keiserlab/plaquebox-paper/blob/36d8c17e799a3d46259b4dbf01d53fc1756ebf21/2.1)%20CNN%20Models%20-%20Model%20Training%20and%20Development.ipynb?short_path=2dd3c08#L116

ZiqiTang919 commented 1 year ago

Yes, correct.
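For readers who want to do the same with the CSV values, discretizing could look roughly like the sketch below. The cutoff and the `>` comparison are placeholders, not values confirmed in this thread, so check the linked notebook for the ones actually used. The function name and the example row are also made up for illustration.

```python
def discretize(row, columns, threshold):
    """Return a copy of row with each listed column mapped to 0 or 1.

    Assumption: a value strictly above `threshold` counts as positive.
    """
    out = dict(row)
    for column in columns:
        out[column] = 1 if float(row[column]) > threshold else 0
    return out


# Illustrative row only; not taken from the dataset.
example_row = {"cored": 0.0, "diffuse": 1.9404, "CAA": 0.1233}

# 0.5 is a hypothetical cutoff chosen for this example.
binary_row = discretize(example_row, ["cored", "diffuse", "CAA"], threshold=0.5)
```

With this placeholder cutoff, the example row becomes cored=0, diffuse=1, CAA=0, so an image can be positive for several classes at once (multi-label), which is different from taking the argmax.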

mjke commented 1 year ago

thanks @ZiqiTang919

@jnirschl closing this, but please let us know if any questions remain

jnirschl commented 1 year ago

Thanks, that makes sense!

keiserlab / plaquebox-paper

CSV file #4