sciai-lab / batchlib

Batch processing for high-throughput screening data
MIT License

Which score do we use? #76

Open constantinpape opened 4 years ago

constantinpape commented 4 years ago

Moving the discussion from the mail thread here: Which measurement should we use for the final score?

This is not my area of expertise, so I am impartial on what to do here. However, since we have changed a lot over the last few days, I am strongly in favor of re-running everything with the current score before we change anything else, so that we don't change too many things at once.

cc @tischi @metavibor @imagirom

constantinpape commented 4 years ago

Comment from @tischi from the mail thread:

As expected, I am quite opinionated here :-)

Because we do widefield microscopy, I think we should compute the sum intensity for the scores and include the nucleus. The sum intensity has a direct biological interpretation: since we do widefield microscopy, the measured signal is proportional to the total amount of antibody in the cell. This would be a completely different story if this were confocal microscopy; then I would very strongly vote for the mean intensity excluding the nucleus, in which case the biological interpretation would be the concentration of antibody in the cytoplasm.

Including the nucleus for the sum makes sense, because there is some signal below and above the nucleus in the cytoplasm, which we capture with the widefield microscope. And since we compute the sum, it is not wrong to include the nucleus even if the intensity there is zero, because we just add zeros; if we exclude it, however, we lose signal.
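
For illustration, a minimal Python sketch of the two measurement options; all names here (`intensity`, `cells`, `nuclei`, `per_cell_scores`) are hypothetical, not batchlib API:

```python
# Sketch of the two measurement options discussed above (hypothetical names).
import numpy as np
from scipy import ndimage

def per_cell_scores(intensity, cells, nuclei):
    """intensity: serum-channel image; cells: per-cell label image;
    nuclei: boolean nucleus mask."""
    labels = np.unique(cells)
    labels = labels[labels != 0]  # label 0 = background

    # Sum intensity, nucleus included: in widefield, proportional to the
    # total amount of antibody per cell (out-of-focus light is integrated).
    sums = ndimage.sum_labels(intensity, labels=cells, index=labels)

    # Mean intensity, nucleus excluded: ~ antibody concentration in the
    # cytoplasm (a cell fully covered by its nucleus would give NaN here).
    cytoplasm = np.where(nuclei, 0, cells)
    means = ndimage.mean(intensity, labels=cytoplasm, index=labels)
    return labels, sums, means
```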

imagirom commented 4 years ago

I agree that we should rerun everything with the current score first. But at least regarding the first decision (mean vs. sum), both options are already computed for all the scores (replace the suffix _means with _sums in the column names of the image / well table).

So once we have the manual assessments in the database, we can check which score matches them best without recomputing anything. I also think that this should at least be a main criterion for the decision on what to use in the end.
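
For example, both variants could be pulled from the well table along these lines (a sketch; the CSV export and the `well_name` column are assumptions, only the `_means`/`_sums` suffix convention is from the comment above):

```python
# Sketch: collect both score variants from the well table for comparison.
import pandas as pd

wells = pd.read_csv("well_table.csv")  # hypothetical export of the well table
mean_cols = [c for c in wells.columns if c.endswith("_means")]
sum_cols = [c.replace("_means", "_sums") for c in mean_cols]
# Side-by-side view of each mean-based score and its sum-based counterpart:
comparison = wells[["well_name"] + mean_cols + sum_cols]
```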

metavibor commented 4 years ago

tischi made a good point with arguments for including the nucleus and for using the sum. I think we should include the nucleus, compute the sum, and see the results. I don't see a big issue with using the mean as we did so far, excluding the nucleus. The mean would translate into "concentration of the antibody per cell" in a biological sense (although it is going to be largely overestimated since we don't take a confocal image), and that is fine to measure. And it is obviously working. The sum would translate into "amount of antibody per cell", and that would also be good to measure. The caveat here is that if infected cells are for some reason larger/smaller than control cells (larger/smaller cells = more/less antibody binding due to variations in cell size), we would get results skewed one way or the other due to size effects. Since I did not observe any size effects, I think the sum should be tried (and in this case the nucleus should not be excluded). Tischi and I can continue the technical discussion offline.

metavibor commented 4 years ago

@imagirom what do you mean by "manual assessment in the database"?

imagirom commented 4 years ago

@metavibor I meant that once we have parsed your visual assessments and included them in the database (https://github.com/hci-unihd/batchlib/issues/58), we should compare to those and take the results into account when deciding which score to select.

constantinpape commented 4 years ago

@metavibor I meant that once we have parsed your visual assessments and included them in the database (#58), we should compare to those and take the results into account when deciding which score to select.

Just to clarify: we will use this database to keep all our results better organised. From your side we would just need the results of the manual assessment in some form (Excel sheets would be totally fine); we will then write a script to get them into the database.
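
Such an import script could be as simple as the following sketch (the file name, sheet layout, and use of SQLite are all assumptions for illustration; the actual database is discussed in #58):

```python
# Sketch: load manual assessments from an Excel sheet into a database.
import sqlite3
import pandas as pd

assessments = pd.read_excel("manual_assessment.xlsx")  # hypothetical file
with sqlite3.connect("results.db") as conn:            # hypothetical database
    assessments.to_sql("manual_assessment", conn,
                       if_exists="replace", index=False)
```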

tischi commented 4 years ago

@imagirom The task in this assay is to quantitatively measure how much brighter one cell population is compared to another, taking the average over many tens to hundreds of cells. This is not a task the human visual system is good at; computers are much better at such tasks. Thus I feel we cannot rely on human ground truth here, but should try to perform solid scientific measurements, as good as possible. Of course, it is very interesting and important to check what happened in images where the computer disagrees with some human assessment, but it is not per se obvious that the human judgement is the correct one. @metavibor what's your take on this?

tischi commented 4 years ago

And regarding the mean vs. sum discussion: I agree with @metavibor and have also mentioned this in some other issue: probably the first thing to check is whether the cell size (in number of pixels) differs between infected and control cells. Because if it is not the same, it could throw off both the mean- and sum-based measurements in various ways, depending on whether the cells have (a) a different average volume or (b) a different contact area with the coverslip at the same volume (something we cannot know from our data, unfortunately).
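
That check could look roughly like the following sketch (the inputs `cells` and `infected_labels` are hypothetical, not batchlib API):

```python
# Sketch: compare per-cell sizes (pixel counts) of infected vs. control cells.
import numpy as np
from scipy import stats

def compare_cell_sizes(cells, infected_labels):
    """cells: label image; infected_labels: ids classified as infected."""
    labels, sizes = np.unique(cells[cells != 0], return_counts=True)
    is_infected = np.isin(labels, list(infected_labels))
    infected_sizes, control_sizes = sizes[is_infected], sizes[~is_infected]
    # Non-parametric test, since size distributions are typically skewed.
    _, p_value = stats.mannwhitneyu(infected_sizes, control_sizes,
                                    alternative="two-sided")
    return np.median(infected_sizes), np.median(control_sizes), p_value
```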

akreshuk commented 4 years ago

@tischi , by this line of reasoning you'd need to introduce an independent validation method not related to human vision. I can imagine many such methods for a normal biological experiment, but not here. What would you suggest?

imagirom commented 4 years ago

@tischi I am sorry for being imprecise in my previous posts in this thread. The point I wanted to make was that, in my opinion, we should base the decision of which score to use (including mean vs. sum) not solely on the conceptual advantages/disadvantages, but also take into account the agreement of the score with some ground-truth annotations. I totally agree with you, and also with @metavibor as he mentioned in slack, that the information about which cohort the sample originated from is a better proxy for this than the visual inspection. @akreshuk I assume that is what @tischi was thinking of as an alternative. That being said, I still think that the visual inspection can be very useful to differentiate between failures of the automated analysis and actual false negatives/positives of the test.

akreshuk commented 4 years ago

Thanks, @imagirom , I have looked at the slack discussion now. It would be interesting to see how our estimate of the FP/FN rate of the readout would change depending on the error rate at the cohort level, which @metavibor mentioned cannot be guaranteed to be 100% correct. But that probably concerns FNs more than FPs.

tischi commented 4 years ago

I can imagine many such methods for a normal biological experiment, but not here. What would you suggest?

One thing could be to manually measure the intensities in some cells, e.g., draw ROIs in ImageJ and measure there, and then compare the measured values with the ones the automated analysis gives. So once we know which plates we actually want to publish, we should probably do this for a good number of cells. I did this for just two cells several days ago, just to get a feeling for what the numbers should be: https://github.com/hci-unihd/antibodies-analysis-issues/issues/14 We could do this more consistently with more cells and compare to the automated numbers.

@akreshuk what do you think? @metavibor do you have another idea?

@imagirom yes, comparing with the cohorts where we have a good idea whether they should show some immunity or not is of course also something we should do!

tischi commented 4 years ago

That being said, I still think that the visual inspection can be very useful to differentiate between failures of the automated analysis and actual false negatives/positives of the test.

@imagirom @akreshuk I think my issue here is that I am not sure what "visual inspection" means. If it means manually drawing ROIs and measuring intensities and intensity ratios in ImageJ, then I would say yes, of course! But I was not sure this was the definition.

akreshuk commented 4 years ago

Getting a feeling is something I always find useful, but unless segmentation fails, why would there be a difference between your manual score and the automated one, especially if you don't remove the nucleus? This is not really an independent readout like the cohort information. But this way (your way) you can at least estimate how dependent your score is on segmentation errors, which is by itself something we should know, especially if the signal is not always uniformly distributed.

So I would say (very cautiously, because what do I know, I haven't been so closely involved) that if you use sum(intensity) or mean(intensity), your only source of error is segmentation. Then you can have a user click on many correctly segmented cells and get a feeling for a distribution of scores that would be meaningful. This would be faster than drawing ROIs by hand and can be re-used for QC afterwards, as you would get all kinds of cell feature distributions.

tischi commented 4 years ago

@akreshuk

  1. As said, for sure comparing the results with the cohorts is something we should do!
  2. Yes, cell segmentation could be one source of error, so we should say something about the accuracy.
  3. Since the main readout of the assay right now is a ratio between the intensities of infected vs. non-infected cells, it matters for this score how we subtract a correct background. This may in fact be quite tricky, see e.g. the screenshot in this issue: https://github.com/hci-unihd/antibodies-analysis-issues/issues/46 Thus, if we report the ratio as the main readout, I think it would be good to have something to say about its accuracy.

FYI: Proper background subtraction in fluorescence microscopy is a big deal in bioimage analysis. By coincidence, yesterday night I got a mail from my colleagues at the ALMF asking for my help in preparing a whole document on "best practices in background subtraction" for our microscopy users. I also already had one project where we tried for several weeks and in the end gave up the whole project, because we could not figure out a robust way to subtract the background.
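
To make the discussion concrete, one common strategy (not necessarily the robust one we would need, as the experience above suggests) is to estimate a constant background from the pixels outside all segmented cells; a sketch with hypothetical inputs:

```python
# Sketch: estimate a constant background as a low quantile of the
# non-cell pixels and subtract it before computing scores.
import numpy as np

def subtract_background(intensity, cells, quantile=0.05):
    background = np.quantile(intensity[cells == 0], quantile)
    return intensity - background, background
```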

[EDIT: All of this also depends on whether (a) we mainly want to provide some sort of statistical score for whether the tested person is likely to have developed some immunity, or (b) we want to provide a biophysical measurement that tells something quantitative about the person's antibody binding to the virus]

Anyway, maybe more efficient to discuss these things once we have some actual numbers and maybe during a zoom meeting, e.g. on Monday afternoon.

constantinpape commented 4 years ago

I have been thinking about how to evaluate the different parts of the pipeline for the manuscript, and here is what I have in mind so far:

  1. Measure the segmentation accuracy (probably via average precision, AP), either using cross-validation or some held-out ground-truth images (@lorenzocerrone @wolny and I are working on some more gt images, so we should have enough to do this by the middle of next week).
  2. Measure the accuracy/specificity/sensitivity/F1-score of the infected-cell classification, also using the ground-truth images we are making for this right now. (From what we have seen so far, this is the part of the pipeline that needs to be improved the most! For now, @imagirom will use the ground truth we are producing to optimize his parameters, but maybe we need to actually train an RF on the features we already extract to make this work properly.)
  3. Measure the overall accuracy/specificity/sensitivity/F1-score of our readout. This is the most important measure, because it quantifies the whole pipeline and is what we actually care about. However, it is still not quite clear how to do this. The best candidate right now is to use the independent patient data Vibor has (i.e. the knowledge whether the patient corresponding to a well was infected or not). We will discuss this with Vibor in the next zoom meeting.

For now, @imagirom and I had planned to see what we can set up for 3. and then use it to determine how well the different score options would work; a minimal sketch of such an evaluation is below.
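
A sketch of what the evaluation in 3. could look like, assuming binary arrays `y_true` (1 = patient known positive from the cohort data) and `y_pred` (1 = our readout calls the well positive); these names and the use of sklearn are assumptions for illustration:

```python
# Sketch: score the well-level readout against the cohort information.
from sklearn.metrics import accuracy_score, f1_score, recall_score

def readout_metrics(y_true, y_pred):
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "sensitivity": recall_score(y_true, y_pred),               # TP / (TP + FN)
        "specificity": recall_score(y_true, y_pred, pos_label=0),  # TN / (TN + FP)
        "f1": f1_score(y_true, y_pred),
    }
```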

Regarding comparing to the manual measurements you propose: the issue I see with this is that it mostly measures segmentation quality, BUT introduces a rather arbitrary margin of error due to the fact that we measure intensities in two different ways (Fiji and python/pytorch based), where there is no reason to believe that one way is more correct than the other.

akreshuk commented 4 years ago

@tischi , background subtraction is a nasty thing indeed, but how would it affect the automatic and manual measurements differently? But you are completely right to point out multiple sources of error; it looks like there are several pieces in this experiment that affect the final error-rate estimate, and we should eventually isolate and estimate the influence of each of them if we want to produce a believable number. No need to discuss this right now though, happy to continue in some meeting once you guys have time to breathe.

tischi commented 4 years ago

I am starting to look at the two plates that we would like to publish. Pasting some initial plots here; probably @imagirom and @Steffen-Wolf could do this more systematically in Python.

Plate 943: Good agreement between ratio of sums and ratio of means

The white lines are y = n * x. The dots are very close to the y = 1 * x line.

[scatter plot: ratio of sums vs. ratio of means per well]

Plate 943: Robust z-score and ratio sum mostly agree

Pretty much on one line, with a few outliers. We could dig into those and see whether we can understand why they differ (and what the biological interpretation of this may be).

[scatter plot: robust z-score vs. ratio of sums per well]

Plate 435: dos_sums vs ratio_sums

I think (@Steffen-Wolf correct me if I am wrong):

dos =  ( i - c ) / ( i + c )
i... infected
c... control

...one could rewrite...

dos = ( i/c - 1 ) / ( i/c + 1 )

So it is essentially just a mathematical flavour of the ratio i/c.

Not sure, but I don't yet feel comfortable with this score because I don't fully get its mathematical/biophysical interpretation... maybe someone can help?

I think one advantage of the dos is that we do not have the issue of dividing by 0 in case the control cells become very dim!

[scatter plot: dos_sums vs. ratio_sums per well]
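
The relation between the dos and the ratio is easy to check numerically (a sketch using made-up intensity values):

```python
# Sketch: the dos is a monotone transform of the ratio r = i / c,
# bounded in (-1, 1) and well-defined even as c approaches 0.
import numpy as np

i = np.array([2.0, 5.0, 10.0, 10.0])   # made-up infected intensities
c = np.array([1.0, 1.0, 1.0, 1e-9])    # made-up control intensities
dos = (i - c) / (i + c)
r = i / c
print(dos)                 # stays finite even for a tiny control value
print((r - 1) / (r + 1))   # identical to dos, per the rewrite above
```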

imagirom commented 4 years ago

@tischi Very interesting! Quick disclaimer: in the current version all scores are computed with the nuclei filtered out. There is an option to include the nuclei, but that affects the mean as well.

tischi commented 4 years ago

@imagirom

What about this: we compute the sum-based scores with the nucleus included and the mean-based scores with the nucleus excluded.

I think biologically and from a microscopy point of view this makes sense, and we do not have to have too many versions of everything.

@metavibor , ok?!

metavibor commented 4 years ago

@tischi @imagirom ok - for "sum" nucleus should be in, for "mean" it should be out

Steffen-Wolf commented 4 years ago

Hi @tischi ,

thank you for your analysis. With regards to the dos:

First, let me say that the ratio seems like the most intuitive metric to use and the one we "should" use.

I proposed the dos because it has two properties that make it more robust than the ratio:

  1. As you said, we are unlikely to divide by zero in the dos.
  2. The difference automatically removes the (constant) background illumination.

Point 2 was important at the stage of our analysis where we sometimes overestimated the background illumination. I think with our current background-estimation method the benefits of the dos are getting slim.

Best, Steffen

tischi commented 4 years ago

@Steffen-Wolf

I still find the dos interesting because of (1), because the control cells really could get very dim. However, I am not sure (2) is correct: a constant background b cancels in the numerator but not in the denominator, dos = ( (i+b) - (c+b) ) / ( (i+b) + (c+b) ) = ( i - c ) / ( i + c + 2b ), so the background would still influence the score.

tischi commented 4 years ago

@metavibor @imagirom @constantinpape I started exploring whether sum or mean based scores could be better. One idea I had was to see how variable either of the scores is among the cell population. To this end, I computed robust versions of the coefficient of variation (sdev / mean -> mad / median, https://en.wikipedia.org/wiki/Coefficient_of_variation ):

cv_control_sums =  IgG_control_mad_sums / IgG_control_q0.5_sums
cv_control_means =  IgG_control_mad_means / IgG_control_q0.5_means

I did this for each well, and then computed the median over the wells of each plate:

| plate_name | median(cv_control_means) | median(cv_control_sums) |
| --- | --- | --- |
| 20200417_132123_311 | 0.118 | 0.433 |
| 20200417_152052_943 | 0.120 | 0.439 |
| 20200417_203228_156 | 0.119 | 0.417 |
| 20200420_152417_316 | 0.116 | 0.409 |
| 20200420_164920_764 | 0.116 | 0.402 |
| plate1rep3_20200505_100837_821 | 0.185 | 0.422 |
| plate2rep3_20200507_094942_519 | 0.120 | 0.458 |
| plate5rep3_20200507_113530_429 | 0.142 | 0.436 |
| plate6rep2_wp_20200507_131032_010 | 0.306 | 0.555 |
| plate7rep1_20200426_103425_693 | 0.185 | 0.433 |
| plate8rep1_20200425_162127_242 | 0.202 | 0.446 |
| plate8rep2_20200502_182438_996 | 0.146 | 0.462 |
| plate9_2rep1_20200506_163349_413 | 0.168 | 0.475 |
| plate9rep1_20200430_144438_974 | 0.210 | 0.401 |
| titration_plate_20200403_154849 | -0.0264 | -0.479 |

(Medians taken over wells with na.rm = T.)

This shows a lower CV for the mean-based scores. From visual inspection I would in fact have expected the opposite, because to me larger cells appeared to have a dimmer mean signal, and I would have expected this to compensate when computing the sum. However, due to the cell cycle, I guess we can expect cell volume changes of about a factor of 2. If these volume changes correlate with the cell's contact area on the coverslip, the mean intensity could in fact be expected to be more stable than the sum.

Anyway, for practical reasons (limited time), I would be OK with just going for the mean-based scores, as they appear less variable.
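
For reference, the same robust-CV computation as a pandas sketch (the mad/median column names are taken from the snippet above; the CSV export and table layout are assumptions):

```python
# Sketch: robust CV (mad / median) per well, then the median over the
# wells of each plate (NaNs are skipped, matching na.rm = T).
import pandas as pd

wells = pd.read_csv("well_table.csv")  # hypothetical export
wells["cv_control_means"] = (wells["IgG_control_mad_means"]
                             / wells["IgG_control_q0.5_means"])
wells["cv_control_sums"] = (wells["IgG_control_mad_sums"]
                            / wells["IgG_control_q0.5_sums"])
per_plate = wells.groupby("plate_name")[
    ["cv_control_means", "cv_control_sums"]].median()
```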

constantinpape commented 4 years ago

We have all agreed now to stick with the mean score for now; Tischi's measurements of the distribution spread provide a good argument. I will still leave this open for the time being, it's a nice discussion issue :)

tischi commented 4 years ago

@imagirom @metavibor @constantinpape

I had another idea for the z-score. Since the intensity in the infected cells appears to be quite variable (see e.g. here https://github.com/hci-unihd/antibodies-analysis-issues/issues/80) and also since we have many more infected cells than controls, maybe a better way to compute the z-score would be:

z-score = ( median( infected ) - median( ctrl ) ) / mad ( infected )

The difference being that we divide by mad( infected ) rather than mad( ctrl ).

Reasoning:

  1. We typically have many more data points to measure mad( infected ) and thus the value should be more stable.
  2. There is a significant variability in the infected cells, which we will take into account like this.

One could of course also think about something like the following, in order to take both mads into account:

score = ( median( infected ) - median( ctrl ) ) / ( 0.5 * ( mad ( infected ) + mad ( ctrl ) ) )

or combining them in quadrature (root sum of squares):

score = ( median( infected ) - median( ctrl ) ) / sqrt( ( mad ( infected )^2 + mad ( ctrl )^2 ) )

Looks sensible to me, but I do not know what this is called in statistics.

It looks very similar to equation 9.1.4 here: https://stats.libretexts.org/Bookshelves/Introductory_Statistics/Book%3A_Introductory_Statistics_(Shafer_and_Zhang)/09%3A_Two-Sample_Problems/9.01%3A_Comparison_of_Two_Population_Means-_Large%2C_Independent_Samples

...however, we would not consider the population sizes n1 and n2.
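
For concreteness, the three variants side by side as a sketch (per-cell intensity arrays as inputs; note that scipy's mad uses a different default scale factor than R's, which rescales all three scores by the same constant):

```python
# Sketch: the three proposed robust z-score variants.
import numpy as np
from scipy.stats import median_abs_deviation as mad

def robust_scores(infected, ctrl):
    diff = np.median(infected) - np.median(ctrl)
    return {
        "mad_infected":   diff / mad(infected),
        "mad_average":    diff / (0.5 * (mad(infected) + mad(ctrl))),
        "mad_quadrature": diff / np.sqrt(mad(infected)**2 + mad(ctrl)**2),
    }
```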

Any opinions?


And, just to have it somewhere: a cell-based plot of mean IgG intensity against mean virus intensity (these are three wells from plate K25; A01 = positive, B08 = negative, C01 = borderline-positive).

image