saeyslab / CytoNorm

R library to normalize cytometry data

Cytonorm creating highly negative or very large values #28

Closed hrj21 closed 3 years ago

hrj21 commented 3 years ago

Hi all,

Thank you for the excellent package. I have used CytoNorm on previous projects and been really happy with the results. On my current CyTOF project, I'm finding that after batch normalisation, CytoNorm seems to create some highly negative alignment values, and also some very high positive ones.

The plots below show a couple of example markers plotted against themselves pre and post alignment.

[Figures: FceR and CD45_aligned, plotted pre vs post alignment]

I'm afraid I'm not able to provide a reproducible example of this yet because I don't have permission to share the data, but have you seen this before and do you know what causes it? The number of events that are very low or very high is very small, so I'm happy excluding them, but I'm worried this is a symptom of something wrong?

The data have been compensated (nnls), asinh(x/5) transformed, and then put through CytoNorm.

SofieVG commented 3 years ago

Hi @hrj21,

I have seen similar issues; they typically occur when some extrapolation is happening. This means there are some cells in the real samples which fall outside the range of values present in the training samples, and in those regions the spline may no longer represent the learned shift well. I have typically seen it more with high positive values than with high negative values, though, so that surprises me a little (in my experience with CyTOF there are usually enough zeros for the low end to be well represented everywhere). I typically exclude those cells as well.

One approach to limit the effect is to set the "limit" parameter in the normParams argument. The values in this parameter are added as identity points during the estimation of the spline. Passing values outside the expected range (e.g. 0 and 10) can help ensure the spline stays closer to the identity function outside the trained region.

To investigate further, you could look at the trained splines by setting "plot" to TRUE. This will, for every cluster, show the quantiles detected and the splines fitted through them.
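For reference, a rough sketch of how those two suggestions might look together. The file lists, batch labels, channel vector, and the limit values are placeholders, and the FlowSOM.params values are only illustrative; check `?CytoNorm.train` and `?QuantileNorm.train` in your installed version for the exact argument names and defaults:

```r
library(CytoNorm)
library(flowCore)

# asinh(x/5) transform, matching the preprocessing described above
transformList <- flowCore::transformList(
  channels,                                   # placeholder: channels to normalize
  flowCore::arcsinhTransform(a = 0, b = 1/5, c = 0)
)

model <- CytoNorm.train(
  files         = train_files,                # placeholder: batch control FCS files
  labels        = train_batches,              # placeholder: batch label per file
  channels      = channels,
  transformList = transformList,
  FlowSOM.params = list(nCells = 10000, xdim = 10, ydim = 10,
                        nClus = 5, scale = FALSE),
  normParams = list(nQ = 101,
                    limit = c(0, 10)),        # identity anchors outside the trained range
  plot = TRUE,                                # inspect quantiles and fitted splines per cluster
  seed = 1
)
```

The limit = c(0, 10) anchors pin the spline to the identity at 0 and 10 on the transformed scale, so cells outside the training range are shifted less aggressively.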

Hope this helps, Sofie

hrj21 commented 3 years ago

Ah that makes sense. Perhaps the batch control files were not truly representative of the ranges found in the experimental files. I will try the limit parameter and the plot argument as you suggest, thank you.

SamGG commented 3 years ago

Hi, I am not using CytoNorm yet, but I am interested in it. I am not in favor of applying a transformation/function to the same channel that differs for each FlowSOM cluster; IMHO, this is what produces what we observe in those plots. Second, I don't see the aim of normalizing CD45, which seems to result from a gating of the positive cells. I would be very interested in hearing both your views on these points. Best, Samuel

SofieVG commented 3 years ago

Hi @SamGG ,

If you prefer not to have separate transformations for each FlowSOM metacluster, you can directly use the QuantileNorm.train and QuantileNorm.normalize functions, which do not involve a FlowSOM step.

When using the per-metacluster approach, it is indeed important to ensure that the clustering level is "larger" than the batch effect. This can be checked, e.g., by inspecting the FlowSOM object used and with the testCV function. We saw some cases in which different cell types were impacted differently by the batch effect (as described in our paper, https://onlinelibrary.wiley.com/doi/full/10.1002/cyto.a.23904), where applying the same function everywhere would not optimally resolve the issue. The number of metaclusters I use in this step is typically lower than the number of metaclusters I use for the actual downstream analysis.

However, in general I would agree: the simpler the algorithm, the smaller the chance for strange artefacts to occur. So I also prefer to keep things as simple as possible, while still having sufficient complexity to resolve the batch issues detected.
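A sketch of that direct, cluster-free route. File lists, batch labels, channels, and the transform lists are placeholders, and nQ is only the commonly used value; consult `?QuantileNorm.train` and `?QuantileNorm.normalize` for the exact signatures in your version:

```r
library(CytoNorm)

# Train one spline per channel on the batch controls,
# with no FlowSOM clustering step
model <- QuantileNorm.train(
  files         = train_files,    # placeholder: batch control FCS files
  labels        = train_batches,  # placeholder: batch label per file
  channels      = channels,       # placeholder: channels to normalize
  transformList = transformList,
  nQ   = 101,                     # number of quantiles per channel
  plot = TRUE                     # inspect the fitted splines
)

# Apply the model to the experimental samples; normalized FCS
# files are written to outputDir
QuantileNorm.normalize(
  model  = model,
  files  = sample_files,          # placeholder: experimental FCS files
  labels = sample_batches,        # placeholder: batch label per file
  transformList         = transformList,
  transformList.reverse = transformList.reverse,
  outputDir = "Normalized"
)
```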

SamGG commented 3 years ago

Hi @SofieVG ,

I tried the QuantileNorm functions and found them simpler and safer. I think I already agreed on that in another question/issue.

I read your article, but I was not convinced by the figures. I am very glad that you have shared points of view that I didn't grasp from the article. Maybe I missed a) the criterion for deciding whether or not to cluster, and b) the low number of clusters, which makes sense.

Thanks a lot for sharing your views; it is always a pleasure to read from you.

@hrj21 I would be glad to hear from you too, for example about the number of clusters you defined, whether you tried the alternative functions proposed by Sofie, or any views on batch correction.

@SofieVG if we set the number of clusters to 1, does CytoNorm fall back to the QuantileNorm functions? Will CytoNorm complain?

hrj21 commented 3 years ago

Hi @SamGG, I understand your apprehension about applying different splines per cluster, but I agree with @SofieVG that different cell populations can be impacted to differing degrees by a batch effect. For example, imagine running an experiment across 5 runs/batches, creating a new antibody cocktail for each batch. Say you add a little more of the CD4 antibody in one run, then a little less CD11b in another, through pipetting error (an error that isn't constant across antibodies); cell types defined by those antigens will then experience different batch effects from those that aren't. I tend to cluster into only the major cell types for CytoNorm, and use the testCV() function to help tune the number of clusters a little.
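For anyone following along, a hedged sketch of that tuning step. The file list, channel vector, cell count, and candidate cluster numbers are all placeholders; check `?prepareFlowSOM` and `?testCV` in your installed CytoNorm version for the exact arguments:

```r
library(CytoNorm)

# Build the FlowSOM object on the batch controls, then check how
# stable the cluster proportions are across batches. High CVs
# suggest the clustering is capturing the batch effect itself
# rather than biology, so fewer (meta)clusters may be safer.
fsom <- prepareFlowSOM(
  train_files,                  # placeholder: batch control FCS files
  channels,                     # placeholder: channels to cluster on
  nCells = 10000,
  FlowSOM.params = list(xdim = 10, ydim = 10,
                        nClus = 10, scale = FALSE)
)

cvs <- testCV(fsom, cluster_values = c(3, 5, 10, 15))
```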