saeyslab / CytoNorm

R library to normalize cytometry data

Choosing appropriate reference control #15

Closed Troylimyj closed 3 years ago

Troylimyj commented 4 years ago

Dear Sofie,

I noticed that CytoNorm does not normalise channels where there is no signal in the control sample. How would you recommend getting around this? I understand that choosing an appropriate control sample is crucial, but it will be difficult to find a sample with signal in all channels. Alternatively, would you run the algorithm twice, with two different control samples, in order to cover all the channels?

SamGG commented 4 years ago

Hi, I would run CytoNorm with two sets of reference samples. Each channel should be normalized only once, by one set or the other. Channels in common would serve as controls. Not tested, but that's what I would do. Best.
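To make the bookkeeping concrete, here is a minimal sketch of how the channels could be divided between two reference sets. It is plain Python, not CytoNorm's R API, and it is as untested on real data as the suggestion itself. The function name `split_channels` and the channel lists are hypothetical, and assigning the shared channels to set A is an arbitrary choice made only for illustration.

```python
def split_channels(signal_in_a, signal_in_b):
    """Assign every channel to exactly one reference set.

    signal_in_a / signal_in_b: channels judged to have usable signal in
    reference set A / B (hypothetical inputs, decided by inspecting the controls).
    """
    a, b = set(signal_in_a), set(signal_in_b)
    return {
        # Shared channels are assigned to A here (arbitrary choice).
        "normalize_with_a": sorted(a),
        # Channels that only set B can cover.
        "normalize_with_b": sorted(b - a),
        # Channels present in both sets, usable as a consistency check.
        "shared_controls": sorted(a & b),
    }


# Made-up example: set A lacks activation markers, set B has them.
plan = split_channels(
    ["CD3", "CD19", "CD4"],
    ["CD3", "CD19", "CD38", "HLADR"],
)
for role, channels in plan.items():
    print(role, channels)
```

One way to read "channels in common would serve as controls" is to normalize the shared channels with each set separately and compare the results; large disagreement would suggest the two reference sets are not interchangeable.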

tomashhurst commented 4 years ago

@Troylimyj do you mean in the sense that reference controls might not have activation markers, and things like that?

This is always going to be a fundamental restriction to how alignment works -- in the case of CytoNorm, the reference controls are unlikely to have 'activation' markers (e.g. HLADR and CD38 on T cells will be elevated in disease patient blood, but not in healthy controls etc). One approach here is that you only align the stable channels where you do have expression in the reference controls (e.g. CD3, CD19 etc) and just leave the dynamic channels (CD38, etc) as raw. You can then cluster on the aligned stable markers to get the major subsets, and then investigate the dynamic channels more specifically. It does mean that you might find batch effects in CD38 expression, but if you are essentially cutting this down to +/- expression, then it's pretty easy to determine a cutoff for each sample using a simple gate (though reading out the raw MFI would still include a batch effect).
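A toy illustration of that last point, as a sketch only: the intensities, the constant batch offset, and the cutoffs are all made up, and the helper `percent_positive` is hypothetical. It assumes the batch effect is a simple shift and that the gate is placed per sample between the negative and positive populations.

```python
from statistics import median


def percent_positive(values, cutoff):
    """Percentage of events above a per-sample cutoff (a simple gate)."""
    return 100.0 * sum(v > cutoff for v in values) / len(values)


# Made-up raw CD38 intensities. Batch 2 is batch 1 plus a constant
# offset, standing in for an uncorrected batch effect.
shift = 0.5
batch1 = [0.1, 0.2, 0.3, 2.0, 2.2]
batch2 = [v + shift for v in batch1]

# Cutoffs chosen per sample, so the gate follows the shifted populations.
cutoff1 = 1.0
cutoff2 = 1.0 + shift

pct1 = percent_positive(batch1, cutoff1)
pct2 = percent_positive(batch2, cutoff2)

# Median intensity of the raw channel still carries the offset.
mfi1 = median(batch1)
mfi2 = median(batch2)

print("percent positive:", pct1, pct2)
print("median intensity:", mfi1, mfi2)
```

With a per-sample gate the percentage of positive events matches across the two batches, while the median intensity differs by the offset.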

This relates to how you want to use clustering in your analysis. One method is to cluster on everything (stable and dynamic markers) and you'll get clusters for every possible phenotype (i.e. separate clusters for T cells and 'activated' T cells etc). Alternatively, you can just cluster on the stable markers to get the major subsets, and then look for changes on activation markers within each of those clusters. My above solution to your problem mostly lends itself to the second clustering approach.
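A minimal sketch of the second approach, assuming the cluster labels have already been obtained by clustering on the aligned stable markers only. The events, labels, and values below are made up, and `dynamic_marker_by_cluster` is a hypothetical helper, not part of CytoNorm or any clustering package.

```python
from collections import defaultdict
from statistics import median

# Each event carries a cluster label (from the stable markers), the
# sample group, and the raw value of one dynamic marker.
events = [
    {"cluster": "T cells", "group": "healthy", "CD38": 0.2},
    {"cluster": "T cells", "group": "healthy", "CD38": 0.4},
    {"cluster": "T cells", "group": "patient", "CD38": 1.8},
    {"cluster": "T cells", "group": "patient", "CD38": 2.2},
    {"cluster": "B cells", "group": "healthy", "CD38": 0.9},
    {"cluster": "B cells", "group": "patient", "CD38": 1.1},
]


def dynamic_marker_by_cluster(events, marker):
    """Median of a dynamic marker within each (cluster, group) pair."""
    grouped = defaultdict(list)
    for event in events:
        grouped[(event["cluster"], event["group"])].append(event[marker])
    return {key: median(values) for key, values in grouped.items()}


summary = dynamic_marker_by_cluster(events, "CD38")
for (cluster, group), value in sorted(summary.items()):
    print(cluster, group, value)
```

The dynamic marker never influences the cluster assignment here; it is only summarised within clusters, so any batch effect left in the raw channel affects the readout but not the subset definitions.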

tomashhurst commented 4 years ago

@Troylimyj there is another option, but I raise it more as a discussion point, rather than a recommendation. There are some hacks where you can align entire batches with each other, assuming you have a range of normal and 'activated' controls in each batch. This approach gets a little more crazy, and there is a lot more work to validate it. It can possibly address the limitations you pointed out, but it can also introduce artefacts into your data that are not actually there.