saeyslab / CytoNorm

R library to normalize cytometry data
33 stars, 6 forks

Any plans for a python implementation? #45

Closed hrj21 closed 1 month ago

hrj21 commented 1 month ago

Not an issue per se so please close if you answer. I've been delving into Python for flow recently, with one of the stark differences being there seems to be no control sample-based batch correction tool. As CytoNorm is so fundamental to our cytometry pipelines now, is there any plan for an implementation to be released for Python?

Best wishes and keep innovating! Hefin

tomashhurst commented 1 month ago

Your timing is impeccable: https://www.biorxiv.org/content/10.1101/2024.07.19.604225v1.full.pdf



hrj21 commented 1 month ago

Wow! Outstanding! I promise I'm not psychic...

SamGG commented 1 month ago

@SofieVG Great job! Some comments below, in case some of you can help me. Are marker and channel different layers of information? See p. 4: "per marker, cluster, channel and batch". It would be great to know which dataset is used in Figure 1C. I don't understand what a point represents in Fig. 1C (left). In Fig. 1D, I don't really understand how to interpret EMD and MAD; I don't find it straightforward. All MAD values are lower than 2, which I interpret as the MAD being mainly influenced by the negative peak, so it does not show me how the positive peaks are transformed. Concerning EMD, I don't see against which reference the EMD is computed. Also, the Python language is spelled with an upper-case P.

TarikExner commented 1 month ago

Dear Samuel,

thank you very much for the comments!

Regarding your first point: That is a typo, marker and channels are the same entity and do not carry additional information.

For Figure 1C, we used Sofie's dataset (the first dataset, labeled "Van Gassen" in Table 1). The left panel of 1C is a conventional scatter plot, where each dot corresponds to a single event. For that plot, the data were subsampled to 500 events so as not to overcrowd it.

Regarding Figure 1D: I do agree that the description of the plots could/should be much more elaborate and explanatory.

So for the MAD, we basically treat this as a measure of "biological signal" and ideally, the MAD should not be altered after normalization but rather be conserved. Therefore, the ideal point distribution would be on the line y = x. We defined MAD cutoffs and marked them as a red line in order to identify channels where this conservation does not hold true entirely. The MAD is calculated per sample and channel and plotted as such.
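The MAD check described above can be sketched in a few lines of standard-library Python. This is only an illustration, not the CytoNormPy implementation: the nested-dict data layout, the function names, the unscaled MAD definition and the cutoff of 0.25 are all assumptions made for the example.

```python
from statistics import median


def mad(values):
    """Unscaled median absolute deviation of a sequence of numbers."""
    center = median(values)
    return median([abs(v - center) for v in values])


def mad_per_sample_and_channel(samples):
    """Map {sample: {channel: [values]}} to {(sample, channel): MAD}."""
    return {
        (sample, channel): mad(values)
        for sample, channels in samples.items()
        for channel, values in channels.items()
    }


def flag_mad_changes(before, after, cutoff):
    """Return the (sample, channel) pairs whose MAD moved by more than `cutoff`.

    Points on the line y = x have a difference of 0; `cutoff` plays the role
    of the red lines around that diagonal (hypothetical value, see below).
    """
    mad_before = mad_per_sample_and_channel(before)
    mad_after = mad_per_sample_and_channel(after)
    return sorted(
        key for key in mad_before
        if abs(mad_after[key] - mad_before[key]) > cutoff
    )


# Toy data: CD3 is only shifted (MAD conserved), CD4 is compressed (MAD shrinks).
before = {"s1": {"CD3": [0.0, 1.0, 2.0, 3.0, 4.0], "CD4": [1.0, 1.0, 2.0, 5.0, 9.0]}}
after = {"s1": {"CD3": [0.5, 1.5, 2.5, 3.5, 4.5], "CD4": [1.0, 1.0, 1.5, 2.0, 3.0]}}

flagged = flag_mad_changes(before, after, cutoff=0.25)  # 0.25 is an assumed cutoff
print(flagged)
```

A pure shift leaves the MAD unchanged, so only the compressed channel is reported.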

If I understand your point correctly, you are concerned that in cases with a large fraction of negative cells the MAD might not be the correct measure to evaluate the conservation of variability in marker-positive cells? I do see your point! A possible way to tackle this might be to evaluate the MAD per cluster and channel, rather than per channel alone. That way, we could increase the "concentration" of marker-positive cells in a CytoNorm-intrinsic way and choose the clusters with the highest expression in order to evaluate the conservation in marker-positive cells, or at least approximate it better. We already implement a cell-type-specific MAD calculation, which, however, needs manual annotation and is less fine-grained than clustering.
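A rough sketch of the per-cluster idea, for a single channel: group events by cluster label, pick the cluster with the highest median expression, and compute the MAD there. The function name, the input layout and the use of the median to rank clusters are assumptions for illustration; none of this is taken from CytoNormPy.

```python
from collections import defaultdict
from statistics import median


def mad(values):
    """Unscaled median absolute deviation of a sequence of numbers."""
    center = median(values)
    return median([abs(v - center) for v in values])


def mad_of_brightest_cluster(values, cluster_labels):
    """Return (cluster, MAD) for the cluster with the highest median expression.

    `values` holds one channel's expression per event and `cluster_labels`
    the cluster assigned to each event (same order, same length).
    """
    by_cluster = defaultdict(list)
    for value, label in zip(values, cluster_labels):
        by_cluster[label].append(value)
    brightest = max(by_cluster, key=lambda label: median(by_cluster[label]))
    return brightest, mad(by_cluster[brightest])


# Toy channel: six negative events (cluster 0) and three positive ones (cluster 1).
values = [0.0, 0.1, 0.2, 0.1, 0.0, 0.2, 3.0, 4.0, 5.0]
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1]

print(mad(values))                               # dominated by the negative peak
print(mad_of_brightest_cluster(values, labels))  # spread of the positive cluster
```

On this toy channel the whole-channel MAD reflects the negative peak, while the per-cluster value captures the spread of the positive events.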

Regarding the EMD: After normalization, we expect the EMD, as a measure of batch effect, to become smaller (and ideally 0). The red line indicates where the EMD would stay constant, meaning that dots above this line represent channels where the EMD was successfully lowered after normalization. For the calculation, we take the maximum pairwise EMD between batches per channel, first for the unnormalized samples and then for the normalized ones. We plot them based on the annotations: it is possible to plot them per cell type and channel (as in the original CytoNorm paper) or only per channel (as in the preprint).
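As a minimal sketch of "maximum pairwise EMD between batches per channel", the following computes a one-dimensional earth mover's distance directly from raw values (the area between the two empirical CDFs) and takes the maximum over all batch pairs. The helper names and the toy data are hypothetical, and the actual implementation may differ, for example by working on binned histograms.

```python
from bisect import bisect_right
from itertools import combinations


def emd_1d(a, b):
    """1D earth mover's distance: area between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    grid = sorted(set(a) | set(b))
    total = 0.0
    for left, right in zip(grid, grid[1:]):
        cdf_a = bisect_right(a, left) / len(a)  # fraction of a that is <= left
        cdf_b = bisect_right(b, left) / len(b)  # fraction of b that is <= left
        total += abs(cdf_a - cdf_b) * (right - left)
    return total


def max_pairwise_emd(batches):
    """Map {batch: [values of one channel]} to the largest EMD over all batch pairs."""
    return max(
        emd_1d(batches[x], batches[y]) for x, y in combinations(batches, 2)
    )


# Toy channel: batch B is shifted by 1.0 before and by only 0.1 after normalization.
unnormalized = {"A": [0.0, 1.0, 2.0], "B": [1.0, 2.0, 3.0], "C": [0.0, 1.0, 2.0]}
normalized = {"A": [0.0, 1.0, 2.0], "B": [0.1, 1.1, 2.1], "C": [0.0, 1.0, 2.0]}

emd_before = max_pairwise_emd(unnormalized)
emd_after = max_pairwise_emd(normalized)
print(emd_before, emd_after)
```

In practice, `scipy.stats.wasserstein_distance` computes the same one-dimensional distance and would be the more usual choice; the hand-written version only keeps the example dependency-free.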

I hope that answers your questions! Thanks again for the constructive feedback!

SamGG commented 1 month ago

Dear Tarik, Thanks for your detailed answer. Maybe you should add these descriptions to the M&M. Concerning MAD, you caught my point. Evaluating the MAD per cluster and marker is interesting, but it raises the question of how to match clusters before and after normalisation if these two clusterings are not the ones used in CytoNorm. Concerning EMD, now I get the calculation. A complementary criterion would be the median pairwise EMD, in order to be less sensitive to outliers. IMO, EMD is still not sensitive enough when a large fraction of cells are negative. I hope CytoNormPy will be accepted soon.
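To illustrate the difference between the two summaries, here is a small sketch that takes already computed pairwise EMD values for one channel and reports both the maximum and the median. The batch names and the numbers are invented for the example, with batch E acting as a single outlier.

```python
from statistics import median


def summarize_pairwise_emd(pairwise):
    """Map {(batch_x, batch_y): EMD} to its maximum and median."""
    values = list(pairwise.values())
    return {"max": max(values), "median": median(values)}


# Invented pairwise EMD values for five batches; only batch E is far from the rest.
pairwise = {
    ("A", "B"): 0.10, ("A", "C"): 0.12, ("A", "D"): 0.11,
    ("B", "C"): 0.09, ("B", "D"): 0.13, ("C", "D"): 0.10,
    ("A", "E"): 0.90, ("B", "E"): 0.95, ("C", "E"): 0.92, ("D", "E"): 0.88,
}

summary = summarize_pairwise_emd(pairwise)
print(summary)
```

The maximum is driven entirely by the pairs involving batch E, whereas the median stays close to the typical distance between the other batches, so the two criteria answer slightly different questions.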