saeyslab / CytoNorm

R library to normalize cytometry data

Infinite max range on normed FCS files #4

Open emrizzi opened 4 years ago

emrizzi commented 4 years ago

Hello - Firstly thank you so much for this code, it will be a game changer for analyzing CyTOF data between batches.

It has worked perfectly for me to normalize the FCS files and analyze the resulting files with other R packages. However, some of my collaborators do not have coding experience and prefer the user-friendly versions of viSNE and FlowSOM in Cytobank, and I've had trouble getting the normalized FCS files to be compatible with Cytobank. Originally I thought it might be an issue with my own FCS files, so I then normalized the FlowRepository files provided with the package, and I think the issue lies in the normalized output files.

The algorithm changes the max range for the expression of each channel in a way that causes infinite outputs for some channels, as shown below:

[screenshots: per-channel max range values reported as Inf in the normalized FCS files]

Unfortunately the code in Cytobank requires the max range to be a finite value in order to do any higher order analyses (viSNE, FlowSOM, CITRUS, etc.). I've tried to play around with the code a bit to manually set the max range but haven't been successful. Do you have a suggestion as to how to address this issue?

Thanks! Elise

emmanuelaaaaa commented 4 years ago

Hello,

I have been coming across similar issues, so I was wondering if you managed to resolve this, @emrizzi. For me, it's not just the range that has infinite values but the "expression" values as well (assay(sce, "exprs") from the CATALYST object). I guess it was similar for you? If so, how did you use them in the downstream analysis? FlowSOM clustering doesn't accept Inf values, and I'm pretty sure TSNE/UMAP don't either (if you are unlucky and the subsampling done for TSNE/UMAP includes the cells that have Inf in some markers).
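
A quick way to see how many cells are affected per marker in the CATALYST object is something like this (just a sketch; sce is the SingleCellExperiment and the "exprs" assay has markers as rows):

      library(SummarizedExperiment)

      # number of cells with a non-finite (Inf/NaN) value, per marker
      rowSums(!is.finite(assay(sce, "exprs")))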

Many thanks and best wishes, Emma

SofieVG commented 4 years ago

Dear Emma and Elise,

Not yet a solution, and I should certainly investigate this issue further, but one temporary option could be to use the theoretical maximum that you would expect based on the range in the original file, or e.g. compute the 99.9% quantile for your markers and replace all higher values (including the infinite values) with this value, as a kind of truncation step.
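
Something along these lines could serve as that truncation step on one file (a rough, untested sketch; "Norm_sample.fcs" is just a placeholder for one of the normalised files and 0.999 is only an example cutoff):

      library(flowCore)

      ff  <- read.FCS("Norm_sample.fcs", transformation = FALSE)
      dat <- exprs(ff)  # cells (rows) x channels (cols)

      # per-channel 99.9% quantile, ignoring any infinite values
      caps <- apply(dat, 2, function(x) quantile(x[is.finite(x)], probs = 0.999))

      # replace everything above the cap (including Inf) with the cap itself
      for (i in seq_len(ncol(dat))) {
        dat[which(dat[, i] > caps[i]), i] <- caps[i]
      }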


emmanuelaaaaa commented 4 years ago

Hi Sofie, that is very helpful, thanks! As a side question related to the range of the expression values: I also get some negative values after normalisation, which could also cause problems in the downstream analysis. Do you think I can replace those with 0? Is there any reason you can think of why I shouldn't? Many thanks and best wishes, Emma
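
PS: if zeroing them does turn out to be fine, I suppose it's just a one-liner on the expression matrix, e.g. dat[dat < 0] <- 0 on the matrix from exprs(), but I'd be curious whether there is a reason not to do that.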

tomashhurst commented 4 years ago

@emmanuelaaaaa @emrizzi @SofieVG this is interesting. @ghar1821 and I have occasionally found the odd couple of cells that have been given extremely high values (or extremely high negative values) after alignment, so perhaps the max range recorded there is because you have a couple of cells with extreme values. We had a quick look into it, but aren't sure why they change, partly because it only happens to a very small number of cells. @SofieVG did you figure out any possible causes? @ghar1821 and I just filtered those cells out of our dataset before proceeding with the analysis.

In terms of a workaround solution for the ranges, you could pull the files into R, modify the max values directly (as Sofie suggested, using the 99.9th percentile or something similar), and then re-export them as FCS files. You could run it in a loop over all the samples to save you having to sit there and modify each sample (see the sketch at the end of this comment). Here is a quick bit of code that could probably do it (I've just pulled some bits out of https://github.com/sydneycytometry/CSV-to-FCS, but I haven't tested this in R, so there's a good chance it won't work perfectly as is):

Read the FCS file into R

library('flowCore')

# 'file' here is the name of an FCS file in your working directory

dat <- exprs(read.FCS(file, transformation = FALSE))
dat <- dat[1:nrow(dat), 1:ncol(dat)] # keeps all rows and columns (effectively a copy)

# dat is now a matrix of cells (rows) vs parameters (cols)

Normally you could calculate and save the max and min of each column like this:

      metadata <- data.frame(name=dimnames(dat)[[2]],desc=paste('column',dimnames(dat)[[2]],'from dataset')) # or copy the column metadata from when the FCS file gets read in

      #metadata$range <- apply(apply(dat,2,range),2,diff)
      metadata$maxRange <- apply(dat,2,max) # uses 'apply' to calculate the max of each column of the table
      metadata$minRange <- apply(dat,2,min) # uses 'apply' to calculate the min of each column of the table

But in this case you could replace 'max' and 'min' with something that finds, say, the 99.9th percentile (instead of the max) and the 0.1st percentile (instead of the min). The quantile function should do this:

metadata$maxRange <- apply(dat, 2, quantile, probs = 0.999) # 99.9th percentile of each column
metadata$minRange <- apply(dat, 2, quantile, probs = 0.001) # 0.1st percentile of each column

I don't often calculate metadata$range, but you might need it for Cytobank. It could be calculated as the 99.9th percentile minus the 0.1st percentile, using apply as above.
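
For example (untested, carrying on from the quantile lines above):

      metadata$range <- metadata$maxRange - metadata$minRange # width between the two percentiles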

You could also just use an expected max/min, i.e. 262000 for flow data (or roughly 2x10^4 for CyTOF data) and whatever a typical minimum after compensation is (-1000?).
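
The hard-coded version would just be something like this (the numbers are only the rough examples above, so adjust them for your own panel and instrument):

      metadata$maxRange <- 262000 # expected instrument max for flow data (use e.g. ~20000 for CyTOF)
      metadata$minRange <- -1000  # rough expected minimum after compensation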

Then you can construct a flowFrame and save the FCS file

      library(Biobase) # AnnotatedDataFrame comes from Biobase, which isn't attached by flowCore alone

      dat.ff <- new("flowFrame", exprs = as.matrix(dat), parameters = AnnotatedDataFrame(metadata))
      write.FCS(dat.ff, "Sample.fcs")

It's a bit more fiddling with the files, but it shouldn't be too difficult to set up in a reproducible script. If it helps, tomorrow I can test the above code and re-post a working version here. Important to mention: I've never taken FCS files from R into Cytobank, so I'm not sure if other issues might come up.
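
In the meantime, pulling the pieces above into a loop over all of the normalised files might look roughly like the sketch below (untested; the 'Norm_' file pattern, the 'Truncated_' output prefix and the 0.999/0.001 cutoffs are all just placeholders, and you may want to exclude the Time and scatter channels from the truncation):

      library(flowCore)
      library(Biobase)

      # all normalised FCS files in the working directory (the pattern is just an example)
      files <- list.files(pattern = "^Norm_.*\\.fcs$")

      for (file in files) {
        ff  <- read.FCS(file, transformation = FALSE)
        dat <- exprs(ff)  # cells (rows) x channels (cols)

        # per-channel truncation values, ignoring any infinite entries
        maxs <- apply(dat, 2, function(x) quantile(x[is.finite(x)], probs = 0.999))
        mins <- apply(dat, 2, function(x) quantile(x[is.finite(x)], probs = 0.001))

        # truncate extreme and infinite values channel by channel
        for (i in seq_len(ncol(dat))) {
          dat[which(dat[, i] > maxs[i]), i] <- maxs[i]
          dat[which(dat[, i] < mins[i]), i] <- mins[i]
        }

        # copy the parameter metadata and overwrite the ranges with finite values
        metadata <- pData(parameters(ff))
        metadata$minRange <- mins
        metadata$maxRange <- maxs
        metadata$range    <- maxs - mins

        out <- new("flowFrame", exprs = dat, parameters = AnnotatedDataFrame(metadata),
                   description = description(ff))
        write.FCS(out, paste0("Truncated_", file))
      }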

thinkCara commented 3 years ago

I have just run into this very problem when using my CytoNormed files on Cytobank. Does anyone have a tested R script that fixes the infinite maxRange problem?