saeyslab / CytoNorm

R library to normalize cytometry data

FlowSOM fails on newly normalized FCS files #20

Open EAC-T opened 3 years ago

EAC-T commented 3 years ago

Hi everyone,

After I normalize the files, I upload them to Cytobank to do FlowSOM clustering, but it keeps failing. Any idea why? Also, the FCS file size of the normalized files is 3x smaller than before normalization; is there a reason for that?

Thank you a lot

SamGG commented 3 years ago

Hi,

- Failure: try to load the files with another software, e.g. FlowJo or OMIQ, to see whether the error is reproducible, and check compliance with http://bioinformin.cesnet.cz/flowIO/.
- Size: if the files before normalization come directly from a CyTOF, such FCS files are known to be inflated by at least a factor of 2. If not, no idea.

Best.

EAC-T commented 3 years ago

Hi @SamGG, I did check my FCS files as you suggested. I think one potential reason is that I have a few values with very big numbers, something like 1.07e+30. Does that indicate that the normalization failed? Have you encountered such a problem before? Thank you a lot

SamGG commented 3 years ago

I haven't encountered such a problem. Sofie will probably answer you soon.

SofieVG commented 3 years ago

Hi,

This is indeed an issue I have encountered once in a while. Typically it is caused by the training data not being fully representative of the data you are trying to normalize, so that the splines are forced to extrapolate.

One option to minimize this effect is the limits parameter, which lets you pass values that are added to the spline as identity points. If you place them at the expected borders of the range (for transformed CyTOF data, typically around 0 and 8), the spline is encouraged to stay close to the identity function outside that range.

Alternatively, it is a good idea to double-check the FlowSOM model and see whether the clusters make sense. You may be overclustering the data, producing small clusters without enough cells to estimate the spline reliably. Making some figures of the splines (e.g. using the plot = TRUE parameter) can help pinpoint the exact issue.
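As a rough sketch of these two suggestions (file paths, batch labels, and channel names below are placeholders you must adapt; the exact argument name for the identity points — `limit` vs. `limits` — may depend on your CytoNorm version, so check `?CytoNorm.train`):

```r
library(CytoNorm)
library(flowCore)

# Placeholder training files and per-file batch labels -- adapt to your data.
train_files  <- list.files("Data/Train", pattern = "\\.fcs$", full.names = TRUE)
batch_labels <- c("batch1", "batch1", "batch2", "batch2")

# Hypothetical channel names; cytofTransform is the arcsinh transform
# shipped with CytoNorm (cofactor 5), so values land roughly in [0, 8].
channels      <- c("Er168Di", "Yb172Di")
transformList <- flowCore::transformList(channels, cytofTransform)

model <- CytoNorm.train(
  files         = train_files,
  labels        = batch_labels,
  channels      = channels,
  transformList = transformList,
  FlowSOM.params = list(nCells = 6000, xdim = 5, ydim = 5,
                        nClus = 10, scale = FALSE),
  normParams = list(nQ = 101,
                    limit = c(0, 8)),  # identity anchor points at the
                                       # expected borders of the range
  seed = 1,
  plot = TRUE                          # draw the fitted splines so you can
                                       # spot clusters that extrapolate wildly
)
```

If the plotted splines shoot off outside the anchored range for some clusters, those clusters likely had too few cells in the training data.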

All the best, Sofie

tomashhurst commented 3 years ago

@EAC-T you can also try using fewer metaclusters to generate a slightly more robust model -- this can often help prevent those extremely high values from appearing in the resulting data.