taborlab / FlowCal

Python Flow Cytometry Calibration Library
MIT License

Support MEF transformation with linear amplifier #195

Closed castillohair closed 8 years ago

castillohair commented 8 years ago

It turns out that the SEA's flow cytometer can export data in FCS 2.0 in both linear and log scale, and 3.0 in linear scale with floating points. I got all three versions after one of my experiments, including several cell samples and beads. @JS3xton if you're interested in looking at them, I can put them on Dropbox or something.

This issue will be considered resolved when the three datasets result in the same MEF values for each cell sample when run through the Excel UI. There may be some issues in processing integer linear data, so at the very least integer log data and floating-point linear data should work.

I will now show the result of running the Excel UI with the three sets of files, as performed by the current floating-mef branch (43a9dcb5dad54e789e89416e768e838aab01c10e).

FCS2.0 with log amplifier, 10-bit integer

This is the setup that we currently support, so the following results are what we would consider correct.

Beads

Initial density/histogram plot: density_hist_b0001

2D Clustering: clustering_b0001

Populations used for standard curve generation: populations_apc-a_b0001

Standard curve: std_crv_apc-a_b0001

Cell sample

s0024

FCS2.0 with linear amplifier, 10-bit integer

Beads

Initial density/histogram plot: density_hist_b0002

2D Clustering: clustering_b0002

Populations used for standard curve generation: populations_apc-a_b0002

Standard curve: std_crv_apc-a_b0002

Cell sample

s0052

Obviously all of this is super incorrect. Interestingly, clustering seems to work fine.

FCS3.0 with linear amplifier, floating point

Beads

Initial density/histogram plot: density_hist_b0003

2D Clustering: clustering_b0003

Populations used for standard curve generation: populations_apc-a_b0003

Standard curve: std_crv_apc-a_b0003

Cell sample

s0080

Also super incorrect. Note that the failure modes here seem to be similar to the case of FCS2.0 integer with linear amplifier, suggesting that amplification type and not resolution will be the major challenge.

castillohair commented 8 years ago

First, I wanted to see whether data could be processed in a.u. and displayed correctly without the Excel UI. I wrote the following script:

import FlowCal

if __name__ == '__main__':
    d2lin = FlowCal.io.FCSData('FCFiles_2.0_linear/B_subtilis_Tube_024.fcs')
    d2log = FlowCal.io.FCSData('FCFiles_2.0_log/B_subtilis_Tube_024.fcs')
    d3 = FlowCal.io.FCSData('FCFiles_3.0/B_subtilis_Tube_024.fcs')
    data = [d2lin, d2log, d3]

    # Print channel names, ranges, amplification types
    for d, name in zip(data, ['2.0, lin', '2.0, log', '3.0']):
        print "\nFor file version {}:".format(name)
        for ch, r, a in zip(d.channels, d.range(), d.amplification_type()):
            print "Channel {:13}, range: {:15}, a.t.: {}, min: {:14}, max: {:14}".\
                format(ch, r, a, min(d[:,ch]), max(d[:,ch]))
    print ""

    # Transform to rfi
    print "Transforming to rfi..."
    data_rfi = []
    for d in data:
        data_rfi.append(FlowCal.transform.to_rfi(d))

    # Density gating
    print "Performing density gating..."
    data_gated = []
    gate_contour = []
    for d in data_rfi:
        g, __, c = FlowCal.gate.density2d(
            d,
            channels=['FSC-A', 'SSC-A'],
            gate_fraction=0.5,
            xscale='log',
            yscale='log',
            full_output=True)
        data_gated.append(g)
        gate_contour.append(c)

    # Plot FSC/SSC/APC-A
    print "Generating plots..."
    for d, dg, gc, name in \
            zip(data_rfi, data_gated, gate_contour, ['2_lin', '2_log', '3']):
        FlowCal.plot.density_and_hist(
            d,
            gated_data=dg,
            gate_contour=gc,
            density_channels=['FSC-A', 'SSC-A'],
            density_params={'xscale': 'log', 'yscale': 'log', 'mode': 'scatter'},
            hist_channels=['APC-A'],
            hist_params={'xscale': 'log', 'bins': 128},
            savefig='density_hist_{}'.format(name))

The resulting terminal output is:

For file version 2.0, lin:
Channel FSC-A        , range: [0.0, 1023.0]  , a.t.: (0.0, 0.0), min:              0, max:           1023
Channel SSC-A        , range: [0.0, 1023.0]  , a.t.: (0.0, 0.0), min:              2, max:            607
Channel FITC-A       , range: [0.0, 1023.0]  , a.t.: (0.0, 0.0), min:              0, max:              8
Channel PE-A         , range: [0.0, 1023.0]  , a.t.: (0.0, 0.0), min:              0, max:              4
Channel PerCP-Cy5-5-A, range: [0.0, 1023.0]  , a.t.: (0.0, 0.0), min:              0, max:              8
Channel PE-Cy7-A     , range: [0.0, 1023.0]  , a.t.: (0.0, 0.0), min:              0, max:             10
Channel APC-A        , range: [0.0, 1023.0]  , a.t.: (0.0, 0.0), min:              0, max:           1023
Channel APC-Cy7-A    , range: [0.0, 1023.0]  , a.t.: (0.0, 0.0), min:              0, max:             24
Channel Time         , range: [0.0, 1023.0]  , a.t.: (0.0, 0.0), min:              0, max:              7

For file version 2.0, log:
Channel FSC-A        , range: [0.0, 1023.0]  , a.t.: (4.0, 1.0), min:              0, max:           1023
Channel SSC-A        , range: [0.0, 1023.0]  , a.t.: (4.0, 1.0), min:            358, max:            965
Channel FITC-A       , range: [0.0, 1023.0]  , a.t.: (4.0, 1.0), min:              0, max:            487
Channel PE-A         , range: [0.0, 1023.0]  , a.t.: (4.0, 1.0), min:              0, max:            417
Channel PerCP-Cy5-5-A, range: [0.0, 1023.0]  , a.t.: (4.0, 1.0), min:              0, max:            495
Channel PE-Cy7-A     , range: [0.0, 1023.0]  , a.t.: (4.0, 1.0), min:              0, max:            516
Channel APC-A        , range: [0.0, 1023.0]  , a.t.: (4.0, 1.0), min:              0, max:           1023
Channel APC-Cy7-A    , range: [0.0, 1023.0]  , a.t.: (4.0, 1.0), min:              0, max:            608
Channel Time         , range: [0.0, 1023.0]  , a.t.: (0.0, 0.0), min:              0, max:              7

For file version 3.0:
Channel FSC-A        , range: [0.0, 262143.0], a.t.: (0.0, 0.0), min: -2440.37988281, max:       262143.0
Channel SSC-A        , range: [0.0, 262143.0], a.t.: (0.0, 0.0), min:  658.469970703, max:   155479.28125
Channel FITC-A       , range: [0.0, 262143.0], a.t.: (0.0, 0.0), min: -114.209999084, max:  2105.12988281
Channel PE-A         , range: [0.0, 262143.0], a.t.: (0.0, 0.0), min: -232.649993896, max:  1119.53991699
Channel PerCP-Cy5-5-A, range: [0.0, 262143.0], a.t.: (0.0, 0.0), min: -119.849998474, max:  2263.05004883
Channel PE-Cy7-A     , range: [0.0, 262143.0], a.t.: (0.0, 0.0), min: -573.869995117, max:  2739.62988281
Channel APC-A        , range: [0.0, 262143.0], a.t.: (0.0, 0.0), min: -455.129974365, max:       262143.0
Channel APC-Cy7-A    , range: [0.0, 262143.0], a.t.: (0.0, 0.0), min: -112.319992065, max:  6243.11962891
Channel Time         , range: [0.0, 262143.0], a.t.: (0.0, 0.0), min:            0.0, max:  2018.59997559

Transforming to rfi...
Performing density gating...
Generating plots...

Note that the FCS3.0 floating-point files have negative numbers, as expected from the standard.

The following plots are produced:

density_hist_2_lin.png: density_hist_2_lin

density_hist_2_log.png: density_hist_2_log

density_hist_3.png: density_hist_3

Obviously, the numerical values are different from one another (arbitrary units). It really seems like the integer linear file is trying to imitate what the floating-point linear file is doing, but chopping off a little more than the lowest 2 decades of information (since the lowest values for linear integer are zero and one). This results in a weird FSC/SSC diagram for integer linear. Nevertheless, density gating seems to perform well in all cases.
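A back-of-the-envelope check of that decade count (my own arithmetic, assuming the integer and floating-point acquisitions share the same full scale):

```python
import numpy as np

full_scale_float = 262143.0  # range of the FCS3.0 files, from the output above
full_scale_int = 1023.0      # range of the 10-bit FCS2.0 files
au_per_count = full_scale_float / full_scale_int  # ~256 a.u. per integer count
print(np.log10(au_per_count))  # ~2.4 decades collapse into integer counts 0 and 1
```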

This shows that with the current develop version, a user could work with FlowCal's Python API in a.u. without issues. Logicle plots may be desirable, though. Besides that, the only remaining challenge for working with floating-point linear data seems to be MEFing.

castillohair commented 8 years ago

I want to point out a few oddities about FCS file formats. I've seen in several places that FCS 2.0 files are associated with integer representation, and FCS 3.0 with integer or floating point, even though the standard allows for floating point in FCS2.0 as well. For example, this is what we have found with the file repository (#203; notice that no FCS2.0 files have $DATATYPE='F') and the acquisition software of the SEA's flow cytometer (FCS2.0 is saved as integer, 3.0 as floating point).

Another interesting way to refer to these file versions is "analog" (FCS 2.0) and "digital" (FCS 3.0) (http://docs.flowjo.com/vx/graphs-and-gating/gw-transform-overview/). More interestingly, the following is said in the article that initially described the logicle transformation (http://onlinelibrary.wiley.com/doi/10.1002/cyto.a.20258/abstract) (emphasis mine):

...Data sets resulting from computed compensation commonly (and properly) include populations whose distributions extend below zero. (When analog compensation is used, such distributions should also appear, but the electronic implementations distort or truncate the distributions, so that negative values are suppressed).

What I get from this is that integer data saturates at both ends of its range, whereas floating-point data does not (remember that the standard does not require event values to be within 0 and $PnR).

Why is this relevant to us? Saturation leads to incorrect values, and that is why we use gate.high_low. However, this may not be necessary with floating point data. The Excel UI will probably need to distinguish between these two cases, and use gate.high_low only if necessary.
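A minimal sketch of how the Excel UI might make that distinction (my own sketch, not actual FlowCal code; I'm assuming that checking FCSData's data_type attribute, which reports $DATATYPE, is a reasonable way to do this):

```python
import FlowCal

def gate_saturated_if_needed(data, channels):
    # Integer data saturates by piling events at the edges of the
    # detector range, so removing edge events with gate.high_low is
    # appropriate. Floating-point data is not clamped to [0, $PnR],
    # so high/low gating could discard legitimate events.
    if data.data_type == 'I':  # assumption: $DATATYPE as stored by FCSData
        return FlowCal.gate.high_low(data, channels=channels)
    return data
```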

castillohair commented 8 years ago

I think up to commit 4bd6c1e8e688bf052e2cd59e4cf4675bec0144b3 we're most of the way towards solving the issue. I will now show the current results:

MEFing with the Python API

The following was obtained using the Python API (no Excel UI). The beads files shown above were used to obtain transformation functions.

Standard curves

integer, linear

std_crv_apc-a_2_lin

integer, log

std_crv_apc-a_2_log

float

std_crv_apc-a_3

Standard curve parameters

The bead model equation is the following:

m*log(fl_au[i]) + b = log(fl_mef_auto + fl_mef[i])
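For reference, a minimal sketch of fitting this three-parameter model by least squares (illustrative only, not FlowCal.mef's actual fitting code; I use natural log here, which only rescales b relative to base 10):

```python
import numpy as np
from scipy.optimize import minimize

def sum_sq_residuals(params, fl_au, fl_mef):
    # params = (m, b, fl_mef_auto), per the bead model above
    m, b, fl_mef_auto = params
    return np.sum((m * np.log(fl_au) + b
                   - np.log(fl_mef_auto + fl_mef))**2)

# Usage with hypothetical bead peak positions (a.u.) and their MEF values:
# res = minimize(sum_sq_residuals, x0=[1.0, 0.0, 100.0],
#                args=(fl_au_peaks, fl_mef_values))
# m, b, fl_mef_auto = res.x
```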

The following parameters were obtained from the three beads files:

| beads file | m | b | fl_mef_auto |
| --- | --- | --- | --- |
| Integer, linear | 1.01 | 2.55 | 104.75 |
| Integer, log | 1.01 | 0.26 | 112.21 |
| Floating point | 1.01 | -3.05 | 111.88 |

It is interesting to note that the only parameter that changes significantly is the intercept b. fl_mef_auto is more similar between the integer log and the floating-point dataset. A slightly different standard curve is to be expected for the case of integer linear because the data is not that good, particularly for low fluorescence values. Note also in the plots above that the number of logs in the x axis is almost exactly the same as the number of logs that the standard curve covers in the y axis, which is a reflection of the slope parameter being so close to 1.

MEF'd data

Here I picked a sample with relatively high fluorescence. There are still problems with low fluorescence samples, as I will detail in a later post.

integer, linear

density_hist_mef_2_lin

integer, log

density_hist_mef_2_log

float

density_hist_mef_3

Statistical analysis

All histograms above seem to have more or less the same MEF fluorescence value. Let's corroborate that by looking at the numbers:

| dataset | median APC-A |
| --- | --- |
| Integer, linear | 455.36 |
| Integer, log | 465.81 |
| Floating point | 467.35 |

The values are very similar to each other (CV = 1.15%), although the integer linear dataset differs more from the other two (2.27% difference between integer linear and integer log) than those two differ from each other (only 0.33%).
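The CV quoted above is easy to reproduce (my own arithmetic, using the population standard deviation):

```python
import numpy as np

medians = np.array([455.36, 465.81, 467.35])  # medians from the table above
print(100 * medians.std() / medians.mean())   # ~1.15 (%)
```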

It is nice to see that the linear integer case is performing relatively well, even though it is such a weird dataset. I'm not too concerned with bringing it to the same performance as the other two, but I'm happy to hear suggestions.

castillohair commented 8 years ago

The Excel UI has also been modified to work with floating-point files. The most important modifications are:

Using the same commit, the results of running the example files with the Excel UI are pretty much the same, except for density gating of the beads:

density_hist_b0002

Because everything is being processed in log scale, the FSC/SSC density diagram of beads looks different. However, elimination of the secondary peaks in the fluorescence channels is achieved anyway. I would argue that this is the most correct way to do gating, in terms of being consistent with everything else.

The identified peak values change a little bit, though, because the gate is now different. This is a comparison of the bead model parameters for B0002 FL1 with 1.0.0 and the current commit:

| Version | m | b | fl_mef_auto |
| --- | --- | --- | --- |
| 1.0.0 | 1.0571538506455078 | 2.43684936642 | 1180.28240707 |
| 4bd6c1e8e688bf052e2cd59e4cf4675bec0144b3 | 1.05720955531 | 2.43642936538 | 1180.03466155 |
| Error (%) | 0.0053 | 0.017 | 0.021 |

(Note that m from the model in 1.0.0 was converted to m from this model by using 256*m/log(10).)

This changes the MEF values of the corresponding samples, although negligibly. The following shows the percent difference in the FL1 statistics of the samples that use B0002, between the method in 1.0.0 and the current commit:

image

In summary, the only change that affects how we run our samples is the fact that beads are gated in log scale, which does not seem to affect the final MEF fluorescence of samples.

castillohair commented 8 years ago

The following are some of the results of running the Excel UI against the SEA dataset.

Beads

Beads plots produced by the Excel UI are almost identical to the plots presented above using the Python API. The parameters obtained in one run are the following:

| dataset | m | b | fl_mef_auto |
| --- | --- | --- | --- |
| Integer, linear | 1.01912269529 | 2.52996967802 | 104.313744082 |
| Integer, log | 1.01167477702 | 0.264344458589 | 112.16949497 |
| Floating point | 1.01315711837 | -3.05889290482 | 111.522298167 |

Not very different from the numbers obtained with the Python API, even though the sequence of gating steps and the gating fraction were not identical.

Cell data with high fluorescence

Plots are also very similar to the ones presented above. Median fluorescence is as follows:

| dataset | median APC-A |
| --- | --- |
| Integer, linear | 456.579903810335 |
| Integer, log | 469.519100415878 |
| Floating point | 466.794512171886 |

Cell data with low fluorescence

It turns out that samples with low fluorescence are saturating to the left, so the data itself is not very useful. However, it shows some of the intricacies of floating-point data. Here are the median APC values:

| dataset | median APC-A |
| --- | --- |
| Integer, linear | 12.5531254952804 |
| Integer, log | 16.3462793998462 |
| Floating point | 0.624765572644506 |

Why they are so different can be understood by looking at the plots below.

Integer, linear

density_hist_mef_2lin_low

Linear integer data is probably not very reliable given the resolution, so not much can be said besides that the data sucks.

Integer, log

density_hist_mef_2log_low

A lot of saturated values are present. If these are eliminated (as the Excel UI would actually do, not shown here), almost half of the population is lost. The remaining events have larger values than the population as a whole, and therefore the median fluorescence increases.

Floating-point

density_hist_mef_3_low

Just by looking at the plot in the middle, one would be tempted to say that the median fluorescence is 10-20 MEF. However, the linear plot below shows that the population is actually symmetrical around zero, and there is another peak to the left (perhaps that's how floating-point data saturates?). Nevertheless, the log plot is misleading in presenting the central tendency of the population, which seems closer to zero in the linear plot. A logicle plot would improve visualization of floating-point data.

The problem with geometric statistics

Geometric statistics are, strictly speaking, not defined if negative elements exist. Since some events with negative or zero value exist in all samples, the Excel UI is currently not calculating geometric means for floating-point data. This is not explicit, but an effect of scipy.stats.gmean returning nan and pandas writing nan as empty cells in the output file. This is probably not bad for something like the file shown above, but it is problematic for samples like this:

density_hist_mef_3_high

There are 2 events less than zero, but the whole calculation of the geometric mean is invalidated. I think the best compromise is to calculate geometric statistics on events greater than zero, and throw a warning stating the percentage of events that are less than or equal to zero. That way, in combination with the histogram, the user can decide whether these statistics are meaningful or not.
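A sketch of that compromise (my own illustration; the function name is hypothetical):

```python
import warnings
import numpy as np
from scipy.stats import gmean

def gmean_positive(values):
    # Compute geometric statistics over strictly positive events only,
    # and warn with the fraction of events that had to be excluded.
    values = np.asarray(values)
    frac_excluded = np.mean(values <= 0)
    if frac_excluded > 0:
        warnings.warn("{:.2f}% of events are <= 0 and were excluded from "
                      "geometric statistics".format(100 * frac_excluded))
    return gmean(values[values > 0])
```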

castillohair commented 8 years ago

Summary

Progress so far

Important changes to the MEF module

Important changes to the Excel UI

To do

@JS3xton if you have any feedback, now is the time.

castillohair commented 8 years ago

Logicle plotting is now implemented. It required additional classes in the plot module (for custom scaling in matplotlib). There was also a small addition to FCSData.hist_bins to handle the new scaling.
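To give an idea of what such a class involves, here is an illustrative sketch only (not FlowCal's actual implementation); it registers a custom matplotlib scale based on arcsinh, which is linear near zero and logarithmic far from it, in the same spirit as logicle:

```python
import numpy as np
from matplotlib import scale as mscale
from matplotlib import transforms as mtransforms
from matplotlib.ticker import AutoLocator, ScalarFormatter

class _AsinhTransform(mtransforms.Transform):
    input_dims = 1
    output_dims = 1

    def __init__(self, linthresh):
        mtransforms.Transform.__init__(self)
        self.linthresh = linthresh  # width of the quasi-linear region

    def transform_non_affine(self, values):
        return np.arcsinh(np.asarray(values) / self.linthresh)

    def inverted(self):
        return _InvertedAsinhTransform(self.linthresh)

class _InvertedAsinhTransform(mtransforms.Transform):
    input_dims = 1
    output_dims = 1

    def __init__(self, linthresh):
        mtransforms.Transform.__init__(self)
        self.linthresh = linthresh

    def transform_non_affine(self, values):
        return np.sinh(np.asarray(values)) * self.linthresh

    def inverted(self):
        return _AsinhTransform(self.linthresh)

class AsinhDemoScale(mscale.ScaleBase):
    # Registering this class is what makes xscale='asinh_demo' available
    # in matplotlib plotting calls.
    name = 'asinh_demo'

    def __init__(self, axis, linthresh=150.0, **kwargs):
        self.linthresh = linthresh

    def get_transform(self):
        return _AsinhTransform(self.linthresh)

    def set_default_locators_and_formatters(self, axis):
        axis.set_major_locator(AutoLocator())
        axis.set_major_formatter(ScalarFormatter())

mscale.register_scale(AsinhDemoScale)
# After registration, e.g.: ax.set_xscale('asinh_demo', linthresh=100.0)
```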

The following was run against f99ccc0e940377cf1902e361ac23b484024cfde7:

import FlowCal

d1 = FlowCal.io.FCSData('FCFiles_3.0/B_subtilis_Tube_001.fcs')
d2 = FlowCal.io.FCSData('FCFiles_3.0/B_subtilis_Tube_008.fcs')

FlowCal.plot.density2d(
        d1,
        channels=['PE-A', 'APC-A'],
        xscale='logicle',
        yscale='logicle',
        mode='scatter',
        savefig='logicle/fl_logicle.png')

fl_logicle

FlowCal.plot.scatter2d(
        [d1, d2],
        channels=['FITC-A', 'APC-A'],
        xscale='logicle',
        yscale='logicle',
        savefig='logicle/scatter_logicle.png')

scatter_logicle

FlowCal.plot.scatter3d_and_projections(
        [d1, d2],
        channels=['FITC-A', 'PE-A', 'APC-A'],
        xscale='logicle',
        yscale='logicle',
        zscale='logicle',
        savefig='logicle/scatter3d_and_projections_logicle.png')

scatter3d_and_projections_logicle

FlowCal.plot.hist1d(
        [d1, d2],
        channel='APC-A',
        xscale='logicle',
        savefig='logicle/hist_logicle_2.png')

hist_logicle_2

I will now integrate this into the Excel UI. As per my previous discussion with @JS3xton, I'll try to make everything default to logicle instead of log, and see how the plots with our old files look.

JS3xton commented 8 years ago

(to be continued)

castillohair commented 8 years ago
JS3xton commented 8 years ago

  • The new binning has introduced some variation with respect to our previous results. However, the variation introduced is small, and I don't think this is a problem. If we want to address this (which, again, I think we don't have to), using a smaller fixed value of sigma just for beads might be enough. I tried changing it to 5, but didn't see significant changes; 2 definitely made it terrible.

mmk. that's enough reason for me not to worry about it.

(I'm on to reviewing code in #213)

castillohair commented 8 years ago
  • regarding mode=mesh vs. mode=scatter, might be nice if that was done automatically, because mode=mesh definitely makes a lot more sense for that particular use case. This is a very small issue, though (and still achievable via the Python interface), and I don't have any great ideas for doing it automatically.

Yeah, I don't have good ideas on how to do this either.

  • regarding the model: I still feel dissatisfied regarding m.
    • Are you sure the minor deviation from 1.0 is not just an instance of overfitting?

I wouldn't call this overfitting. Overfitting occurs when the model is too complex for the data, such that slightly different data would give very different parameter fits (i.e. high variance). In this case, the fact that I'm consistently obtaining values slightly above 1 tells me that m is greater than one.

  • If we had mapped to a.u. from the beginning (or fallen into the linear amp + int data or float data use cases), we would have started off mapping a.u. -> MEF (not log(a.u.) --> MEF). I feel like that would have been a linear transformation (no intercept). Bead autofluorescence might have necessitated an affine transformation again (added y-intercept term). Introduction of the m term (raising the a.u. value to a power) seems very unintuitive to me, though. That might make sense after looking at a.u. vs. MEF on log-log plots (which is only done because we think a logarithmic error model is more appropriate based on the specific beads we chose) and then noticing a deviation (which would be very small?) from a slope of 1. All of this being said, I haven't looked at a standard curve where m was hardcoded to 1; perhaps there really is a noticeable deviation from a slope of 1.

Note that m in the previous model could also be derived theoretically from the amplifier characteristics (number of decades, bits, etc.). In fact, Evan and I did that once, and found that the fitted value also consistently differed from the theoretical one. So this model is not more complex than the previous one.

Why is m different than one? I'm not sure. I vaguely attribute this to "detector nonlinearity" but I don't have a great explanation.

  • Also, if there are cytometers which employ log amplifiers and don't properly specify how to map back to RFI, then it might be nice to have the m term to fall back on. Not sure if that happens frequently, though.

The standard explicitly requires that log channels include enough information to decode them properly.
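For reference, this is the decoding the standard mandates through the $PnE keyword (a sketch of the transformation; FlowCal.transform.to_rfi performs this kind of decoding using the amplification type shown earlier):

```python
import numpy as np

def decode_log_channel(x, decades, f2, resolution):
    # $PnE = (decades, f2); a channel value x on a log amplifier maps to
    # rfi = f2 * 10**(decades * x / resolution)
    return f2 * 10.0**(decades * np.asarray(x, dtype=float) / resolution)

# For the '2.0, log' files above, a.t. = (4.0, 1.0) over 1024 channels:
# decode_log_channel(1023, decades=4.0, f2=1.0, resolution=1024.0)  # ~9911 a.u.
```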

  • 👍 for warning on geometric statistics if values exist <= 0. Is that thrown in both the terminal output and the "Analysis" column?

Yes.

  • can I combine scales? e.g. xscale=logicle and yscale=log? I would assume so, but haven't actually tried. All the downstream bins calculations and plotting works as expected?

Yep.

data_001.fcs

castillohair commented 8 years ago

#213 solves this.