First, I wanted to see whether data could be processed in a.u., and displayed correctly without the Excel UI. I wrote the following script:
```python
import FlowCal

if __name__ == '__main__':
    d2lin = FlowCal.io.FCSData('FCFiles_2.0_linear/B_subtilis_Tube_024.fcs')
    d2log = FlowCal.io.FCSData('FCFiles_2.0_log/B_subtilis_Tube_024.fcs')
    d3 = FlowCal.io.FCSData('FCFiles_3.0/B_subtilis_Tube_024.fcs')
    data = [d2lin, d2log, d3]

    # Print channel names, ranges, amplification types
    for d, name in zip(data, ['2.0, lin', '2.0, log', '3.0']):
        print "\nFor file version {}:".format(name)
        for ch, r, a in zip(d.channels, d.range(), d.amplification_type()):
            print "Channel {:13}, range: {:15}, a.t.: {}, min: {:14}, max: {:14}".\
                format(ch, r, a, min(d[:,ch]), max(d[:,ch]))
        print ""

    # Transform to rfi
    print "Transforming to rfi..."
    data_rfi = []
    for d in data:
        data_rfi.append(FlowCal.transform.to_rfi(d))

    # Density gating
    print "Performing density gating..."
    data_gated = []
    gate_contour = []
    for d in data_rfi:
        g, __, c = FlowCal.gate.density2d(
            d,
            channels=['FSC-A', 'SSC-A'],
            gate_fraction=0.5,
            xscale='log',
            yscale='log',
            full_output=True)
        data_gated.append(g)
        gate_contour.append(c)

    # Plot FSC/SSC/APC-A
    print "Generating plots..."
    for d, dg, gc, name in \
            zip(data_rfi, data_gated, gate_contour, ['2_lin', '2_log', '3']):
        FlowCal.plot.density_and_hist(
            d,
            gated_data=dg,
            gate_contour=gc,
            density_channels=['FSC-A', 'SSC-A'],
            density_params={'xscale': 'log', 'yscale': 'log', 'mode': 'scatter'},
            hist_channels=['APC-A'],
            hist_params={'xscale': 'log', 'bins': 128},
            savefig='density_hist_{}'.format(name))
```
The resulting terminal output is:
```
For file version 2.0, lin:
Channel FSC-A , range: [0.0, 1023.0] , a.t.: (0.0, 0.0), min: 0, max: 1023
Channel SSC-A , range: [0.0, 1023.0] , a.t.: (0.0, 0.0), min: 2, max: 607
Channel FITC-A , range: [0.0, 1023.0] , a.t.: (0.0, 0.0), min: 0, max: 8
Channel PE-A , range: [0.0, 1023.0] , a.t.: (0.0, 0.0), min: 0, max: 4
Channel PerCP-Cy5-5-A, range: [0.0, 1023.0] , a.t.: (0.0, 0.0), min: 0, max: 8
Channel PE-Cy7-A , range: [0.0, 1023.0] , a.t.: (0.0, 0.0), min: 0, max: 10
Channel APC-A , range: [0.0, 1023.0] , a.t.: (0.0, 0.0), min: 0, max: 1023
Channel APC-Cy7-A , range: [0.0, 1023.0] , a.t.: (0.0, 0.0), min: 0, max: 24
Channel Time , range: [0.0, 1023.0] , a.t.: (0.0, 0.0), min: 0, max: 7

For file version 2.0, log:
Channel FSC-A , range: [0.0, 1023.0] , a.t.: (4.0, 1.0), min: 0, max: 1023
Channel SSC-A , range: [0.0, 1023.0] , a.t.: (4.0, 1.0), min: 358, max: 965
Channel FITC-A , range: [0.0, 1023.0] , a.t.: (4.0, 1.0), min: 0, max: 487
Channel PE-A , range: [0.0, 1023.0] , a.t.: (4.0, 1.0), min: 0, max: 417
Channel PerCP-Cy5-5-A, range: [0.0, 1023.0] , a.t.: (4.0, 1.0), min: 0, max: 495
Channel PE-Cy7-A , range: [0.0, 1023.0] , a.t.: (4.0, 1.0), min: 0, max: 516
Channel APC-A , range: [0.0, 1023.0] , a.t.: (4.0, 1.0), min: 0, max: 1023
Channel APC-Cy7-A , range: [0.0, 1023.0] , a.t.: (4.0, 1.0), min: 0, max: 608
Channel Time , range: [0.0, 1023.0] , a.t.: (0.0, 0.0), min: 0, max: 7

For file version 3.0:
Channel FSC-A , range: [0.0, 262143.0], a.t.: (0.0, 0.0), min: -2440.37988281, max: 262143.0
Channel SSC-A , range: [0.0, 262143.0], a.t.: (0.0, 0.0), min: 658.469970703, max: 155479.28125
Channel FITC-A , range: [0.0, 262143.0], a.t.: (0.0, 0.0), min: -114.209999084, max: 2105.12988281
Channel PE-A , range: [0.0, 262143.0], a.t.: (0.0, 0.0), min: -232.649993896, max: 1119.53991699
Channel PerCP-Cy5-5-A, range: [0.0, 262143.0], a.t.: (0.0, 0.0), min: -119.849998474, max: 2263.05004883
Channel PE-Cy7-A , range: [0.0, 262143.0], a.t.: (0.0, 0.0), min: -573.869995117, max: 2739.62988281
Channel APC-A , range: [0.0, 262143.0], a.t.: (0.0, 0.0), min: -455.129974365, max: 262143.0
Channel APC-Cy7-A , range: [0.0, 262143.0], a.t.: (0.0, 0.0), min: -112.319992065, max: 6243.11962891
Channel Time , range: [0.0, 262143.0], a.t.: (0.0, 0.0), min: 0.0, max: 2018.59997559
Transforming to rfi...
Performing density gating...
Generating plots...
```
Note that the FCS3.0 floating-point files have negative numbers, as expected from the standard.
The following plots are produced:
- density_hist_2_lin.png
- density_hist_2_log.png
- density_hist_3.png
Obviously, the numerical values differ from one another (arbitrary units). It really seems like the integer linear file is trying to imitate what the floating-point linear file is doing, but chopping off a little more than the lowest two decades of information (since the lowest values for linear integer are zero and one). This results in a weird FSC/SSC diagram for integer linear. Nevertheless, density gating seems to perform well in all cases.
This shows that, with the current develop version, a user could work with FlowCal's Python API in a.u. without issues. Logicle plots may be desirable, though. Besides that, the only remaining challenge for working with floating-point linear data seems to be MEFing.
I want to point out a few oddities about FCS file formats. I've seen in several places that FCS 2.0 files are associated with integer representation, and FCS 3.0 with integer or floating point, even though the standard allows for floating point in FCS 2.0 as well. For example, this is what we have found with the file repository (#203; notice that no FCS 2.0 files have $DATATYPE='F') and with the acquisition software of the SEA's flow cytometer (FCS 2.0 is saved as integer, 3.0 as floating point).
Another interesting way to refer to these file versions is "analog" (FCS 2.0) and "digital" (FCS 3.0) (http://docs.flowjo.com/vx/graphs-and-gating/gw-transform-overview/). More interestingly, the following is said in the article that initially described the logicle transformation (http://onlinelibrary.wiley.com/doi/10.1002/cyto.a.20258/abstract) (emphasis mine):
> ...Data sets resulting from computed compensation commonly (and properly) include populations whose distributions extend below zero. (When analog compensation is used, such distributions should also appear, but the electronic implementations distort or truncate the distributions, so that negative values are suppressed).
What I get from this is that integer data saturates at both ends of its range, whereas floating-point data does not (remember that the standard does not require event values to be within 0 and `$PnR`).
Why is this relevant to us? Saturation leads to incorrect values, and that is why we use `gate.high_low`. However, this may not be necessary with floating-point data. The Excel UI will probably need to distinguish between these two cases, and use `gate.high_low` only if necessary.
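As a rough sketch of what that check could look like (the attribute access here is an assumption for illustration, not FlowCal's committed API):

```python
import FlowCal

def gate_saturated_if_integer(sample, channels):
    # Sketch only: integer data ($DATATYPE 'I') saturates at the ends of its
    # range, so only then do we remove the lowest/highest values. Assumes the
    # $DATATYPE keyword is reachable through the FCS TEXT segment dictionary.
    datatype = sample.text.get('$DATATYPE', 'I')
    if datatype == 'I':
        return FlowCal.gate.high_low(sample, channels=channels)
    # Floating-point data ('F' or 'D') is left untouched.
    return sample
```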
I think up to commit 4bd6c1e8e688bf052e2cd59e4cf4675bec0144b3 we're most of the way towards solving the issue. I will now show the current results:
The following was obtained using the Python API (no Excel UI). The beads files shown above were used to obtain transformation functions.
The bead model equation is the following:

`m*log(fl_au[i]) + b = log(fl_mef_auto + fl_mef[i])`
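For reference, a model of this form can be fit in log space roughly as follows (a sketch only, with made-up bead values and assuming base-10 logarithms; this is not the actual FlowCal fitting code):

```python
import numpy as np
from scipy.optimize import least_squares

# Made-up bead peak positions (a.u.) and manufacturer-assigned MEF values.
fl_au = np.array([3.0e1, 1.4e2, 6.5e2, 2.9e3, 1.3e4, 5.6e4])
fl_mef = np.array([1.5e2, 7.5e2, 3.5e3, 1.6e4, 7.1e4, 3.1e5])

def residuals(params, fl_au, fl_mef):
    # Residuals of m*log10(fl_au) + b = log10(fl_mef_auto + fl_mef),
    # computed in log space.
    m, b, fl_mef_auto = params
    return m * np.log10(fl_au) + b - np.log10(fl_mef_auto + fl_mef)

# Least-squares fit; constrain the autofluorescence term to be non-negative.
fit = least_squares(residuals,
                    x0=[1., 0., 100.],
                    bounds=([-np.inf, -np.inf, 0.], np.inf),
                    args=(fl_au, fl_mef))
m, b, fl_mef_auto = fit.x

def std_curve(x):
    # Resulting standard curve: a.u. -> MEF.
    return 10.**(m * np.log10(x) + b) - fl_mef_auto
```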
The following parameters were obtained from the three beads files:
| beads file | m | b | fl_mef_auto |
|---|---|---|---|
| Integer, linear | 1.01 | 2.55 | 104.75 |
| Integer, log | 1.01 | 0.26 | 112.21 |
| Floating point | 1.01 | -3.05 | 111.88 |
It is interesting to note that the only parameter that changes significantly is the intercept `b`. `fl_mef_auto` is more similar between the integer log and the floating-point datasets. A slightly different standard curve is to be expected for the integer linear case because the data is not that good, particularly at low fluorescence values. Note also in the plots above that the number of logs on the x axis is almost exactly the same as the number of logs that the standard curve covers on the y axis, which is a reflection of the slope parameter being so close to 1.
Here I picked a sample with relatively high fluorescence. There are still problems with low fluorescence samples, as I will detail in a later post.
All histograms above seem to have more or less the same MEF fluorescence value. Let's corroborate that by looking at the numbers:
| dataset | median APC-A |
|---|---|
| Integer, linear | 455.36 |
| Integer, log | 465.81 |
| Floating point | 467.35 |
The values are very similar to each other (CV = 1.15%), although the integer linear dataset differs more from the other two (2.27% difference between integer linear and integer log) than they differ from each other (only 0.33%).
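For reference, the numbers quoted above are consistent with this quick check (not the original analysis script), using the medians from the table:

```python
import numpy as np

medians = np.array([455.36, 465.81, 467.35])  # int/lin, int/log, float

cv = medians.std() / medians.mean() * 100                                 # ~1.15%
diff_lin_log = abs(medians[1] - medians[0]) / medians[:2].mean() * 100    # ~2.27%
diff_log_float = abs(medians[2] - medians[1]) / medians[1:].mean() * 100  # ~0.33%
```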
It is nice to see that the linear integer case is performing relatively well, even though it is such a weird dataset. I'm not too concerned with taking this to the same performance as the other two, but I'm happy to hear suggestions.
The Excel UI has also been modified to work with floating-point files. The most important modifications are:

- The bead model is now `m*log(fl_au[i]) + b = log(fl_mef_auto + fl_mef[i])`.
- `high_low` gating is not applied if data is floating-point.
Using the same commit, the results of running the example files with the Excel UI are pretty much the same, except for density gating of the beads:
Because everything is being processed in log scale, the FSC/SSC density diagram of beads looks different. However, elimination of the secondary peaks in the fluorescence channels is achieved anyway. I would argue that this is the most correct way to do gating, in terms of being consistent with everything else.
The identified peak values change a little bit, though, because the gate is now different. This is a comparison of the bead model parameters for B0002 FL1 with 1.0.0 and the current commit:
| Version | m | b | fl_mef_auto |
|---|---|---|---|
| 1.0.0 | 1.0571538506455078 | 2.43684936642 | 1180.28240707 |
| 4bd6c1e8e688bf052e2cd59e4cf4675bec0144b3 | 1.05720955531 | 2.43642936538 | 1180.03466155 |
| Error (%) | 0.0053 | 0.017 | 0.021 |
(Note that `m` from the model in 1.0.0 was converted to `m` from this model using `256*m/log(10)`.)
This changes the MEF values of the corresponding samples, although negligibly. The following shows the percent difference in the FL1 statistics of the samples that use B0002, between the method in 1.0.0 and the current commit:
In summary, the only change that affects how we run our samples is the fact that beads are gated in log scale, which does not seem to affect the final MEF fluorescence of samples.
The following are some of the results of running the Excel UI against the SEA dataset.
Beads plots produced by the Excel UI are almost identical to the plots presented above using the Python API. The parameters obtained in one run are the following:
| dataset | m | b | fl_mef_auto |
|---|---|---|---|
| Integer, linear | 1.01912269529 | 2.52996967802 | 104.313744082 |
| Integer, log | 1.01167477702 | 0.264344458589 | 112.16949497 |
| Floating point | 1.01315711837 | -3.05889290482 | 111.522298167 |
Not very different from the numbers obtained with the Python API, even though the sequence of gating steps and the gating fraction were not identical.
Plots are also very similar to the ones presented above. Median fluorescence is as follows:
| dataset | median APC-A |
|---|---|
| Integer, linear | 456.579903810335 |
| Integer, log | 469.519100415878 |
| Floating point | 466.794512171886 |
It turns out that samples with low fluorescence are saturating to the left, so the data itself is not very useful. However, it shows some of the intricacies of floating-point data. Here are the median APC values:
| dataset | median APC-A |
|---|---|
| Integer, linear | 12.5531254952804 |
| Integer, log | 16.3462793998462 |
| Floating point | 0.624765572644506 |
Why they are so different can be understood by looking at the plots below.
Linear integer data is probably not very reliable given the resolution, so not much can be said besides that the data sucks.
A lot of saturated values are present. If these are eliminated (as the Excel UI would actually do, not shown here), almost half of the population is lost. The remaining events have larger values than the population as a whole, and therefore the median fluorescence increases.
Just by looking at the plot in the middle, one would be tempted to say that the median fluorescence is 10-20 MEF. However, the linear plot below shows that the population is actually symmetrical around zero, and there is another peak to the left (perhaps that's how floating-point data saturates?). Nevertheless, the log plot is misleading in presenting the central tendency of the population, which actually seems closer to zero, as the linear plot shows. A logicle plot would improve visualization of floating-point data.
Geometric statistics are, strictly speaking, not defined if negative elements exist. Since some events with negative or zero value exist in all samples, the Excel UI is currently not calculating geometric means for floating-point data. This is not explicit, but an effect of `numpy.gmean`'s results being `nan` and `pandas` writing them as empty cells in the output file. This is probably not bad for something like the file shown above, but it is problematic for samples like this:
There are 2 events less than zero, but the whole calculation of the geometric mean is invalidated. I think the best compromise is to calculate geometric statistics on events greater than zero, and throw a warning stating the percentage of events that are less than or equal to zero. That way, in combination with the histogram, the user can decide whether these statistics are meaningful or not.
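Something along these lines is what I have in mind (`gmean_positive` is a hypothetical helper for illustration, not an existing FlowCal function):

```python
import numpy as np

def gmean_positive(x):
    # Geometric mean over strictly positive events only. Events <= 0 are
    # excluded, and the excluded fraction is reported so the user can judge
    # whether the statistic is meaningful.
    x = np.asarray(x, dtype=float)
    positive = x[x > 0]
    frac_excluded = 1. - float(positive.size) / x.size
    if frac_excluded > 0:
        print "Warning: {:.2f}% of events are <= 0 and were excluded from geometric statistics.".format(100. * frac_excluded)
    return np.exp(np.mean(np.log(positive)))
```

For an FCSData sample `s`, this would be called as `gmean_positive(s[:, 'APC-A'])`.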
@JS3xton if you have any feedback, now is the time.
Logicle plotting is now implemented. It required additional classes in the `plot` module (for custom scaling in matplotlib). There was also a small addition to `FCSData.hist_bins` to handle the new scaling.
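Not the actual implementation, but to illustrate the idea: a logicle-like scale is roughly linear near zero and logarithmic for large values (arcsinh is used below as a stand-in), and histogram bin edges then have to be spaced evenly in the transformed coordinate, which is essentially what the `hist_bins` addition takes care of. All names and the `linear_width` parameter are made up for this sketch.

```python
import numpy as np

def asinh_transform(x, linear_width=150.):
    # Roughly linear for |x| << linear_width, logarithmic for |x| >> linear_width;
    # well-defined for negative values, unlike a plain log scale.
    return np.arcsinh(np.asarray(x, dtype=float) / linear_width)

def asinh_inverse(y, linear_width=150.):
    return np.sinh(np.asarray(y, dtype=float)) * linear_width

def asinh_bin_edges(xmin, xmax, nbins=256, linear_width=150.):
    # Edges evenly spaced in transformed space, mapped back to data space.
    edges = np.linspace(asinh_transform(xmin, linear_width),
                        asinh_transform(xmax, linear_width),
                        nbins + 1)
    return asinh_inverse(edges, linear_width)
```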
The following was run against f99ccc0e940377cf1902e361ac23b484024cfde7:
```python
import FlowCal

d1 = FlowCal.io.FCSData('FCFiles_3.0/B_subtilis_Tube_001.fcs')
d2 = FlowCal.io.FCSData('FCFiles_3.0/B_subtilis_Tube_008.fcs')

FlowCal.plot.density2d(
    d1,
    channels=['PE-A', 'APC-A'],
    xscale='logicle',
    yscale='logicle',
    mode='scatter',
    savefig='logicle/fl_logicle.png')

FlowCal.plot.scatter2d(
    [d1, d2],
    channels=['FITC-A', 'APC-A'],
    xscale='logicle',
    yscale='logicle',
    savefig='logicle/scatter_logicle.png')

FlowCal.plot.scatter3d_and_projections(
    [d1, d2],
    channels=['FITC-A', 'PE-A', 'APC-A'],
    xscale='logicle',
    yscale='logicle',
    zscale='logicle',
    savefig='logicle/scatter3d_and_projections_logicle.png')

FlowCal.plot.hist1d(
    [d1, d2],
    channel='APC-A',
    xscale='logicle',
    savefig='logicle/hist_logicle_2.png')
```
I will now integrate this into the Excel UI. As per my previous discussion with @JS3xton, I'll try to make everything default to logicle instead of log, and see how the plots from our old files look.
- `sigma` is still hardcoded to 10.0 for `density2d()`. Is this still appropriate? Some of the FSC-SSC plots looked like the smoothing was too much (this is entirely subjective and relative to what I've gotten used to with our current setup [i.e. FlowCal v1.0.0 on files from our flow cytometer]). I feel like this should be generalized to something proportional to the standard deviation or covariance of the FSC vs. SSC populations, though, unless there's some black magic in the transformations that degenerates that to a constant. Should keep in mind that this `sigma` value is being passed to a 2D Gaussian blur function which is agnostic to our efforts to scale the data (it just sees a 2D matrix, I believe). Generalizing would basically require establishing an equivalence between the 2D matrix bin sizes and the effective range / spread of the data in a.u.? (See the sketch after this list.)
- Applying `gate.high_low()` only when appropriate: indeed, you added a check if $DATATYPE is I; 👍
- `mode=mesh` instead of `mode=scatter`? May not matter anymore on a logicle scale, but they looked weird with purely `log` scale and `mode=scatter`.
- Making sure I understand the changes to the standard curve: the old standard curve model was `m*fl_channel[i] + b = log(fl_mef_auto + fl_mef[i])` and the new model is `m*log(fl_au[i]) + b = log(fl_mef_auto + fl_mef[i])`, the only difference being the introduction of `log()` around the measured fluorescence value. This is because we were previously calculating this standard curve to map channel values to MEF values, and our channel values are inherently in log(a.u.) scale because we have log amplifier + integer data. This doesn't generalize to linear amplifier + integer data or float data, because both report channel values in a.u. scale, not log(a.u.) scale. To address this, we decided to just map everything to a.u. (which effectively exponentiates our log amp + int data) to serialize all 3 cases to a common scale. From there we fit a model mapping a.u. -> MEF. We still go to the trouble of fitting this model in log space because we still think a logarithmically distributed error model is more appropriate? (This is now tied to our particular logarithmically distributed beads, right? Linear beads might be better fit in linear space?) Because otherwise, this is almost a straightforward line fit; I think you can rearrange it as `fl_mef[i] = 10^b * fl_au[i]^m - fl_mef_auto`.
- The `m` parameter of the new standard curves seems to be fundamentally equivalent to 1; should we consider a simpler model without that term (only the intercept and autofluorescence)? As above, this becomes exactly a line fit: `fl_mef[i] = 10^b * fl_au[i] - fl_mef_auto` (which may still be more appropriately fit in log space).
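To make the point about `sigma` concrete, here is a minimal sketch (not FlowCal's actual code) of a density-gating-style smoothing step, with made-up data: `sigma` is expressed in bins of the 2D histogram, regardless of whether the bin edges are spaced linearly or logarithmically in a.u.

```python
import numpy as np
import scipy.ndimage

# Made-up FSC/SSC events, roughly log-normally distributed.
np.random.seed(0)
fsc = np.random.lognormal(mean=8.0, sigma=0.5, size=10000)
ssc = np.random.lognormal(mean=7.0, sigma=0.5, size=10000)

# Bin on a log scale, as density2d would do with xscale='log', yscale='log'.
xedges = np.logspace(np.log10(fsc.min()), np.log10(fsc.max()), 129)
yedges = np.logspace(np.log10(ssc.min()), np.log10(ssc.max()), 129)
hist, _, _ = np.histogram2d(fsc, ssc, bins=[xedges, yedges])

# The Gaussian blur only ever sees the binned matrix: sigma=10.0 means
# "10 bins", independently of how the edges were spaced in a.u.
smoothed = scipy.ndimage.gaussian_filter(hist, sigma=10.0)
```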
(to be continued)
- Regarding `sigma`: the new binning has introduced some variation with respect to our previous results. However, the variation introduced is small, and I don't think this is a problem. If we want to address this (which again I think we don't have to), using a smaller fixed value of `sigma` just for beads might be enough. I tried changing it to 5, but didn't see significant changes. 2 definitely made it terrible.
- Regarding the $DATATYPE check: it is based on `.data_type`.
- Regarding the model: `m` does not exactly equal one. This is the paper data (one sample at different gains) rerun with the latest `floating-mef`:

Which is why I'd rather not remove `m` from the model.

> The new binning has introduced some variation with respect to our previous results. However, the variation introduced is small, and I don't think this is a problem. If we want to address this (which again I think we don't have to), using a smaller fixed value of `sigma` just for beads might be enough. I tried changing it to 5, but didn't see significant changes. 2 definitely made it terrible.
mmk. that's enough reason for me not to worry about it.
- Regarding `mode=mesh` vs. `mode=scatter`: might be nice if that was done automatically, because `mode=mesh` definitely makes a lot more sense for that particular use case. This is a very small issue, though (and still achievable via the Python interface), and I don't have any great ideas for doing it automatically.
- Regarding the model: I still feel dissatisfied regarding `m`.
  - Are you sure the minor deviation from 1.0 is not just an instance of overfitting?
  - If we had mapped to a.u. from the beginning (or fallen into the linear amp + int data or float data use cases), we would have started off mapping a.u. -> MEF (not log(a.u.) -> MEF). I feel like that would have been a linear transformation (no intercept). Bead autofluorescence might have necessitated an affine transformation again (added y-intercept term). Introduction of the `m` term (raising the a.u. value to a power) seems very unintuitive to me, though. That might make sense after looking at a.u. vs. MEF on log-log plots (which is only done because we think a logarithmic error model is more appropriate based on the specific beads we chose) and then noticing a deviation (which would be very small?) from a slope of 1. All of this being said, I haven't looked at a standard curve where `m` was hardcoded to 1; perhaps there really is a noticeable deviation from a slope of 1.
  - Also, if there are cytometers which employ log amplifiers and don't properly specify how to map back to RFI, then it might be nice to have the `m` term to fall back on. Not sure if that happens frequently, though.
- 👍 for warning on geometric statistics if values exist <= 0. Is that thrown in both the terminal output and the "Analysis" column?
- Can I combine scales? e.g. `xscale=logicle` and `yscale=log`? I would assume so, but haven't actually tried. All the downstream bins calculations and plotting work as expected?

(I'm on to reviewing code in #213.)
> Regarding `mode=mesh` vs. `mode=scatter`: might be nice if that was done automatically, because `mode=mesh` definitely makes a lot more sense for that particular use case. This is a very small issue, though (and still achievable via the Python interface), and I don't have any great ideas for doing it automatically.
Yeah, I don't have good ideas on how to do this either.
> Regarding the model: I still feel dissatisfied regarding `m`.
>
> Are you sure the minor deviation from 1.0 is not just an instance of overfitting?

I wouldn't call this overfitting. Overfitting occurs when the model is too complex for the data, such that slightly different data would give very different parameter fits (i.e., high variance). In this case, the fact that I'm consistently obtaining values slightly above 1 tells me that `m` is greater than one.
> If we had mapped to a.u. from the beginning (or fallen into the linear amp + int data or float data use cases), we would have started off mapping a.u. -> MEF (not log(a.u.) -> MEF). I feel like that would have been a linear transformation (no intercept). Bead autofluorescence might have necessitated an affine transformation again (added y-intercept term). Introduction of the `m` term (raising the a.u. value to a power) seems very unintuitive to me, though. That might make sense after looking at a.u. vs. MEF on log-log plots (which is only done because we think a logarithmic error model is more appropriate based on the specific beads we chose) and then noticing a deviation (which would be very small?) from a slope of 1. All of this being said, I haven't looked at a standard curve where `m` was hardcoded to 1; perhaps there really is a noticeable deviation from a slope of 1.
Note that `m` in the previous model could also be derived theoretically from the amplifier characteristics (number of decades, bits, etc.). In fact, Evan and I did that once, and found that the fitted value also consistently differed from the theoretical one. So this model is not more complex than the previous one.

Why is `m` different from one? I'm not sure. I vaguely attribute this to "detector nonlinearity", but I don't have a great explanation.
> Also, if there are cytometers which employ log amplifiers and don't properly specify how to map back to RFI, then it might be nice to have the `m` term to fall back on. Not sure if that happens frequently, though.
The standard explicitly requires that log channels contain enough information to decode them properly.
> 👍 for warning on geometric statistics if values exist <= 0. Is that thrown in both the terminal output and the "Analysis" column?
Yes.
> Can I combine scales? e.g. `xscale=logicle` and `yscale=log`? I would assume so, but haven't actually tried. All the downstream bins calculations and plotting work as expected?
Yep.
It turns out that the SEA's flow cytometer can export data in FCS 2.0 in both linear and log scale, and in FCS 3.0 in linear scale with floating-point data. I got all three versions after one of my experiments, including several cell samples and beads. @JS3xton if you're interested in looking at them, I can put them on Dropbox or something.
This issue will be considered resolved when the three datasets result in the same MEF values for each cell sample when run through the Excel UI. There may be some issues in processing integer linear data, so at the very least integer log data and floating-point linear data should work.
I will now show the result of running the Excel UI with the three sets of files, as performed by the current `floating-mef` branch (43a9dcb5dad54e789e89416e768e838aab01c10e).

FCS2.0 with log amplifier, 10-bit integer
This is the setup that we currently support, so the following results are what we would consider correct.
Beads
Initial density/histogram plot:
2D Clustering:
Populations used for standard curve generation:
Standard curve:
Cell sample
FCS2.0 with linear amplifier, 10-bit integer
Beads
Initial density/histogram plot:
2D Clustering:
Populations used for standard curve generation:
Standard curve:
Cell sample
Obviously all of this is super incorrect. Interestingly, clustering seems to work fine.
FCS3.0 with linear amplifier, floating point
Beads
Initial density/histogram plot:
2D Clustering:
Populations used for standard curve generation:
Standard curve:
Cell sample
Also super incorrect. Note that the failure modes here seem to be similar to the FCS2.0 integer with linear amplifier case, suggesting that amplification type, and not resolution, will be the major challenge.