taborlab / FlowCal

Python Flow Cytometry Calibration Library
MIT License

Strange gated counts/millisecond in output excel file #34

Closed · castillohair closed this issue 9 years ago

castillohair commented 9 years ago

Values in the "Gated Counts/millisecond" column of the output Excel file range from 7 to 13. This implies that these samples were acquired at 23,000 to 43,000 events/second (taking gating into account), which is unusually high. My guess is that the counts/millisecond calculation is wrong.

castillohair commented 9 years ago

I did some tests with the data.001 file in the examples directory.

>>> import fc
>>> d = fc.io.TaborLabFCSData('examples/FCFiles/data.001')

According to http://onlinelibrary.wiley.com/doi/10.1002/cyto.990110303/abstract, the $BTIM and $ETIM keywords record the acquisition start and end times. http://murphylab.web.cmu.edu/FCSAPI/FCS3.html says the same is true in FCS3.0.

>>> d.text['$BTIM']
'20:57:49'
>>> d.text['$ETIM']
'20:58:09'
>>> import time
>>> time.mktime(time.strptime(d.text['$ETIM'], '%H:%M:%S')) - time.mktime(time.strptime(d.text['$BTIM'], '%H:%M:%S'))
20.0
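As a side note, strptime-based subtraction goes negative if an acquisition crosses midnight. A small helper along these lines would cover that case (a sketch only; the function name is illustrative, and it assumes runs shorter than 24 h):

```python
from datetime import datetime, timedelta

def acquisition_time_s(btim, etim, fmt='%H:%M:%S'):
    """Elapsed seconds between $BTIM and $ETIM.

    Assumes the run is shorter than 24 h; if $ETIM is earlier than
    $BTIM, the acquisition crossed midnight and a day is added.
    """
    delta = datetime.strptime(etim, fmt) - datetime.strptime(btim, fmt)
    if delta < timedelta(0):
        delta += timedelta(days=1)
    return delta.total_seconds()

print(acquisition_time_s('20:57:49', '20:58:09'))  # 20.0
```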

In addition, the 'Time' channel registers the time at which each event is detected.

>>> d[:,'Time']
TaborLabFCSData([  0,   0,   0, ..., 388, 388, 388], dtype=uint16)
>>> max(d[:,'Time'])
388

However, the units are not obvious. The FCS3.0 standard has the $TIMESTEP text attribute, which gives the duration of each time step in seconds as a floating-point number (http://murphylab.web.cmu.edu/FCSAPI/FCS3.html). The FCS2.0 specification, however, doesn't have a similar parameter (http://onlinelibrary.wiley.com/doi/10.1002/cyto.990110303/abstract). Some less reliable sources mention the TIMETICKS attribute, which specifies the duration of each step in milliseconds (flowrepository.org/experiments/4/attachments/786/download). If we assume this to be true, then

>>> d.text['TIMETICKS']
'50'
>>> max(d[:,'Time'])*float(d.text['TIMETICKS'])/1000.
19.399999999999999

which closely matches the 20 s calculated from $BTIM/$ETIM above.
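The keyword handling could live in one helper that knows both conventions ($TIMESTEP in seconds per FCS3.0, TIMETICKS in milliseconds per the less reliable sources). This is a sketch, not FlowCal code; the name is illustrative:

```python
def time_step_s(text):
    """Seconds per unit of the Time channel, given an FCS TEXT
    segment as a dict. Prefers the standard FCS3.0 keyword and
    falls back to the unofficial FCS2.0 one."""
    if '$TIMESTEP' in text:
        return float(text['$TIMESTEP'])          # already in seconds
    if 'TIMETICKS' in text:
        return float(text['TIMETICKS']) / 1000.  # milliseconds
    raise KeyError('TEXT segment has neither $TIMESTEP nor TIMETICKS')

# For data.001: 388 ticks * 0.05 s/tick = 19.4 s
print(388 * time_step_s({'TIMETICKS': '50'}))
```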

castillohair commented 9 years ago

Considering that we will be using FCS3.0 files soon, I can think of three solutions to this:

  1. Use $BTIM and $ETIM to calculate the elapsed time. I know that @thoreusc had some issues with this method because it has lower resolution than using the event list, but it seems to be the official way if all we care about are the start/end points. At the very least, this will have to be the default method when time data is not present.
  2. Write a function in the FCSData class that returns the timestep, and let the class worry about whether to use TIMETICKS, $TIMESTEP, or anything else. The caller then multiplies the Time channel by the timestep to obtain meaningful numbers.
  3. Let the class rescale the Time channel to units that make sense (like seconds) at creation time.

In addition, I propose replacing the "Gated Counts/millisecond" column with "Acquisition Time (s)". This will allow the user to calculate gated or ungated counts/second using standard Excel functionality.
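The intent is that, given an "Acquisition Time (s)" column, any rate the old column tried to report reduces to a division (made-up numbers below; in Excel this would just be `=counts/time`):

```python
def events_per_second(n_events, acq_time_s):
    """Event rate from a count and a duration; works equally for
    gated or ungated counts."""
    return n_events / acq_time_s

print(events_per_second(20000, 20.0))  # 1000.0
```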

Opinions? @BrianLandry @thoreusc @JS3xton

JS3xton commented 9 years ago

I agree with everything that you've uncovered.

I'm leery of (1), but it's probably good enough for a first-order approximation (it's also the only option, as you mention, if you don't have a time signal). It depends on what the user wants.

My first impression is that, philosophically, I don't like (2) and (3) because I feel like they bog down the FCS file objects, but the jury is still out for me.

I'm also in favor of just giving the acquisition duration in the Excel spreadsheet. This is a number you can be confident in; very few assumptions go into it. Let users make assumptions if they're comfortable with them and calculate their event rate on their own (do you also output the total number of gated and ungated events? I don't know; I haven't used it yet).

castillohair commented 9 years ago

I believe the time channel is different because it is the only quantity in which there is almost universal agreement about the appropriate units to use (unless you ask a hardcore physicist). As you said, I can't think of any reason to not use seconds or milliseconds for time information. However, I also don't like the idea of modifying the time channel right when the file is loaded. Maybe a transform function could be implemented to handle this?
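A minimal sketch of such a transform (the name `time_to_seconds` is hypothetical, not an existing fc.transform function; it returns a float copy so the uint16 Time data is left untouched):

```python
import numpy as np

def time_to_seconds(time_channel, timestep_s):
    """Rescale raw Time-channel ticks to seconds.

    `timestep_s` would come from $TIMESTEP (FCS3.0) or from
    TIMETICKS/1000 (FCS2.0), as discussed earlier in this thread.
    """
    return np.asarray(time_channel, dtype=np.float64) * timestep_s

print(time_to_seconds([0, 194, 388], 0.05))
```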

I hadn't considered fitting a line to # events vs. time. I ran some tests and got the following results.

Raw data (attached plot time0):

events/sec (BTIM/ETIM method): 930.35
events/sec (events method): 959.12
events/sec (slope method): 964.77
Error in BTIM/ETIM method: 3.00%
Error in slope method: 0.59%

After start/end trimming (attached plot time1):

events/sec (BTIM/ETIM method): 912.85
events/sec (events method): 958.37
events/sec (slope method): 965.53
Error in BTIM/ETIM method: 4.75%
Error in slope method: 0.74%

After low/high gating (attached plot time2):

events/sec (BTIM/ETIM method): 859.75
events/sec (events method): 902.62
events/sec (slope method): 909.07
Error in BTIM/ETIM method: 4.75%
Error in slope method: 0.71%

After density gating (30%) (attached plot time3):

events/sec (BTIM/ETIM method): 257.95
events/sec (events method): 270.81
events/sec (slope method): 272.84
Error in BTIM/ETIM method: 4.75%
Error in slope method: 0.74%

Highlights: in my opinion, the precision difference between the slope method and the events method is negligible, and I'm not convinced that the slope method is theoretically more correct.
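For concreteness, the two Time-based estimators compared above can be sketched as follows (assuming a Time channel already converted to seconds; these are my reconstructions, and the scripts that produced the numbers above may differ):

```python
import numpy as np

def rate_events(time_s):
    """'events' method: total event count divided by the span of
    the Time channel."""
    time_s = np.asarray(time_s, dtype=np.float64)
    return len(time_s) / (time_s.max() - time_s.min())

def rate_slope(time_s):
    """'slope' method: least-squares slope of cumulative event
    count vs. time, in events/s."""
    time_s = np.asarray(time_s, dtype=np.float64)
    counts = np.arange(1, len(time_s) + 1)
    slope, _intercept = np.polyfit(time_s, counts, 1)
    return slope

# Synthetic check: 1001 evenly spaced events over 10 s
t = np.linspace(0.0, 10.0, 1001)
print(rate_events(t))  # 100.1
print(rate_slope(t))
```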

castillohair commented 9 years ago

Proposed changes:

After checking @JS3xton's comments and running some analysis, I propose the following changes:

A transform function that converts time to seconds (for the lazy, like myself) may be implemented in the future.

I will wait a couple of hours for comments; otherwise, I'll start implementing this.

BrianLandry commented 9 years ago

@castillohair's proposed changes sound good to me. I also like outputting the acquisition time instead of a rate.

castillohair commented 9 years ago

Solved in e9ecffd4362481455cb8d8675527d21cc0ecd13c.