taborlab / FlowCal

Python Flow Cytometry Calibration Library
MIT License

Strange gated counts/millisecond in output excel file #34

Closed · castillohair closed this issue 9 years ago

castillohair commented 9 years ago

Values in the "Gated Counts/millisecond" column of the output Excel file range from 7 to 13. This implies that these samples were acquired at 23,000 to 43,000 events/second (taking gating into account), which is unusually high. My guess is that the counts/millisecond calculation is wrong.

castillohair commented 9 years ago

I did some tests with the data.001 file in the examples directory.

>>> import fc
>>> d = fc.io.TaborLabFCSData('examples/FCFiles/data.001')

According to http://onlinelibrary.wiley.com/doi/10.1002/cyto.990110303/abstract, the $BTIM and $ETIM keywords record the acquisition start and end times. http://murphylab.web.cmu.edu/FCSAPI/FCS3.html says the same is true in FCS3.0.

>>> d.text['$BTIM']
'20:57:49'
>>> d.text['$ETIM']
'20:58:09'
>>> import time
>>> time.mktime(time.strptime(d.text['$ETIM'], '%H:%M:%S')) - time.mktime(time.strptime(d.text['$BTIM'], '%H:%M:%S'))
20.0
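As a side note, strptime-based subtraction goes negative if an acquisition crosses midnight. A small helper along these lines would cover that case (a sketch only; the function name is illustrative, and it assumes runs shorter than 24 h):

```python
from datetime import datetime, timedelta

def acquisition_time_s(btim, etim, fmt='%H:%M:%S'):
    """Elapsed seconds between $BTIM and $ETIM.

    Assumes the run is shorter than 24 h; if $ETIM is earlier than
    $BTIM, the acquisition crossed midnight and a day is added.
    """
    delta = datetime.strptime(etim, fmt) - datetime.strptime(btim, fmt)
    if delta < timedelta(0):
        delta += timedelta(days=1)
    return delta.total_seconds()

print(acquisition_time_s('20:57:49', '20:58:09'))  # 20.0
```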

In addition, the 'Time' channel registers the time at which each event is detected.

>>> d[:,'Time']
TaborLabFCSData([  0,   0,   0, ..., 388, 388, 388], dtype=uint16)
>>> max(d[:,'Time'])
388

However, the units are not obvious. The FCS3.0 standard has the $TIMESTEP text attribute, which gives the duration of each time step in seconds as a floating-point number (http://murphylab.web.cmu.edu/FCSAPI/FCS3.html). The FCS2.0 specification, however, doesn't have a similar parameter (http://onlinelibrary.wiley.com/doi/10.1002/cyto.990110303/abstract). Some less reliable sources mention the TIMETICKS attribute, which specifies the duration of each step in milliseconds (flowrepository.org/experiments/4/attachments/786/download). If we assume this to be true, then

>>> d.text['TIMETICKS']
'50'
>>> max(d[:,'Time'])*float(d.text['TIMETICKS'])/1000.
19.399999999999999

which closely matches the 20 s calculated from $BTIM/$ETIM above.
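The keyword handling could live in one helper that knows both conventions ($TIMESTEP in seconds per FCS3.0, TIMETICKS in milliseconds per the less reliable sources). This is a sketch, not FlowCal code; the name is illustrative:

```python
def time_step_s(text):
    """Seconds per unit of the Time channel, given an FCS TEXT
    segment as a dict. Prefers the standard FCS3.0 keyword and
    falls back to the unofficial FCS2.0 one."""
    if '$TIMESTEP' in text:
        return float(text['$TIMESTEP'])          # already in seconds
    if 'TIMETICKS' in text:
        return float(text['TIMETICKS']) / 1000.  # milliseconds
    raise KeyError('TEXT segment has neither $TIMESTEP nor TIMETICKS')

# For data.001: 388 ticks * 0.05 s/tick = 19.4 s
print(388 * time_step_s({'TIMETICKS': '50'}))
```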

castillohair commented 9 years ago

Considering that we will be using FCS3.0 files soon, I can think of three solutions to this:

  1. Use $BTIM and $ETIM to calculate the elapsed time. I know that @thoreusc had some issues with this method because it has lower resolution than using the event list, but it seems to be the official way if all we care about are the start/end points. At the very least, this will have to be the default method when time data is not present.
  2. Write a function in the FCSData class that returns the timestep, and let the class worry about whether to use TIMETICKS, $TIMESTEP, or anything else. The caller then multiplies the Time channel by the timestep to obtain meaningful numbers.
  3. Let the class rescale the Time channel to units that make sense (like seconds) at creation time.

In addition, I propose replacing the "Gated Counts/millisecond" column with "Acquisition Time (s)". This will allow the user to calculate gated or ungated counts/second using standard Excel functionality.
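The intent is that, given an "Acquisition Time (s)" column, any rate the old column tried to report reduces to a division (made-up numbers below; in Excel this would just be `=counts/time`):

```python
def events_per_second(n_events, acq_time_s):
    """Event rate from a count and a duration; works equally for
    gated or ungated counts."""
    return n_events / acq_time_s

print(events_per_second(20000, 20.0))  # 1000.0
```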

Opinions? @BrianLandry @thoreusc @JS3xton

JS3xton commented 9 years ago

I agree with everything that you've uncovered.

I'm leery of (1), but it's probably good enough for a first-order approximation (it's also the only option, as you mention, if you don't have a time signal). It depends on what the user wants.

My first impression is that, philosophically, I don't like (2) and (3) because I feel like they bog down the FCS file objects, but the jury is still out for me.

I'm also in favor of just giving the acquisition duration in the Excel spreadsheet. This is a number you can be confident in; very few assumptions go into it. Let users make assumptions if they're comfortable with them and calculate their event rate on their own (do you also output the total number of gated and ungated events? I don't know; I haven't used it yet).

castillohair commented 9 years ago

I believe the time channel is different because it is the only quantity in which there is almost universal agreement about the appropriate units to use (unless you ask a hardcore physicist). As you said, I can't think of any reason to not use seconds or milliseconds for time information. However, I also don't like the idea of modifying the time channel right when the file is loaded. Maybe a transform function could be implemented to handle this?
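A minimal sketch of such a transform (the name `time_to_seconds` is hypothetical, not an existing fc.transform function; it returns a float copy so the uint16 Time data is left untouched):

```python
import numpy as np

def time_to_seconds(time_channel, timestep_s):
    """Rescale raw Time-channel ticks to seconds.

    `timestep_s` would come from $TIMESTEP (FCS3.0) or from
    TIMETICKS/1000 (FCS2.0), as discussed earlier in this thread.
    """
    return np.asarray(time_channel, dtype=np.float64) * timestep_s

print(time_to_seconds([0, 194, 388], 0.05))
```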

I hadn't considered fitting a line to # events vs. time. I ran some tests and got the following results.

Raw data (attached plot time0):

events/sec (BTIM/ETIM method): 930.35
events/sec (events method): 959.12
events/sec (slope method): 964.77
Error in BTIM/ETIM method: 3.00%
Error in slope method: 0.59%

After start/end trimming (attached plot time1):

events/sec (BTIM/ETIM method): 912.85
events/sec (events method): 958.37
events/sec (slope method): 965.53
Error in BTIM/ETIM method: 4.75%
Error in slope method: 0.74%

After low/high gating (attached plot time2):

events/sec (BTIM/ETIM method): 859.75
events/sec (events method): 902.62
events/sec (slope method): 909.07
Error in BTIM/ETIM method: 4.75%
Error in slope method: 0.71%

After density gating (30%) (attached plot time3):

events/sec (BTIM/ETIM method): 257.95
events/sec (events method): 270.81
events/sec (slope method): 272.84
Error in BTIM/ETIM method: 4.75%
Error in slope method: 0.74%

Highlights: in my opinion, the precision difference between the slope method and the events method is negligible, and I'm not convinced that the slope method is theoretically more correct.
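For concreteness, the two Time-based estimators compared above can be sketched as follows (assuming a Time channel already converted to seconds; these are my reconstructions, and the scripts that produced the numbers above may differ):

```python
import numpy as np

def rate_events(time_s):
    """'events' method: total event count divided by the span of
    the Time channel."""
    time_s = np.asarray(time_s, dtype=np.float64)
    return len(time_s) / (time_s.max() - time_s.min())

def rate_slope(time_s):
    """'slope' method: least-squares slope of cumulative event
    count vs. time, in events/s."""
    time_s = np.asarray(time_s, dtype=np.float64)
    counts = np.arange(1, len(time_s) + 1)
    slope, _intercept = np.polyfit(time_s, counts, 1)
    return slope

# Synthetic check: 1001 evenly spaced events over 10 s
t = np.linspace(0.0, 10.0, 1001)
print(rate_events(t))  # 100.1
print(rate_slope(t))
```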

castillohair commented 9 years ago

Proposed changes:

After checking @JS3xton's comments and running some analysis, I propose the following changes:

A transform function that converts time to seconds (for the lazy, like myself) may be implemented in the future.

I will wait a couple of hours for comments; otherwise, I'll start implementing this.

BrianLandry commented 9 years ago

@castillohair's proposed changes sound good to me. I also like outputting the acquisition time instead of a rate.

castillohair commented 9 years ago

Solved in e9ecffd4362481455cb8d8675527d21cc0ecd13c.