AssertionError when calling calc_bg from .h5 file

tritemio / FRETBursts

Burst analysis software for smFRET. **Moved to OpenSMFS organization**

https://github.com/OpenSMFS/FRETBursts

GNU General Public License v2.0

16 stars 17 forks source link

AssertionError when calling calc_bg from .h5 file #47

Closed ncodina closed 7 years ago

ncodina commented 8 years ago

After converting my ptu file to hdf5, I am getting an error trying to calculate background rates.

From what I can tell, the hdf5 file is correct as it is loading into HDFView. Here is a link to download the file if that helps: https://www.dropbox.com/s/2n65978ei58kd50/test_DM1_smFRET.h5?dl=0

In [1]: import fretbursts as fb
 - Optimized (cython) burst search loaded.
 - Optimized (cython) photon counting loaded.
--------------------------------------------------------------
 You are running FRETBursts (version 0.5.9).

 If you use this software please cite the following paper:

   FRETBursts: An Open Source Toolkit for Analysis of Freely-Diffusing Single-Molecule FRET
   Ingargiola et al. (2016). http://dx.doi.org/10.1371/journal.pone.0160716 

--------------------------------------------------------------

In [2]: d = fb.loader.photon_hdf5('/Users/xxx/05_SM_data/test_DM1_smFRET.h5')

In [3]: d.calc_bg(fb.bg.exp_fit, time_s = 30, tail_min_us='auto')
 - Calculating BG rates ... ---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-3-66994717892f> in <module>()
----> 1 d.calc_bg(fb.bg.exp_fit, time_s = 30, tail_min_us='auto')

/Users/nuriacodina/anaconda/lib/python3.5/site-packages/fretbursts/burstlib.py in calc_bg(self, fun, time_s, tail_min_us, F_bg, error_metrics)
   1467         if tail_min_us == 'auto':
   1468             bg_auto_th = True
-> 1469             Th_us = self._get_auto_bg_th_arrays(F_bg=F_bg)
   1470         else:
   1471             bg_auto_th = False

/Users/nuriacodina/anaconda/lib/python3.5/site-packages/fretbursts/burstlib.py in _get_auto_bg_th_arrays(self, F_bg, tail_min_us0)
   1364             for ich, ph in enumerate(self.iter_ph_times(ph_sel=ph_sel)):
   1365                 if ph.size > 0:
-> 1366                     bg_rate, _ = bg.exp_fit(ph, tail_min_us=tail_min_us0)
   1367                     th_us[ich] = 1e6 * F_bg / bg_rate
   1368             Th_us[ph_sel] = th_us

/Users/nuriacodina/anaconda/lib/python3.5/site-packages/fretbursts/background.py in exp_fit(ph, tail_min_us, clk_p, error_metrics)
    122     return _exp_fit_generic(ph, fit_fun=exp_fitting.expon_fit,
    123                             tail_min_us=tail_min_us, clk_p=clk_p,
--> 124                             error_metrics=error_metrics)
    125 
    126 def exp_cdf_fit(ph, tail_min_us=None, clk_p=12.5e-9, error_metrics=None):

/Users/nuriacodina/anaconda/lib/python3.5/site-packages/fretbursts/background.py in _exp_fit_generic(ph, fit_fun, tail_min_us, tail_min_p, clk_p, error_metrics)
     85         tail_min = tail_min_us*1e-6/clk_p
     86 
---> 87     res = fit_fun(dph, s_min=tail_min, calc_residuals=error_metrics is not None)
     88     Lambda, residuals, x_residuals, s_size = res
     89 

/Users/nuriacodina/anaconda/lib/python3.5/site-packages/fretbursts/fit/exp_fitting.py in expon_fit(s, s_min, offset, calc_residuals)
     75     """
     76     if s_min > 0: s = s[s >= s_min] - s_min
---> 77     assert s.size > 10
     78 
     79     # Maximum likelihood estimator of the waiting-time

AssertionError: 

In [4]

tritemio commented 8 years ago

Sometimes the automatic background fitting (with default parameters) can fail like in this case. This happens if the background is very different than typical levels in smFRET experiments.

I quickly run your file and the problem is much bigger though. The distribution of timestamps delays (time lags between macro-times) is truncated for the D-channel and has a strange pattern in both D and A channels.

I don't know why this is happening. Maybe a decoding error or a wrong hardware configuration. I attach the notebook I used to process your file. You can find more info there, maybe it can give you some hints:

https://gist.github.com/tritemio/7593a5f31387e4873784f529b6a9e14e

ncodina commented 8 years ago

I don't know. I tried to run a sample collected at 40 MHz (the previous was at 20 MHz) and the D-channel is also truncated, at half the time than before. I contacted PicoQuant to ask for information about ptu files, I will also ask them if they can help us decode the file and understand what is happening.

I think the background looks better in this sample. It can be related to a BSA coating that I tried to do to the cover-slip in the previous sample, previous to measuring, to prevent surface adsorption.

I'm attaching the links to the new ptu and h5 files, in case you want to have a look at them: https://liveuclac-my.sharepoint.com/personal/ucbecod_ucl_ac_uk/_layouts/15/guestaccess.aspx?guestaccesstoken=ge7x09Rhn3ppgtoSniZd%2fLLORLEJbSG5SJ5zJ0YrfEI%3d&docid=1bb34a81eb3844956b2be73586af6726f&rev=1

https://liveuclac-my.sharepoint.com/personal/ucbecod_ucl_ac_uk/_layouts/15/guestaccess.aspx?guestaccesstoken=DhEiKDcDxPtQpHtRKNLCmRJfvmP%2b%2bZo2Zjo38DsAdLk%3d&docid=167b9c3e2c8f34919bf13a0f8eae9254f&rev=1

tritemio commented 8 years ago

Looking into the code, FRETBursts, for non-ALEX measurements, actually requires that the arrays (detectors, timestamps and nanotimes) have only photons of donor and acceptor channels (no rollover or other spurious counts).

You should re-create the Photon-HDF5 file removing these counts. Sorry about that. I'll fix this issue in FRETBursts in the next release. For now I think you can make it work removing the spurious counts. It makes sense if you think about it. There is a timestamp each rollover so there cannot be a time-lag between timestamps larger than the rollover time.

This does not explain the fine pattern in the histogram of timestamps delays. It look too deterministic to be noise.

tritemio commented 8 years ago

PS: we usually don't treat the coverglass surface for freely-diffusing measurements. At most we add some Tween20 if sticking is an issue. But focusing >50um above the surface sticking should not be an issue.

tritemio commented 8 years ago

I committed a fix on master that allows loading smFRET files with spurious counts (like in your files). In your test file you can compute the background with:

d.calc_bg(fun=bg.exp_fit, time_s=30, tail_min_us=(20, 300, 300))

The Dem delays distribution is not truncated anymore. However the all-photons stream is still truncated at ~100us (and that's why I used a bg threshold of 20us).

At least the FRET histogram and the TCSPC histograms for the high FRET population (only photons inside bursts) look reasonable:

d.burst_search()
ds = d.select_bursts(select_bursts.size, th1=50)
dplot(ds, hist_fret)
ds2 = ds.select_bursts(select_bursts.E, E1=0.6)
mask_d = ds2.ph_in_bursts_mask_ich(ph_sel=Ph_sel(Dex='Dem'))
mask_a = ds2.ph_in_bursts_mask_ich(ph_sel=Ph_sel(Dex='Aem'))

kw = dict(bins=np.arange(1, 100, 0.1), histtype='step', lw=2)
plt.hist(d.nanotimes_[mask_d]*1.6e-2, **kw)
plt.hist(d.nanotimes_[mask_a]*1.6e-2, **kw)
plt.yscale('log')
plt.xlabel('ns')

ncodina commented 8 years ago

Hi @tritemio, thanks very much.

I re-wrote the 20MHz .h5 file with the following nanotimes parameters:

[{'tcspc_num_bins': 3125, 'tcspc_range': 5e-08, 'tcspc_unit': 1.6e-11}]

In the FRETbursts, I used your loading function to remove the overflow counts:

(array([  0,   1,  34,  50,  68, 127], dtype=uint8), array([ 6023938, 11473517,        1,        1,        2,  7234428]))
(array([0, 1], dtype=uint8), array([ 6023938, 11473517]))

and to compute the background:

d.calc_bg(fun=bg.exp_fit, time_s=30, tail_min_us=(20, 300, 300))

I think the background and the FRET histogram look good:

20MHz TimeTrace: 20mhz_timetrace

20MHz Background Histogram: 20mhz_bg_hist

20MHz Background Timetrace: 20mhz_bg_timetrace

20MHz FRET: 20mhz_fret_fit

However, I am having some problem when calculating the background for the 40 MHz file. In this case, nanotimes parameters:

[{'tcspc_num_bins': 3125, 'tcspc_range': 2.5e-08, 'tcspc_unit': 8e-12}]

loading function:

(array([  0,   1,  34,  50,  68,  72, 127], dtype=uint8), array([1448765,  668797,       1,       1,      80,      52, 1744405]))
(array([0, 1], dtype=uint8), array([1448765,  668797]))

using the same expression to calculate the background:

d.calc_bg(fun=bg.exp_fit, time_s=30, tail_min_us=(20, 300, 300))

40MHz TimeTrace: 40mhz_timetrace

40MHz Background Histogram: 40mhz_bg_hist

40MHz Background Timetrace: 40mhz_bg_timetrace

Do you know what could be happening? or how do you decide the values for tail_min_us (minimum waiting-time in micro-secs)?

The .h5 files: https://liveuclac-my.sharepoint.com/personal/ucbecod_ucl_ac_uk/_layouts/15/guestaccess.aspx?guestaccesstoken=%2fYcIXxzZIEarfWTt%2fmOCeVBfzFNZXyrgSBa8YweXZH0%3d&docid=1ce0373e2ef4a4c318cce721edf455299&rev=1

https://liveuclac-my.sharepoint.com/personal/ucbecod_ucl_ac_uk/_layouts/15/guestaccess.aspx?guestaccesstoken=0tXctelxkIiL6BH8%2bojRiSs58D0SwR8qrimbfFKPhEo%3d&docid=195655d2caf2642f4870834c3a3faab86&rev=1

tritemio commented 8 years ago

@ncodina, to manually set the threshold(s) for background fitting in the different photon-streams you can look at the histogram of delays that you plotted here. You want to take enough of the exponential tail (it looks a straight line in log scale) to have a good estimate. If the threshold is too high you have only a few samples. If it is too low you also include the "high-rate" peak due to bursts.

In your case, by eye I would use 300us for DexDem photons and 200us for DexAem photons.

What I don't understand is why the all-photons stream has a cut-off of the delays distribution at ~50us. This does not look right. Just for the sake of preventing the error during the fit you can try further decreasing the all-photons threshold to 10us (e.g. tail_min_us=(10, 300, 200)). But this is a palliative and the background fit for the all-photon stream will be bogus anyway.

tritemio commented 8 years ago

For further details on background estimation see the FRETBursts paper section "Background Definitions":

http://journals.plos.org/plosone/article?id=10.1371%2Fjournal.pone.0160716#sec007

and section "Background Estimation":

http://journals.plos.org/plosone/article?id=10.1371%2Fjournal.pone.0160716#sec015

(I can only link to the main section, not to the subsections of the PLOS ONE paper. The biorxiv version of the paper has also working links to the documentation).

ncodina commented 7 years ago

Thanks so much @tritemio. I did read about background calculation and I have some further questions:

1) The function "hist_bg" is plotting the interphoton delay distribution (the histogram of waiting times between 2 subsequent timestamps)? Why is it necessary to call the function "calc_bg" before? We don't need to calculate background rates to generate this plot?

2) These are the backgrounds that I'm getting for: just buffer, just donor fluorophore, and protein labeled with donor and acceptor.

Buffer:

Donor Fluorophore:

Protein labeled with Donor and Acceptor fluorophores:

Sometimes I get this result for protein as well:

What does it mean that at very high photon rates (short time intervals) the number of events decreases? It also happens on the buffer sample. Is that what should be expected? And does it mean that protein concentration is too low? And why can't it calculate the background rate in this case?

3) Is the function "calc_bg" just calculating the background rate at each time interval? Then when are the bursts corrected for background? Is it later in the function "select_bursts" that the background is subtracted?

tritemio commented 7 years ago

@ncodina, you are right, there is no reason why plotting the interphoton distribution should require fitting background first. Since it is rare having problem fitting the background this problems was not evident before. I'll fix it. Since hist_bg name can also be confusing, I think I'll add a new plot function hist_interphoton which only plots the distributions for the photon streams without any fit (and maybe hist_bg become a thin wrapper around it adding the fitted rates)
Of all the samples you show only the last one look a true single-molecule concentration. In the oter cases the single bursts are not well separated or not visible at all. The distribution decreases because it is the superposition of a "peak" due to bursts and to an exponential tail.
The bursts are normally corrected for background automatically after each burst search d.burst_search. Some operation can retrigger the background correction if the D/A photons are counted from the timestamp array (i.e. any operation internally calling d.calc_ph_num()). You can safely assume that bursts are always background corrected, unless you explicitly disable the correction with:

d.burst_search(m=10, computefret=False, ph_sel=Ph_sel(Dex='DAem'))
d.calc_fret(count_ph=True, corrections=False)

(example taken from the BVA notebook)

Finally I found the reason for the fine pattern in the interphoton distribution. It is a floating point issue causing instabilities in assigning photons to bins when the bin edges fall "exactly" at multiple of a timestamps_unit (25 or 50 ns in your case). Shifting the bin edges by 0.5*timestamps_unit solves the issue. I'll take this into account in the new hist_interphoton plot function.

ncodina commented 7 years ago

@tritemio thanks very much. The problems I was having were coming from not having the correct timestamps. Since you fixed the problem in phconvert and we have the correct the timestamps, I did not find problems measuring background rates.

The auto function has worked for all my measurements:

d.calc_bg(fun=bg.exp_fit, time_s=30, tail_min_us='auto')

and the measured background rates are below 2 kcps.

tritemio commented 7 years ago

Ok, closing then.