2.0 Upgrade compatibility - LightCurve.bin() API changes

dhomeier commented 4 years ago

Problem description

Fleshing out the details on one of the API changes related to the migration to TimeSeries in #744:

lc.bin() now takes time_bin_size and time_bin_start arguments. This is for consistency with astropy.timeseries.aggregate_downsample. This significantly alters the behavior of binning, because bins are now defined in time instead of by number of data points.

(Largely copied and pasted from https://github.com/KeplerGO/lightkurve/pull/744#issuecomment-685993661.) Also as part of this migration, lc.bin() now has an n_bins argument instead of bins, which I think can easily lead to further confusion (it did in my case), since aggregate_downsample always constructs a binned time series spanning n_bins * time_bin_size, even when the latter is just left at its default (0.5 d). That is, the length of the returned time series may bear little relation to that of the input.

As in the case below, lc.bin() by default downsamples into the appropriate number of 12 h bins (187), whereas selecting a smaller or larger value for n_bins without other modifications just creates fewer or more bins of the same size, either covering only part of the full time series or returning a tail of bins without any data:

Example

>>> import lightkurve as lk
>>> lc = lk.search_lightcurve('KIC 10264202', quarter=10).download()
>>> print(len(lc), lc[[0, 1, -2, -1]])
4447        time             flux         flux_err    quality    timecorr   centroid_col centroid_row ... mom_centr1 mom_centr1_err mom_centr2 mom_centr2_err   pos_corr1      pos_corr2   
                    electron / s   electron / s                d           pix          pix      ...    pix          pix          pix          pix            pix            pix      
------------------ -------------- -------------- ------- ------------- ------------ ------------ ... ---------- -------------- ---------- -------------- -------------- --------------
 906.8458140060029            nan            nan       0  2.381186e-03    883.87540    375.10416 ...  883.87540  7.4842718e-04  375.10416  5.2529416e-04  1.0273670e-02  8.9260764e-02
 906.8662481916836  6.6536602e+03  3.3651555e+00    2048  2.381672e-03    883.87373    375.10583 ...  883.87373  7.1239215e-04  375.10583  5.0058536e-04  1.0554121e-02  8.9282103e-02
1000.2477641975565  7.1289229e+03  3.4627781e+00       0  2.327278e-03    883.86059    374.98315 ...  883.86059  6.7171291e-04  374.98315  4.6940739e-04 -1.2190943e-02 -8.3419129e-02
1000.2681971783459  7.1183604e+03  3.4625471e+00       0  2.326758e-03    883.86157    374.98360 ...  883.86157  6.7250407e-04  374.98360  4.6981242e-04 -1.2136324e-02 -8.3385088e-02

>>> lc_binned = lc.bin()
>>> print(len(lc_binned), lc_binned[[0, 1, -2, -1]])
187        time              flux            flux_err        time_bin_start  time_bin_size ...     mom_centr2         mom_centr2_err           pos_corr1            pos_corr2      
                     electron / s      electron / s                            s       ...        pix                  pix                    pix                  pix         
------------------ ---------------- ------------------ ----------------- ------------- ... ------------------ ---------------------- --------------------- --------------------
 907.0958140060029 7062.68505859375 0.6708430480957032 906.8458140060029       43200.0 ... 375.10537754190693 0.00047764836926944554   0.01188729889690876  0.08990895003080368
 907.5958140060029 7124.16259765625 0.7004288832346598 907.3458140060029       43200.0 ...   375.105621099136 0.00047317345160990953   0.01387943048030138  0.09079039096832275
 999.5958140060029 7129.97802734375  0.692869644165039 999.3458140060029       43200.0 ...  374.9837459767506 0.00046900490997359157 -0.012105023488402367 -0.08268008381128311
1000.0958140060029    7034.47265625 0.7530058906191871 999.8458140060029       43200.0 ... 374.98347148604904  0.0004754075198434293 -0.012066206894814968 -0.08332391083240509
>>> lc_binned = lc.bin(n_bins=100)
>>> print(len(lc_binned), lc_binned[[0, 1, -2, -1]])
100        time             flux            flux_err        time_bin_start  time_bin_size ...     mom_centr2         mom_centr2_err            pos_corr1              pos_corr2      
                    electron / s      electron / s                            s       ...        pix                  pix                     pix                    pix         
----------------- ---------------- ------------------ ----------------- ------------- ... ------------------ ---------------------- ----------------------- ---------------------
907.0958140060029 7062.68505859375 0.6708430480957032 906.8458140060029       43200.0 ... 375.10537754190693 0.00047764836926944554     0.01188729889690876   0.08990895003080368
907.5958140060029 7124.16259765625 0.7004288832346598 907.3458140060029       43200.0 ...   375.105621099136 0.00047317345160990953     0.01387943048030138   0.09079039096832275
956.0958140060029 7148.31982421875 0.6926761627197265 955.8458140060029       43200.0 ...  375.0390833485686  0.0004681041755247861 -0.00025425877538509667 -0.004696236923336983
956.5958140060029     7040.7890625 0.7038795948028564 956.3458140060029       43200.0 ...   375.038626957885 0.00047493973397649825 -0.00029425587854348123 -0.005483981221914291

>>> lc_binned = lc.bin(n_bins=300)
>>> print(len(lc_binned), lc_binned[[0, 1, -2, -1]])
300        time             flux            flux_err        time_bin_start  time_bin_size ...     mom_centr2         mom_centr2_err          pos_corr1           pos_corr2     
                    electron / s      electron / s                            s       ...        pix                  pix                   pix                 pix        
----------------- ---------------- ------------------ ----------------- ------------- ... ------------------ ---------------------- ------------------- -------------------
907.0958140060029 7062.68505859375 0.6708430480957032 906.8458140060029       43200.0 ... 375.10537754190693 0.00047764836926944554 0.01188729889690876 0.08990895003080368
907.5958140060029 7124.16259765625 0.7004288832346598 907.3458140060029       43200.0 ...   375.105621099136 0.00047317345160990953 0.01387943048030138 0.09079039096832275
1056.095814006003              nan                nan 1055.845814006003       43200.0 ...                nan                    nan                 nan                 nan
1056.595814006003              nan                nan 1056.345814006003       43200.0 ...                nan                    nan                 nan                 nan
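
The bin counts and bin start times in the transcript above can be reproduced with plain arithmetic. The following is only a sketch of that arithmetic, assuming the default number of bins is the time span divided by time_bin_size, rounded up; the helper names are made up for illustration and are not part of lightkurve or astropy.

```python
import math

# First and last time stamps of the unbinned light curve shown above (days).
T_FIRST = 906.8458140060029
T_LAST = 1000.2681971783459
TIME_BIN_SIZE = 0.5  # days, i.e. the 43200 s reported in the binned tables


def bins_to_cover(t_first, t_last, time_bin_size):
    """Number of fixed-width bins needed to span the data (assumed: round up)."""
    return math.ceil((t_last - t_first) / time_bin_size)


def last_bin_start(t_first, n_bins, time_bin_size):
    """Start time of the final bin when n_bins bins are laid out from t_first."""
    return t_first + (n_bins - 1) * time_bin_size


n_cover = bins_to_cover(T_FIRST, T_LAST, TIME_BIN_SIZE)
print(n_cover)  # 187

for n_bins in (100, 300):
    start = last_bin_start(T_FIRST, n_bins, TIME_BIN_SIZE)
    # Bins beyond the ones needed to cover the data cannot contain any points.
    empty_tail = max(0, n_bins - n_cover)
    print(n_bins, round(start, 4), empty_tail)
# 100 956.3458 0
# 300 1056.3458 113
```

With n_bins=100 the last bin starts about 44 days before the end of the data, and with n_bins=300 the last 113 bins lie entirely beyond it, which matches the truncated and nan-filled outputs above.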

Expected behaviour

To me, a more user-friendly interface would offer functionality equivalent to the old lc.bin(bins=N), perhaps by keeping bins as an additional argument right away, so that the default bin size would be adapted as time_bin_size = (self.time[-1] - self.time[0]) / bins.

Then of course one would need to define a priority scheme and appropriate warnings if two or more of time_bin_size, bins, n_bins are set...
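
As a rough illustration of the proposed default, the bin size could be derived from bins as below. This is a minimal sketch on plain floats (days), not the lightkurve implementation; the function name is hypothetical, and how the final sample is assigned to the last bin edge is deliberately left open.

```python
def time_bin_size_from_bins(time, bins):
    """Width of each bin (same unit as `time`) so that `bins` bins span the data.

    `time` is assumed to be a sorted sequence of floats; this helper is
    hypothetical and only mirrors the formula proposed in the text.
    """
    if bins < 1:
        raise ValueError("bins must be a positive integer")
    return (time[-1] - time[0]) / bins


# First, an intermediate, and last time stamp from the example light curve (days).
time = [906.8458140060029, 950.0, 1000.2681971783459]
size = time_bin_size_from_bins(time, bins=100)
print(size)  # a little under one day per bin for this ~93 d light curve
```

The result would then be passed on as time_bin_size together with n_bins=bins, so that the binned series covers exactly the span of the input.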

To be continued.

dhomeier commented 4 years ago

Just quoting my previous comment on the second kwarg that is now missing in aggregate_downsample: binsize, as a constant number of data points rather than a bin length in time. I think it would actually not be too hard to implement this by directly constructing a BinnedTimeSeries with time_bin_start = self.time[::bins], but maybe the more important question is whether this binning method is really a better option for signal processing. It could obviously provide more homogeneous bin errors, but would that offset the irregular time sampling?
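
To make the trade-off concrete, here is a small pure-Python sketch of binning by a constant number of points, with the slice stride taken as the number of points per bin (called binsize here). It works on plain lists rather than a TimeSeries, and the function name is hypothetical.

```python
def bin_by_point_count(time, flux, binsize):
    """Group consecutive samples into bins of `binsize` points each.

    Returns the start time and mean flux of every bin. The last bin may
    hold fewer points if len(time) is not a multiple of binsize.
    """
    if binsize < 1:
        raise ValueError("binsize must be a positive integer")
    bin_starts = time[::binsize]  # same slicing idea as time[::N] in the text
    means = []
    for i in range(0, len(flux), binsize):
        chunk = flux[i:i + binsize]
        means.append(sum(chunk) / len(chunk))
    return bin_starts, means


# Evenly sampled data with a gap between t=3 and t=10.
time = [0.0, 1.0, 2.0, 3.0, 10.0, 11.0, 12.0, 13.0, 14.0]
flux = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]

starts, means = bin_by_point_count(time, flux, binsize=3)
print(starts)  # [0.0, 3.0, 12.0]
print(means)   # [2.0, 5.0, 8.0]
```

Every bin averages the same number of points, but the bin starts are 3 and then 9 time units apart because of the gap, which is the irregular sampling the question above is about.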

barentsen commented 4 years ago

Thank you for looking into this @dhomeier! Carefully considering the behavior of lc.bin() has been on my list of things to do before we can release v2.0.

It's not good that I broke the old behavior of bin (i.e. based on binsize) without providing at least the ability to still pass a binsize parameter. I remember having a quick go at adding the binsize parameter but I struggled (I can't remember why), so I left this for later.

Now would be a good time to fix this! Are you interested?

dhomeier commented 3 years ago

I've been looking a bit further, and indeed this cannot be done quite so straightforwardly by creating a new BinnedTimeSeries. I guess the code in aggregate_downsample that follows the determination of the bin sizes and the creation of the new BinnedTimeSeries would basically have to be replicated. Not a terribly big deal, but I wonder whether it would be a cleaner solution to modify aggregate_downsample directly upstream. Being new functionality, though, that would land in 4.2 at the earliest, so perhaps a temporary solution within LightCurve is still warranted.

dhomeier commented 3 years ago

The old bin function only accepted one of either bins or binsize, so we should follow that rule and raise a ValueError, too, for any conflicting combinations of bins, n_bins, binsize and time_bin_size.
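
A check along these lines might look as follows. Which combinations actually count as conflicting is not settled in this thread; the sketch assumes that bins, binsize, and the time-based pair n_bins/time_bin_size are three mutually exclusive modes, and the function name is hypothetical.

```python
def check_bin_arguments(bins=None, binsize=None, n_bins=None, time_bin_size=None):
    """Return the selected binning mode, or raise ValueError on a conflict.

    Assumption for illustration only: n_bins and time_bin_size may be
    combined with each other, but not with bins or binsize.
    """
    modes = {
        "bins": bins is not None,
        "binsize": binsize is not None,
        "n_bins/time_bin_size": n_bins is not None or time_bin_size is not None,
    }
    selected = [name for name, used in modes.items() if used]
    if len(selected) > 1:
        raise ValueError(
            "Conflicting binning arguments: " + ", ".join(selected)
        )
    # None means no argument was given and the caller should use its default.
    return selected[0] if selected else None


print(check_bin_arguments(bins=100))                      # bins
print(check_bin_arguments(n_bins=100, time_bin_size=0.5)) # n_bins/time_bin_size
try:
    check_bin_arguments(bins=100, time_bin_size=0.5)
except ValueError as err:
    print(err)  # Conflicting binning arguments: bins, n_bins/time_bin_size
```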

barentsen commented 3 years ago

I'm starting to think that we should revert to Lightkurve's old bin function (i.e. the one that only accepts bins or binsize), so that we can go ahead with Lightkurve v2.0 and spend the remaining time on fixing docs/tutorials instead of re-inventing the bin method.

In this scenario, users can still apply AstroPy's aggregate_downsample method to bin a LightCurve object in time, we just wouldn't make this the default behavior of bin.

Thoughts?

dhomeier commented 3 years ago

It did not seem to me the old function would work on a TimeSeries object out of the box, but I may be wrong. I'll have to contemplate the function of aggregate_downsample a bit more, but I still think it can be done without too much effort.
