grd349 / PBjam

A repo for our peak baggin code and tips on jam
MIT License

Timeseries with zero median causes Lightkurve to complain about NaN values #159

Closed: warrickball closed this issue 4 years ago

warrickball commented 5 years ago

Hi,

The following snippet

import numpy as np
import pbjam

N = 1000
timeseries = np.vstack([np.arange(N)*1800., np.random.randn(N)])
timeseries[1] -= np.median(timeseries[1])
timeseries[1] -= np.median(timeseries[1]) # yes, I mean it
s = pbjam.session(ID='0', numax=(80.0, 5.0), dnu=(6.3, 0.2),
                  teff=(5000, 50), bp_rp=(1.0, 0.01), make_plots=True,
                  timeseries=timeseries)
s()

causes this error in lightkurve:

/home/wball/.local/lib/python3.7/site-packages/lightkurve/periodogram.py:744: LightkurveWarning: Input light curve will be normalized.
  LightkurveWarning)
Traceback (most recent call last):
  File "PBjam_bug_3.py", line 14, in <module>
    timeseries=timeseries)
  File "/home/wball/pypi/PBjam/pbjam/jar.py", line 604, in __init__
    lk_to_pg(vardf)
  File "/home/wball/pypi/PBjam/pbjam/jar.py", line 431, in lk_to_pg
    vardf.loc[i, key] = lk_lc.to_periodogram(freq_unit=units.microHertz, normalization='psd').flatten()
  File "/home/wball/.local/lib/python3.7/site-packages/lightkurve/lightcurve.py", line 1230, in to_periodogram
    return LombScarglePeriodogram.from_lightcurve(lc=self, **kwargs)
  File "/home/wball/.local/lib/python3.7/site-packages/lightkurve/periodogram.py", line 760, in from_lightcurve
    raise ValueError('Lightcurve contains NaN values. Use lc.remove_nans()'
ValueError: Lightcurve contains NaN values. Use lc.remove_nans() to remove NaN values from a LightCurve.

This seems to be because lightkurve insists on normalising the light curve by dividing by the median, which presumably produces non-finite values that lightkurve's error describes as NaNs (even though I'm pretty sure the problem is Infs, not NaNs). I noticed that, when the timeseries is provided as a filename, PBjam handles this (in jar.lc_to_lk()) by adding a small offset, but this doesn't seem to happen when the timeseries is provided as an array.
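
For illustration, dividing a zero-median flux by its median gives Infs (a NaN only appears where the flux itself is exactly zero):

import numpy as np

flux = np.random.randn(1000)
flux -= np.median(flux)
flux -= np.median(flux)  # median is now exactly 0.0, as in the snippet above
with np.errstate(divide='ignore', invalid='ignore'):
    normalised = flux / np.median(flux)  # what a median normalisation would do
print(normalised[:5])  # inf and -inf values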

grd349 commented 5 years ago

Thanks @warrickball - I think this all comes from the normalisation. We should put an if statement around the LC prep that skips normalising when LC.flux.mean() < tol.
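
Something like this, as a minimal sketch (the function name and tol are illustrative; note that to_periodogram may still normalise internally in lightkurve v1, in which case adding a small offset, as jar.lc_to_lk() does for files, might be needed instead):

import numpy as np
from astropy import units

def lk_to_pg_guarded(lk_lc, tol=1e-6):
    # Only let lightkurve normalise when the flux is not already
    # centred on zero; dividing by a ~zero median is what produces
    # the non-finite values in the traceback above.
    if np.abs(np.mean(lk_lc.flux)) > tol:
        lk_lc = lk_lc.normalize()
    return lk_lc.to_periodogram(freq_unit=units.microHertz,
                                normalization='psd')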

nielsenmb commented 5 years ago

This particular aspect of LightKurve is being overhauled at the moment. It should be fixed by requiring PBjam to depend on the latest version of LightKurve (in the near future).

ojhall94 commented 5 years ago

This should be fixed now with v1.2.0. We should try it out; I'm not sure whether the behaviour will be consistent.

ojhall94 commented 5 years ago

Okay so having played around with it, to retain periodogram units of ppm^2/microHz, we need to call:

lc.normalize(unit='ppm').to_periodogram(normalization='psd')

whereas before we simply called:

lc.normalize().to_periodogram(normalization='psd')

I think this should work for zero-centered data as well, but I haven't tried that yet.
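
A quick unit check, as a sketch against lightkurve >= 1.2 (synthetic flux centred on 1; the zero-centred case would still need testing, as noted):

import numpy as np
import lightkurve as lk
from astropy import units

t = np.arange(1000) * 1800. / 86400.    # time in days
f = 1.0 + 1e-4 * np.random.randn(1000)  # flux centred on 1
lc = lk.LightCurve(time=t, flux=f)
pg = lc.normalize(unit='ppm').to_periodogram(freq_unit=units.microHertz,
                                             normalization='psd')
print(pg.power.unit)  # expect ppm2 / uHz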

grd349 commented 5 years ago

Thanks @ojhall94 - we should update the dependency to lightkurve==1.2.0 and then pass the updated unit to normalize. @alexlyttle - do you want to have a swing at this one?

alexlyttle commented 5 years ago

I could give that a go; I'll update lightkurve and try this out on my fork.

grd349 commented 5 years ago

@alexlyttle - do you know about the dependencies in setup.py? If not, let's talk tomorrow.

alexlyttle commented 5 years ago

@grd349 I think so. I believe it's just a case of changing lightkurve>=1.0.1 to lightkurve>=1.2.0 in setup.py, which makes sure that the installed version of lightkurve is at least 1.2.0.

nielsenmb commented 5 years ago

If you guys are going to be playing with the dependencies in setup.py, can you make it pull the dependencies from requirements.txt? Then we only need to update the dependencies in one place.

It could maybe be done with an easy open('requirements.txt', 'r').read() call in the setup.py, but I'm not sure.

The requirements.txt is what sphinx/readthedocs uses to compile everything for the docs page.
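
Something like this in setup.py might do it, as a sketch (the line-by-line parsing is an assumption about the file format):

from setuptools import setup

# read the dependencies from requirements.txt so they only live in one place
with open('requirements.txt') as f:
    requirements = [line.strip() for line in f
                    if line.strip() and not line.startswith('#')]

setup(
    name='pbjam',
    install_requires=requirements,
    # other setup() arguments unchanged
)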

alexlyttle commented 5 years ago

@nielsenmb that makes sense, I can also try that on my fork and see what happens. Will chat with @grd349 tomorrow about it too.

nielsenmb commented 5 years ago

There appears to be a bug in the lightkurve normalize function: you get the wrong scale of offset on the flux when using normalize(unit='ppm'). Basically it looks like they've forgotten to subtract 1 when converting from flux to ppm, so you wind up with a mean of 1e6, rather than 0, after converting.

It might still work for our purposes though, since to_periodogram should handle offsets in the input time series.

Just so you're aware.
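
For illustration, the suspected off-by-one (a sketch, not lightkurve's actual code):

import numpy as np

flux = 1.0 + 1e-4 * np.random.randn(1000)  # median-normalised flux, centred on 1
buggy_ppm = flux * 1e6                     # mean ~ 1e6, as described above
correct_ppm = (flux - 1.0) * 1e6           # mean ~ 0
print(np.mean(buggy_ppm), np.mean(correct_ppm))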

alexlyttle commented 5 years ago

@nielsenmb thanks for the heads-up!

alexlyttle commented 5 years ago

@nielsenmb I modified the normalize() function in Lightkurve so that it takes away 1 from the flux before converting to percent, ppt and ppm. I intend to create a pull request/issue on lightkurve about this, but want to mention some of the side-effects of the change first.

For example, I noticed that flattening the light curve before/after normalisation massively changes the appearance of the light curve. This could be because of the behaviour of LightCurve.flatten() near zero. See the attached pdf for plots (using the example star in the PBjam docs): lightkurve-test.pdf
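
As a sketch of the suspected mechanism, assuming (per the lightkurve docs) that flatten() divides the flux by a Savitzky-Golay-smoothed trend, which is itself close to zero for zero-centred data:

import numpy as np
from scipy.signal import savgol_filter

flux = 1e-4 * np.random.randn(1000)            # zero-centred flux
trend = savgol_filter(flux, 101, polyorder=2)  # smoothed trend, itself ~0
flattened = flux / trend                       # blows up wherever trend ~ 0
print(np.max(np.abs(flattened)))               # typically a huge number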

This might be more appropriate in a separate Lightkurve issue but I thought it would be useful here too.

alexlyttle commented 5 years ago

I tried running the script in the first comment of this thread and it raises an error when trying to initialise session. @warrickball, do you know which version of PBjam you ran this on? I was looking back to see if there was a change in the code somewhere that caused this, but it is pretty difficult to pinpoint, so I will leave it here for now.

Here is the error, which arises after running the script using PBjam version 0.1.6 on macOS 10.15:


---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~/.virtualenvs/pbjam/lib/python3.7/site-packages/pandas-0.25.1-py3.7-macosx-10.14-x86_64.egg/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2896             try:
-> 2897                 return self._engine.get_loc(key)
   2898             except KeyError:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'timeseries'

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
~/.virtualenvs/pbjam/lib/python3.7/site-packages/pandas-0.25.1-py3.7-macosx-10.14-x86_64.egg/pandas/core/internals/managers.py in set(self, item, value)
   1068         try:
-> 1069             loc = self.items.get_loc(item)
   1070         except KeyError:

~/.virtualenvs/pbjam/lib/python3.7/site-packages/pandas-0.25.1-py3.7-macosx-10.14-x86_64.egg/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2898             except KeyError:
-> 2899                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2900         indexer = self.get_indexer([key], method=method, tolerance=tolerance)

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'timeseries'

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-2-ebe582da4175> in <module>
      1 s = pb.session(ID='0', numax=(80.0, 5.0), dnu=(6.3, 0.2),
      2                teff=(5000, 50), bp_rp=(1.0, 0.01),
----> 3                timeseries=timeseries)

~/Projects/PBjam/pbjam/session.py in __init__(self, ID, numax, dnu, teff, bp_rp, timeseries, psd, dictlike, use_cached, cadence, campaign, sector, month, quarter, path, download_dir)
    620                                         campaign=campaign, sector=sector,
    621                                         month=month, quarter=quarter)
--> 622             format_col(vardf, timeseries, 'timeseries')
    623             format_col(vardf, psd, 'psd')
    624 

~/Projects/PBjam/pbjam/session.py in format_col(vardf, col, key)
    357         x = np.array(col[0, :], dtype=float)
    358         y = np.array(col[1, :], dtype=float)
--> 359         vardf[key] = np.array([arr_to_lk(x, y, vardf['ID'][0], key)])
    360 
    361     # If dim = 3, it's a list of arrays or tuples

~/.virtualenvs/pbjam/lib/python3.7/site-packages/pandas-0.25.1-py3.7-macosx-10.14-x86_64.egg/pandas/core/frame.py in __setitem__(self, key, value)
   3470         else:
   3471             # set column
-> 3472             self._set_item(key, value)
   3473 
   3474     def _setitem_slice(self, key, value):

~/.virtualenvs/pbjam/lib/python3.7/site-packages/pandas-0.25.1-py3.7-macosx-10.14-x86_64.egg/pandas/core/frame.py in _set_item(self, key, value)
   3548         self._ensure_valid_index(value)
   3549         value = self._sanitize_column(key, value)
-> 3550         NDFrame._set_item(self, key, value)
   3551 
   3552         # check if we are modifying a copy

~/.virtualenvs/pbjam/lib/python3.7/site-packages/pandas-0.25.1-py3.7-macosx-10.14-x86_64.egg/pandas/core/generic.py in _set_item(self, key, value)
   3379 
   3380     def _set_item(self, key, value):
-> 3381         self._data.set(key, value)
   3382         self._clear_item_cache()
   3383 

~/.virtualenvs/pbjam/lib/python3.7/site-packages/pandas-0.25.1-py3.7-macosx-10.14-x86_64.egg/pandas/core/internals/managers.py in set(self, item, value)
   1070         except KeyError:
   1071             # This item wasn't present, just insert at end
-> 1072             self.insert(len(self.items), item, value)
   1073             return
   1074 

~/.virtualenvs/pbjam/lib/python3.7/site-packages/pandas-0.25.1-py3.7-macosx-10.14-x86_64.egg/pandas/core/internals/managers.py in insert(self, loc, item, value, allow_duplicates)
   1179         new_axis = self.items.insert(loc, item)
   1180 
-> 1181         block = make_block(values=value, ndim=self.ndim, placement=slice(loc, loc + 1))
   1182 
   1183         for blkno, count in _fast_count_smallints(self._blknos[loc:]):

~/.virtualenvs/pbjam/lib/python3.7/site-packages/pandas-0.25.1-py3.7-macosx-10.14-x86_64.egg/pandas/core/internals/blocks.py in make_block(values, placement, klass, ndim, dtype, fastpath)
   3265         values = DatetimeArray._simple_new(values, dtype=dtype)
   3266 
-> 3267     return klass(values, ndim=ndim, placement=placement)
   3268 
   3269 

~/.virtualenvs/pbjam/lib/python3.7/site-packages/pandas-0.25.1-py3.7-macosx-10.14-x86_64.egg/pandas/core/internals/blocks.py in __init__(self, values, placement, ndim)
   2773             values = np.array(values, dtype=object)
   2774 
-> 2775         super().__init__(values, ndim=ndim, placement=placement)
   2776 
   2777     @property

~/.virtualenvs/pbjam/lib/python3.7/site-packages/pandas-0.25.1-py3.7-macosx-10.14-x86_64.egg/pandas/core/internals/blocks.py in __init__(self, values, placement, ndim)
    126             raise ValueError(
    127                 "Wrong number of items passed {val}, placement implies "
--> 128                 "{mgr}".format(val=len(self.values), mgr=len(self.mgr_locs))
    129             )
    130 

ValueError: Wrong number of items passed 1000, placement implies 1

nielsenmb commented 5 years ago

Getting the same error as Alex, investigating...

nielsenmb commented 5 years ago

Seems like the offending line is session.py line 359:

vardf[key] = np.array([arr_to_lk(x, y, vardf['ID'][0], key)])

For some obscure reason, when the output from arr_to_lk is wrapped in an array, it messes up creating the new column in the pandas DataFrame; presumably the array-like LightCurve gets expanded into a 2-D array rather than stored as a single object. Replacing it with

vardf[key] = [arr_to_lk(x, y, vardf['ID'][0], key)]

seems to have solved this for me...
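
A minimal sketch of that failure mode, with a generic sequence-like object standing in for the LightCurve:

import numpy as np
import pandas as pd

class ArrayLike:
    # stand-in for the LightCurve: a sequence-like object of length 1000
    def __init__(self, n):
        self.values = np.zeros(n)
    def __len__(self):
        return len(self.values)
    def __getitem__(self, i):
        return self.values[i]

df = pd.DataFrame({'ID': ['0']})
obj = ArrayLike(1000)

df['ok'] = [obj]                 # plain list: one row holding one object
try:
    df['bad'] = np.array([obj])  # NumPy expands this to shape (1, 1000)
except ValueError as err:
    print(err)  # ValueError, as in the traceback above (message varies by pandas version)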

@alexlyttle can you try changing this and see if it works on your end? Maybe test it on a real time series too. Feel free to include it in pull request #171 if it does. Then I'll merge it.

alexlyttle commented 5 years ago

That's odd, but if that fixes it then it should be fine. I'll give that a go and add it to the pull request if it works.

warrickball commented 5 years ago

I don't think I've updated anything since opening this issue and I still get the same error. I seem to have PBjam version 0.1.0:

$ pip3 show pbjam
Name: pbjam
Version: 0.1.0
...

Specifically, I've got commit 404d2edb1023a33bdeaa2ceb4543414e3a8bbace:

$ git log | head
commit 404d2edb1023a33bdeaa2ceb4543414e3a8bbace
Author: grd349 <grd349@gmail.com>
Date:   Thu Aug 22 17:03:47 2019 +0100

    Various updates to prior and stage2 checking
...

I won't try upgrading yet in case it makes it hard to reproduce the error.

alexlyttle commented 5 years ago

Cheers @warrickball. With the latest version of PBjam on git, lightkurve v1.2.0, and the fix suggested by @nielsenmb added to my fork, I don't get the lightkurve error you had, or the pandas KeyError I quoted.

Now I just get this warning upon initialising session:

UserWarning: Input numax is greater than Nyquist frequeny for 0

but otherwise it initialises fine. Then, when I call session it complains that the prior is very large and may be slow - presumably because the custom timeseries is random so it's going to struggle?

nielsenmb commented 5 years ago

Yup, the numax warning is probably because you are using a fake time series. You can try using a real timeseries.

The warning about the slow KDE occurs when the prior data it's being trained on contains more than 1000 targets. This probably happens because of some combination of large errors on numax and dnu, and because you are in a particularly dense part of the prior sample.

alexlyttle commented 5 years ago

It works fine using a real time series, no warnings and no issues after implementing your suggested fix @nielsenmb.

nielsenmb commented 5 years ago

Cool, feel free to commit the changes to your pull request, and I'll merge it into the master branch.

warrickball commented 5 years ago

I upgraded yesterday afternoon so now have version 0.1.6 (and lightkurve 1.2.0). The same script seems to run to completion with the following output, which may or may not be worrying:

/home/wball/pypi/PBjam/pbjam/session.py:645: UserWarning: Input numax is greater than Nyquist frequeny for 0
  warnings.warn("Input numax is greater than Nyquist frequeny for %s" % (st.ID))
/home/wball/pypi/PBjam/pbjam/priors.py:66: UserWarning: You have lots data points in your prior - estimating the KDE band width will be slow!
  warnings.warn('You have lots data points in your prior - estimating' +
 72%|████████████████████████████████████████████▏                | 14496/20000 [06:08<02:15, 40.61it/s]Converged after 14500 iterations.
 72%|████████████████████████████████████████████▏                | 14500/20000 [06:08<02:19, 39.32it/s]
100%|█████████████████████████████████████████████████████████████| 20000/20000 [11:19<00:00, 29.42it/s]
Star 0 produced an exception of type ValueError occurred. Arguments:
('min() arg is an empty sequence',)

I've run it twice and got the same result, so it probably isn't random.