man-group / arctic

High performance datastore for time series and tick data
https://arctic.readthedocs.io/en/latest/
GNU Lesser General Public License v2.1

Read DateRange longer than available data #621

Open paeder92 opened 6 years ago

paeder92 commented 6 years ago

Arctic Version

1.68

Arctic Store

VersionStore

Platform and version

Windows 10, Anaconda, Python 3.5 environment

Description of problem and/or code sample that reproduces the issue

Hello,

I stored several time series in MongoDB and use date_range to retrieve parts of them, or, as in the case here, to retrieve them in full. For example I use:

t = DateRange(datetime(2015, 1, 1), datetime(2020, 1, 1))
df = mongo.read("timeseries", date_range=t).data

This should give me the entire time series, since it only holds data from 2016 to 2018 and therefore falls into t entirely, correct? It results in a long series of errors involving pandas and arctic: https://pastebin.com/KVzK0RYX

The error must be related to the data of my timeseries not actually extending to 2020, but only until 2018. If I set the end of the DateRange to 2018, I do not get this error.

I know that I could use t = DateRange() to cover the entire data, but due to compatibility with my remaining code (it is difficult to predict in which case the entire range is needed and when only a piece of it is actually needed), I would prefer not to solve it this way.

Thanks and best regards

bmoscon commented 6 years ago

you can use an open ended date range:

DateRange(dt(2015, 1, 1), None)

Also, the issue could be that the data you have written is not datetime indexed. The dataframe would need to have a column in the index called "date" that is a DatetimeIndex.
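Both conditions can be checked with plain pandas before writing the data. The following is a minimal sketch; `has_named_datetime_index` is a hypothetical helper written for illustration, not part of arctic, and it only covers the simple case of a single-level index.

```python
from datetime import datetime

import pandas as pd


def has_named_datetime_index(df, name="date"):
    """Return True if df is indexed by a DatetimeIndex called `name`.

    Hypothetical helper for illustration; not an arctic function.
    """
    return isinstance(df.index, pd.DatetimeIndex) and df.index.name == name


# A frame with a DatetimeIndex named "date".
good = pd.DataFrame(
    {"data": [1, 2]},
    index=pd.DatetimeIndex(
        [datetime(2016, 1, 1), datetime(2016, 1, 2)], name="date"
    ),
)

# A frame with the default RangeIndex, so there is no "date" index at all.
bad = pd.DataFrame({"data": [1, 2]})

print(has_named_datetime_index(good))
print(has_named_datetime_index(bad))
```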

paeder92 commented 6 years ago

I checked, and the data is datetime indexed and the name of the index is "date". Is there any way to avoid using open ends? I would like to keep my approach as generic as possible, which is why having an over-long DateRange is more convenient.

bmoscon commented 6 years ago

I'd have to have a sample of the data, or some other example that reproduces the issue to know what is going on. It works for me:

lib.read('demo', date_range=DateRange(dt(2015,1,1), dt(2020,1,1))).data

            data
date            
2016-01-01     1
2016-01-02     2

bmoscon commented 6 years ago

the only other thing that comes to mind is an issue with pandas - do you know what version of pandas you have installed?

paeder92 commented 6 years ago

I am using pandas 0.23.0

paeder92 commented 6 years ago

Updated to 0.23.4 now, still the same issue

paeder92 commented 6 years ago

I was able to narrow it down: the problem comes up when inserting the data using append, and only if it is used twice. I tried the following data:

                         value
date
2015-11-17 13:32:38.636      1
2017-12-07 01:34:54.500      2

and then ran the following:

arc.append("test", df, upsert=True)
t = DateRange(datetime(2010,1,1), datetime(2020,1,1))
arc.read("test", date_range=t).data

This works. However, if I again run: arc.append("test", df, upsert=True)

Then it is no longer possible to retrieve the data using arc.read("test", date_range=t).data and an error as described above appears.

paeder92 commented 6 years ago

The issue seems to originate in _daterange of _pandas_ndarray_store.py. I disabled timezones in to_pandas_closed_closed of date._util (although that probably does not have anything to do with the issue) and changed _daterange to the following:

def _daterange(self, recarr, date_range):
    """ Given a recarr, slice out the given arctic.date.DateRange if a
    datetime64 index exists """
    idx = self._datetime64_index(recarr)
    if idx and len(recarr):
        dts = recarr[idx]
        mask = Series(np.zeros(len(dts)), index=dts)
        start, end = _start_end(date_range, dts)
        if start < np.datetime64(min(dts)):
            start = np.datetime64(min(dts))
        if end > np.datetime64(max(dts)):
            end = np.datetime64(max(dts))
        mask[start:end] = 1.0
        return recarr[mask.values.astype(bool)]
    return recarr

Now it is working. It seems that sometimes, mask[start:end] struggles with values that are outside of the range of dts and sometimes it does not?
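An alternative to clamping start and end would be to build the mask by comparison rather than by label slicing, since a comparison does not depend on the order of the timestamps. This is only a sketch with NumPy, assuming closed bounds on both ends; `mask_by_comparison` is a hypothetical helper, not arctic's implementation, and the sample timestamps are a guess at how the index might look after appending the same two rows twice.

```python
import numpy as np


def mask_by_comparison(dts, start, end):
    """Boolean mask of the entries inside the closed interval [start, end].

    Hypothetical helper for illustration. It works on unsorted and
    duplicated timestamps because start and end are never looked up
    as labels; every element is simply compared against them.
    """
    dts = np.asarray(dts, dtype="datetime64[ns]")
    return (dts >= np.datetime64(start)) & (dts <= np.datetime64(end))


# Assumed layout after two appends of the same data: a, b, a, b.
dts = np.array(
    ["2015-11-17T13:32:38.636", "2017-12-07T01:34:54.500"] * 2,
    dtype="datetime64[ns]",
)

# Bounds far outside the stored data still select every row.
mask = mask_by_comparison(dts, "2010-01-01", "2020-01-01")
print(mask.sum())
```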

bmoscon commented 6 years ago

I'm guessing it's because the DatetimeIndex is no longer sorted when you append twice.
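That guess can be checked in pandas alone, without arctic. The sketch below assumes the index ends up in the order a, b, a, b after the second append (an assumption, not something confirmed in this thread). On a non-monotonic DatetimeIndex, label slicing with Timestamp bounds that are not present in the index raises a KeyError, while the same slice works once the index is sorted.

```python
import pandas as pd

# Assumed index after appending the same two rows twice: a, b, a, b.
idx = pd.DatetimeIndex(
    ["2015-11-17 13:32:38.636", "2017-12-07 01:34:54.500"] * 2, name="date"
)
mask = pd.Series(0.0, index=idx)

# Bounds that lie outside the stored data, as in the failing read.
start, end = pd.Timestamp(2010, 1, 1), pd.Timestamp(2020, 1, 1)

print(idx.is_monotonic_increasing)

# Label slicing needs a monotonic index to locate bounds that are
# not themselves present as labels.
try:
    mask.loc[start:end]
    failed = False
except KeyError:
    failed = True
print(failed)

# After sorting, the same slice succeeds and returns every row.
sorted_mask = mask.sort_index()
print(len(sorted_mask.loc[start:end]))
```

If this is indeed the cause, sorting the data before slicing, or using an order-independent comparison, would avoid the error without having to shorten the DateRange.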