unit8co / darts

A python library for user-friendly forecasting and anomaly detection on time series.
https://unit8co.github.io/darts/
Apache License 2.0

[QUESTION] It takes a very long time to create a TimeSeries object from a huge xarray.DataArray #2081

Open thingumajig opened 8 months ago

thingumajig commented 8 months ago

Simple steps. There is a large dataset in netCDF4 format, prepared for loading into Darts:

import xarray as xr

# Reading (lazily, with dask chunking):
with xr.open_dataarray(str(data_file), engine='netcdf4', format='NETCDF4', mode='r',
                       chunks={'time': '100MB'}) as da:
    # Print the DataArray summary
    print(da)

Output:

<xarray.DataArray (time: 11523899, component: 797, sample: 1)>
dask.array<open_dataset-7ad91747aaa917b02ed5686d60f4e86d__xarray_dataarray_variable__, shape=(11523899, 797, 1), dtype=float32, chunksize=(31367, 797, 1), chunktype=numpy.ndarray>
Coordinates:
  * time       (time) datetime64[ns] 2018-07-01 ... 2018-11-30T23:59:59
  * component  (component) object 'MBV31AP001__XQ11' ... 'MBV31CE002__XQ01'
Dimensions without coordinates: sample

Earlier attempts at full resolution took too long, so I resample first:

da1min = da.resample(time="1Min", closed='right').mean()

It takes about 17 seconds, which is fine, and I get a downsampled xarray (screenshot omitted).

Then I just try to create Darts TimeSeries from this array in different ways:

from darts import TimeSeries
# ts = TimeSeries.from_xarray(da1min)
ts = TimeSeries(da1min)  # another attempt: just a simple Darts view of the data

This already takes over 11 minutes(!); for the full version of my data, I gave up waiting for it to finish. Is the data being cloned? I only need a new view of the data. Or am I doing something wrong?
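For context, the difference the asker is pointing at can be illustrated with plain numpy (this is not darts internals, just a sketch of why a copy costs far more than a view):

```python
import numpy as np

# A view is O(1) and shares memory with its source; a copy materializes
# everything into a new allocation (~32 MB for this array).
a = np.zeros((1_000_000, 8), dtype=np.float32)
view = a[::2]
copy = a.copy()
print(view.base is a)   # True: shares memory with `a`
print(copy.base is a)   # False: independent allocation
```

For a lazily chunked dask-backed array, a full copy additionally forces all chunks to be computed, which is where the minutes go.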

Okay, on the other hand, I can do the following:

x = da1min.compute()

This takes about 5 minutes and materializes da1min in memory. Then:

ts = TimeSeries.from_xarray(x)

This takes about 0.4 s. But for truly huge data, I don't think this is a viable approach.

Are there any general guidelines for handling data that doesn't fit in memory?

System:

dennisbader commented 7 months ago

Hi @thingumajig, and thanks for writing.

I can't tell exactly where the issue is coming from (I can't reproduce the data), but I suspect it is the copy of the data made when creating a TimeSeries in this line.

The reason is that we guarantee that each TimeSeries is immutable (and that the source is not mutated), to avoid a lot of pitfalls down the line.
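The pitfall the copy guards against can be sketched with plain numpy (again, not darts internals, just an illustration):

```python
import numpy as np

# A view silently reflects later mutations of its source; a snapshot copy
# (which is effectively what an immutable TimeSeries takes) does not.
src = np.arange(5, dtype=np.float32)
view = src[:]          # shares memory with src
snapshot = src.copy()  # independent snapshot
src[0] = 99.0
print(view[0])      # the view changed underneath us
print(snapshot[0])  # the snapshot is stable
```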

We're always open to suggestions on how to improve things, as long as we can keep these guarantees.

thingumajig commented 7 months ago

Hi @dennisbader, thank you for your reply

...I suspect it is the copy of the data made when creating a TimeSeries in this line. The reason is that we guarantee that each TimeSeries is immutable (and that the source is not mutated), to avoid a lot of pitfalls down the line.

I looked at the _sort_index source code and saw a monotonicity check. Looking at my data again, it turns out there is at least one time gap, so I'll re-examine the data.
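A quick way to see why monotonicity alone doesn't catch this: an index can be sorted (so the monotonicity check passes) and still contain gaps. A hypothetical check on a pandas DatetimeIndex like the one backing `da1min`:

```python
import pandas as pd

# Build a 1-minute index and simulate a gap by deleting one timestamp.
idx = pd.date_range("2018-07-01", periods=10, freq="1min").delete(5)
print(idx.is_monotonic_increasing)  # True: still sorted despite the gap

# Detect gaps by comparing consecutive deltas against the expected frequency.
deltas = pd.Series(idx).diff().dropna()
gaps = deltas[deltas != pd.Timedelta("1min")]
print(len(gaps))  # one 2-minute jump where the timestamp is missing
```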

But still, maybe gigantic datasets would be better represented as multiple series? And perhaps TimeSeries could be created lazily (delayed, in the Dask sense)?
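In the meantime, one workaround along those lines is to materialize one manageable time window at a time and build a separate series per window. The windowing helper below is purely illustrative (not a darts API); inside the loop one would call something like TimeSeries.from_dataframe(window):

```python
import numpy as np
import pandas as pd

def iter_windows(df: pd.DataFrame, freq: str = "30D"):
    """Yield consecutive time windows of a datetime-indexed frame."""
    for _, window in df.groupby(pd.Grouper(freq=freq)):
        if not window.empty:
            yield window  # in practice: TimeSeries.from_dataframe(window)

# 120 days of daily data split into four 30-day windows.
idx = pd.date_range("2018-07-01", periods=120, freq="D")
df = pd.DataFrame({"x": np.arange(120.0)}, index=idx)
print(sum(1 for _ in iter_windows(df)))
```

Each window is small enough to copy cheaply, at the cost of managing a list of series instead of one.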