unit8co / darts

A python library for user-friendly forecasting and anomaly detection on time series.
https://unit8co.github.io/darts/
Apache License 2.0

Predicting Nan with Stock Data (regardless of model) #772

Closed orthosku closed 2 years ago

orthosku commented 2 years ago

Hi there,

I'm running into trouble with models predicting NaN values. I initially thought this could be caused by using a weekday time series (I'm working with stock data). I saw the post about setting freq='B' for a business-day time index, but even after doing this, the prediction array still contains NaN values. I tried this data with the N-BEATS model as well as the RNN model, and both yielded the same results. Would love some help!

Below is the data I'm using as well as the prediction array readout.

                 close
timestamp             
1999-11-19  142.500000
1999-11-22  142.468704
1999-11-23  141.218704
1999-11-24  141.968704
1999-11-26  141.437500
1999-11-29  140.937500
1999-11-30  139.281204
1999-12-01  140.406204
1999-12-02  141.250000
1999-12-03  143.843704

[5589 rows x 1 columns]
100%|██████████| 100/100 [27:40<00:00, 16.60s/it]
100%|██████████| 229/229 [00:09<00:00, 24.62it/s]
<TimeSeries (DataArray) (time: 229, component: 1, sample: 1)>
array([[[nan]],

       [[nan]],

       [[nan]],

       [[nan]],

       [[nan]],

       [[nan]],

       [[nan]],

       [[nan]],

       [[nan]],

       [[nan]],

...

       [[nan]],

       [[nan]],

       [[nan]],

       [[nan]],

       [[nan]],

       [[nan]],

       [[nan]],

       [[nan]],

       [[nan]],

       [[nan]]])
Coordinates:
  * time       (time) datetime64[ns] 2017-09-11 2017-09-18 ... 2022-01-24
  * component  (component) <U1 '0'
Dimensions without coordinates: sample

dennisbader commented 2 years ago

Hi @orthosku. It looks like you already have missing values in your training data. Can you check the following on your input TimeSeries (where ts is any of your TimeSeries)?

ts.pd_dataframe().isna().any()
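
The thread does not spell out why missing values in the training data would produce all-NaN forecasts. One plausible mechanism, shown here as a minimal sketch rather than as what darts does internally, is that NaN propagates through floating-point arithmetic: any sum or product that touches a NaN is itself NaN. The `weighted_sum` function, the weights, and the window values below are hypothetical stand-ins for a model, not part of darts.

```python
import math


def weighted_sum(window, weights):
    """Hypothetical toy 'model': a weighted sum over a window of past values."""
    return sum(w * x for w, x in zip(weights, window))


weights = [0.2, 0.3, 0.5]                     # illustrative weights (assumption)
clean_window = [141.2, 141.9, 141.4]          # no missing values
gappy_window = [141.2, float("nan"), 141.4]   # one missing value

clean_pred = weighted_sum(clean_window, weights)
gappy_pred = weighted_sum(gappy_window, weights)

print(math.isnan(clean_pred))  # False
print(math.isnan(gappy_pred))  # True: a single NaN poisons the whole result
```

This is why it is worth checking the input series for NaN before training, regardless of which model is used.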

orthosku commented 2 years ago

Ah yes, that returned True. I traced it back, and the NaN values seem to be introduced when the TimeSeries is created.

nan_in_df = SPY.isnull().sum().sum()
print('Number of NaN values present: ' + str(nan_in_df))

Number of NaN values present: 0

print(SPY.head(n=50))
                 close
timestamp             
1999-11-19  142.500000
1999-11-22  142.468704
1999-11-23  141.218704
1999-11-24  141.968704
1999-11-26  141.437500
1999-11-29  140.937500
1999-11-30  139.281204
1999-12-01  140.406204
1999-12-02  141.250000
1999-12-03  143.843704
1999-12-06  142.781204
1999-12-07  141.625000
1999-12-08  140.718704
1999-12-09  141.406204
1999-12-10  141.875000
1999-12-13  142.125000
1999-12-14  140.750000
1999-12-15  141.500000
1999-12-16  142.125000
1999-12-17  142.687500
1999-12-20  141.656204
1999-12-21  143.812500
1999-12-22  144.187500
1999-12-23  146.484299
1999-12-27  146.281204
1999-12-28  146.187500
1999-12-29  146.812500
1999-12-30  146.640594
1999-12-31  146.875000
2000-01-03  145.437500
2000-01-04  139.750000
2000-01-05  140.000000
2000-01-06  137.750000
2000-01-07  145.750000
2000-01-10  146.250000
2000-01-11  144.500000
2000-01-12  143.062500
2000-01-13  145.000000
2000-01-14  146.968704
2000-01-18  145.812500
2000-01-19  147.000000
2000-01-20  144.750000
2000-01-21  144.437500
2000-01-24  140.343704
2000-01-25  141.937500
2000-01-26  140.812500
2000-01-27  140.250000
2000-01-28  135.875000
2000-01-31  139.562500
2000-02-01  140.937500

series = TimeSeries.from_dataframe(SPY, freq='D', fill_missing_dates=False)
print(series.pd_dataframe().isna().any())

component
close    True
dtype: bool

series = TimeSeries.from_dataframe(SPY, freq='B', fill_missing_dates=False)
print(series.pd_dataframe().isna().any())

component
close    True
dtype: bool

series = TimeSeries.from_dataframe(SPY, freq='B', fill_missing_dates=True)
print(series.pd_dataframe().isna().any())

component
close    True
dtype: bool

Above, I tried changing the frequency, and both filling and not filling the missing dates. None of these combinations solves the issue. Any ideas? I saw the prior thread that mentioned using freq='B' when dealing with business-day data.

dennisbader commented 2 years ago

That is good, and yes, you should use freq='B' for business-day frequency. The fill_missing_dates parameter only inserts the missing business days (the dates) as rows with NaN values into your TimeSeries object; it does not fill in the values themselves.
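
To see where such NaN rows can come from even though the DataFrame itself reports zero NaN values, here is a minimal sketch in plain pandas (not darts). It uses the first ten rows of the SPY data posted in this thread and reindexes them onto a full business-day calendar; the variable names are illustrative, and the reindexing step only approximates what inserting missing dates does.

```python
import pandas as pd

# First ten rows of the SPY data shown in this thread.
dates = pd.to_datetime([
    "1999-11-19", "1999-11-22", "1999-11-23", "1999-11-24", "1999-11-26",
    "1999-11-29", "1999-11-30", "1999-12-01", "1999-12-02", "1999-12-03",
])
closes = [142.500000, 142.468704, 141.218704, 141.968704, 141.437500,
          140.937500, 139.281204, 140.406204, 141.250000, 143.843704]
spy = pd.DataFrame({"close": closes}, index=dates)

# Every business day (freq='B') between the first and last observation.
business_days = pd.bdate_range(spy.index.min(), spy.index.max())

# Business days that have no row in the data.
missing_dates = business_days.difference(spy.index)

# Reindexing onto the full business-day calendar inserts NaN rows for them.
reindexed = spy.reindex(business_days)

print("NaN in original frame:", int(spy["close"].isna().sum()))
print("Missing business days:", [d.date().isoformat() for d in missing_dates])
print("NaN after reindexing:", int(reindexed["close"].isna().sum()))
```

In this excerpt the only absent business day is 1999-11-25, a US market holiday (Thanksgiving), so the original frame has no NaN while the reindexed one has exactly one.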

Now, to fill the missing values, you can take a look at our MissingValuesFiller (https://unit8co.github.io/darts/generated_api/darts.dataprocessing.transformers.missing_values_filler.html).

orthosku commented 2 years ago

I may have found a potential solution:

from darts import TimeSeries
from darts.utils.missing_values import fill_missing_values

series = TimeSeries.from_dataframe(SPY, freq='B', fill_missing_dates=False)
series = fill_missing_values(series, fill="auto")
print(series.pd_dataframe().isna().any())

component
close    False
dtype: bool

What I'm struggling to understand is this: if fill_missing_dates=False, how could there have been any missing values to fill? Wouldn't this function only fill values if an index entry were present with a NaN in the close column?

dennisbader commented 2 years ago

To quote @hrzn:

Let's say you have daily data with a missing date:

Mon --> 1
Tue --> 2
Thu --> 4
Fri --> 5

Then fill_missing_dates=True will insert the missing date with NaN values (in the columns):

Mon --> 1
Tue --> 2
Wed --> NaN
Thu --> 4
Fri --> 5

Finally, filling the missing values gives:

Mon --> 1
Tue --> 2
Wed --> 3
Thu --> 4
Fri --> 5
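
The two steps above can be sketched in plain pandas. This is an illustration under stated assumptions, not the darts implementation: the dates are an arbitrary week chosen so that 2022-01-03 is a Monday, and the fill is assumed to be linear interpolation, which matches the Wed --> 3 result in the example.

```python
import pandas as pd

# Mon, Tue, Thu, Fri of an arbitrary week (2022-01-03 is a Monday); Wed is absent.
index = pd.to_datetime(["2022-01-03", "2022-01-04", "2022-01-06", "2022-01-07"])
values = pd.Series([1.0, 2.0, 4.0, 5.0], index=index)

# Step 1: insert the missing date; its value becomes NaN.
with_dates = values.asfreq("D")

# Step 2: fill the NaN by linear interpolation between its neighbours.
filled = with_dates.interpolate(method="linear")

print(with_dates.tolist())  # [1.0, 2.0, nan, 4.0, 5.0]
print(filled.tolist())      # [1.0, 2.0, 3.0, 4.0, 5.0]
```

Keeping the two steps separate makes it easy to inspect which dates were inserted before deciding how, or whether, to fill them.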

orthosku commented 2 years ago

Makes sense, thank you! So in this use case, weekday holidays would be marked as NaN by the TimeSeries object, and these values would then be interpolated.

MRV1N2 commented 2 years ago

Hello orthosku, I've also been working with darts for a few days to predict stock prices. Would you like to exchange experiences? My Discord name is MRV1N#1905.