catherinening commented 8 months ago

Prerequisites

[ x]

Describe the bug

I am trying to build a global or global/local model with monthly time series data, based on this tutorial, with dates at the start of each month ranging up to three years. There are ~200 series (of subscriber cohorts), and each series ranges from 1 to 36 observations, but the most recent observation is the same date across all series.

I am repeatedly getting a ValueError: Invalid frequency: NaT when running NeuralProphet().test on the test data set, obtained after running the NeuralProphet.split_df() method. I initially got the

Note: I had run into this error earlier, when trying to split the dataset by calling NeuralProphet().split_df(df, freq='MS', local_split=True). I was able to resolve that earlier issue by NOT converting the 'ds' column in my DataFrame to pd.datetime before passing it into split_df(), and also by removing series with very few (<5) samples, so that the number of training samples is guaranteed to be > 1.

To Reproduce

Steps to reproduce the behavior:

Start with a df containing ~200 series, each corresponding to subscribers that signed up in the same month. Observations are collected monthly at the start of each month, ranging from July 2020 to June 2023. The response variable is the number of remaining subscribers ('y') at each date ('ds'). Some time series have all 36 observations, but some may have as few as one, reflecting new groups of subscribers. However, as noted above, I removed any series with five or fewer observations.

I initiated the NeuralProphet instance, split the data into training and test sets, fit the model, and made predictions. I then tried to compute metrics on the test set using m.test(), and that is when I get the error:

m = NeuralProphet(
    trend_global_local="local",
    season_global_local="local",
    changepoints_range=0.8,
    epochs=20,
    trend_reg=5,
)
m.set_plotting_backend("plotly-static")
df_train, df_test = m.split_df(monthly_df, freq='MS', valid_p=0.33, local_split=True)
metrics = m.fit(df_train, freq="MS")
future = m.make_future_dataframe(df_train, periods=12, n_historic_predictions=True)
forecast = m.predict(future)
test_metrics_local = m.test(df_test)

After running test_metrics_local = m.test(df_test), this is the full error message I get:


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-51-c7dc5fdc4eb8> in <cell line: 1>()
----> 1 test_metrics_local = m.test(df_test)
      2 test_metrics_local

5 frames
/usr/local/lib/python3.10/dist-packages/pandas/core/arrays/datetimes.py in _generate_range(cls, start, end, periods, freq, tz, normalize, ambiguous, nonexistent, inclusive, unit)
    419                 "and freq, exactly three must be specified"
    420             )
--> 421         freq = to_offset(freq)
    422 
    423         if start is not None:

offsets.pyx in pandas._libs.tslibs.offsets.to_offset()

ValueError: Invalid frequency: NaT


**Expected behavior**
I was following the code in the tutorial linked earlier, and expected to see test performance metrics. 

**What actually happens**

See above; this happens every time I try to run the code. 

**Environment (please complete the following information):**
Google Colab, using the following pip commands prior to passing in import statements

if "google.colab" in str(get_ipython()):
    # uninstall preinstalled packages from Colab to avoid conflicts
    !pip uninstall -y torch notebook notebook_shim tensorflow tensorflow-datasets prophet torchaudio torchdata torchtext torchvision
    # !pip install git+https://github.com/ourownstory/neural_prophet.git # may take a while
    !pip install neuralprophet # much faster, but may not have the latest upgrades/bugfixes

!pip install -U kaleido

ourownstory commented 8 months ago

Hi @catherinening, thank you for raising this issue with a detailed description. Could you please include a minimal toy/synthetic dataset that triggers the same issue, so I can reproduce and debug this? Thank you.

catherinening commented 8 months ago

dummy_dataset.csv

Hi @ourownstory, here is a small synthetic dataset that triggers the error I described.

In the meantime, are there other ways I can calculate prediction error?
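
One possible way to get error metrics without calling m.test() is to join the forecast to the held-out rows and compute the errors with pandas. The sketch below is only an illustration under stated assumptions: the series identifier column is named 'ID', the forecast column is named 'yhat1', and the helper name per_series_errors is hypothetical and not part of NeuralProphet.

```python
import numpy as np
import pandas as pd


def per_series_errors(forecast, df_test, id_col="ID", yhat_col="yhat1"):
    """Compute MAE and RMSE per series by joining forecasts to test rows.

    Assumes both frames carry `id_col` and 'ds', that `df_test` has 'y',
    and that `forecast` has `yhat_col` (column names are assumptions).
    """
    # Make sure both 'ds' columns share the same dtype before merging.
    forecast = forecast.assign(ds=pd.to_datetime(forecast["ds"]))
    df_test = df_test.assign(ds=pd.to_datetime(df_test["ds"]))

    merged = df_test[[id_col, "ds", "y"]].merge(
        forecast[[id_col, "ds", yhat_col]], on=[id_col, "ds"], how="inner"
    )
    # Rows without an observation or a prediction cannot be scored.
    merged = merged.dropna(subset=["y", yhat_col])

    merged["abs_err"] = (merged["y"] - merged[yhat_col]).abs()
    merged["sq_err"] = (merged["y"] - merged[yhat_col]) ** 2

    out = merged.groupby(id_col).agg(
        MAE=("abs_err", "mean"),
        MSE=("sq_err", "mean"),
        n=("y", "size"),
    )
    out["RMSE"] = np.sqrt(out["MSE"])
    return out[["MAE", "RMSE", "n"]]
```

With the code from the report, this would be called as per_series_errors(forecast, df_test), provided the forecast horizon covers the test dates.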

ourownstory commented 4 months ago

@catherinening Please excuse the late follow-up. Did you find a solution to this? I suspect that some of your series still had no or insufficient observations in the training data after the split, and thus got omitted by the model. They may, however, still be present in the test dataframe, leading to an issue. It should, however, fail with a clear message. Might you have a full trace of the calls leading to this pandas error?

In the meantime, you can screen your training dataframe and remove all series with insufficient samples from there and from the test dataframe. If that does not resolve it, you could call test() iteratively for each series until you catch the error, then you know which one to further investigate.
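
The two workarounds above could be sketched roughly as follows. This is a minimal illustration, not NeuralProphet code: the 'ID' column name, the thresholds min_train and min_test, and the helper names drop_short_series and find_failing_series are all assumptions made for the example. Screening the test side as well as the training side goes slightly beyond the suggestion above.

```python
import pandas as pd


def drop_short_series(df_train, df_test, min_train=5, min_test=3, id_col="ID"):
    """Keep only series with enough rows in BOTH the train and test frames.

    The thresholds are illustrative assumptions, not library defaults.
    """
    train_counts = df_train.groupby(id_col).size()
    test_counts = df_test.groupby(id_col).size()
    keep = set(train_counts[train_counts >= min_train].index) & set(
        test_counts[test_counts >= min_test].index
    )
    train_kept = df_train[df_train[id_col].isin(keep)].copy()
    test_kept = df_test[df_test[id_col].isin(keep)].copy()
    return train_kept, test_kept, sorted(keep)


def find_failing_series(df_test, test_fn, id_col="ID"):
    """Call `test_fn` on one series at a time and record which ones raise."""
    failures = {}
    for series_id, group in df_test.groupby(id_col):
        try:
            test_fn(group)
        except Exception as exc:  # broad on purpose: we only want to locate the series
            failures[series_id] = repr(exc)
    return failures
```

In the setup from the report, test_fn would be m.test, for example find_failing_series(df_test, m.test). Whether m.test accepts a single-series frame for a model fitted on many series is assumed here, not verified.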

ourownstory commented 4 months ago

@MaiBe-ctrl Do you mind checking if you get the same error with the dummy dataset?

MaiBe-ctrl commented 4 months ago

This happens when the test dataframe is too small: the inferred frequency is then set to NaT. Increasing the split quota from 0.33 to 0.4 solves the problem. @ourownstory we can solve this issue by raising an exception in case the test dataframe is too small to infer the frequency. What do you think?
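
A guard along the lines proposed above might look like the sketch below. It is an assumption-laden illustration rather than the library's implementation: it relies on pandas' pd.infer_freq (which needs at least three timestamps), assumes an 'ID' column for the series identifier, and uses a hypothetical helper name check_freq_inferable.

```python
import pandas as pd


def check_freq_inferable(df, id_col="ID", ds_col="ds", min_rows=3):
    """Raise a descriptive ValueError if any series is too short to infer a frequency.

    `min_rows=3` mirrors the minimum pd.infer_freq needs; it is an
    assumption about what a sensible guard would require.
    """
    problem_ids = []
    for series_id, group in df.groupby(id_col):
        dates = pd.DatetimeIndex(pd.to_datetime(group[ds_col])).sort_values()
        # Check the length first so pd.infer_freq is never called on < 3 dates.
        if len(dates) < min_rows or pd.infer_freq(dates) is None:
            problem_ids.append(series_id)
    if problem_ids:
        raise ValueError(
            f"Cannot infer a frequency for series {problem_ids}: "
            f"each series needs at least {min_rows} regularly spaced timestamps."
        )
```

Calling check_freq_inferable(df_test) before m.test(df_test) would surface the affected series by name instead of the opaque pandas error.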

ourownstory / neural_prophet

Neural Prophet is throwing `ValueError: Invalid frequency: NaT` when running .test() #1550
