kdgutier / esrnn_torch

MIT License

AssertionError: clean np.nan's from unique_idxs unbalanced Y_df #29

Closed: murometz closed this issue 3 years ago

murometz commented 3 years ago

Hello, I am getting this assertion error. There are no NaNs in X_train_df or y_train_df.

Thank you in advance!

Regards

kdgutier commented 3 years ago

Hi murometz, it seems that you may have NaNs in the unique_id column: https://stackoverflow.com/questions/26266362/how-to-count-the-nan-values-in-a-column-in-pandas-dataframe
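For instance, along the lines of that answer (y_train_df being your long-format dataframe):

```python
# Count NaNs in the unique_id column, and per column overall.
print(y_train_df['unique_id'].isna().sum())
print(y_train_df.isna().sum())
```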

We are currently working on a new repository with N-BEATS, N-BEATSx, and ES-RNN; this repository will soon point to it.

murometz commented 3 years ago

Hi Kin, thank you for the fast response. Unfortunately not: `y_train_df.unique_id.isna().sum()` and `X_train_df.unique_id.isna().sum()` are both 0.

There are no NaNs in that column in either dataframe.

Are there any other requirements on the IDs besides being unique?

Thank you!

kdgutier commented 3 years ago

In principle, the unique_idxs being unique is sufficient, though some corner cases could slip through.

It seems to me that what is giving you trouble are the assertion lines in the Iterator in this file: https://github.com/kdgutier/esrnn_torch/blob/master/ESRNN/utils/data.py

First, just to confirm: do you get the error at those lines? Second, the assertion already prints the potentially offending unique_idxs to guide the data cleaning. Do you see the printed unique_idxs? If not, it means the assertion is printing an empty list.

The iterator is a bit delicate: the unique_idxs are assumed to be column 0 of the X matrix (line 71).
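Roughly, a hypothetical sketch of what that check amounts to (names and shapes here are assumptions, not the repo's exact code):

```python
import numpy as np

def assert_balanced(X, y):
    # Hypothetical approximation of the iterator's check in
    # ESRNN/utils/data.py: ids are read from column 0 of X, and the
    # assertion message lists the ids whose y values contain NaNs.
    nan_rows = np.isnan(y.astype(float))
    offending_idxs = np.unique(X[nan_rows, 0])
    assert len(offending_idxs) == 0, \
        f"clean np.nan's from unique_idxs {list(offending_idxs)}"
```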

murometz commented 3 years ago
  1. Correct, this is the place
  2. Correct, I see the list of potential unique_idxs

I subsetted the dataframe on these IDs; nothing special there. Then I excluded them all, and the next bunch appeared...

These are weekly data. Maybe the exact one-week spacing between dates is violated somewhere. Could that be the reason? Or some setting in the model?

```python
model = ESRNN(max_epochs=5, freq_of_test=1, batch_size=32, learning_rate=0.02,
              per_series_lr_multip=0.5, lr_scheduler_step_size=7, lr_decay=0.5,
              gradient_clipping_threshold=50, rnn_weight_decay=0.0,
              noise_std=0.001, level_variability_penalty=30,
              testing_percentile=50, training_percentile=50, ensemble=True,
              max_periods=140, seasonality=[52], input_size=24, output_size=2,
              cell_type='LSTM', state_hsize=40, dilations=[[1, 4, 24, 52]],
              add_nl_layer=False, random_seed=1, device='cpu')
```

kdgutier commented 3 years ago

It is more plausible that it is an issue with the dataframes. Another potential reason is missing values in the 'y' column between weeks for some unique_ids: the model assumes the data to be dense from the start date to the end date. The assertion comes from checking for NaNs in the 'y' variable.
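A quick way to surface such gaps (a sketch assuming the repo's long format with unique_id, ds, and y columns, with ds parsed as datetimes):

```python
import pandas as pd

# Flag rows that follow a within-series gap larger than one week; such
# gaps turn into NaNs once the panel is pivoted to wide format.
df = y_train_df.sort_values(['unique_id', 'ds'])
gaps = df.groupby('unique_id')['ds'].diff()
print(df[gaps > pd.Timedelta(days=7)])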

It seems that the problem comes from this script, https://github.com/kdgutier/esrnn_torch/blob/master/ESRNN/ESRNN.py, in the long_to_wide wrangling method (an extra guess is around the pandas pivot function).

Hope it helps

murometz commented 3 years ago

If there are places in the data with consecutive timesteps like 1970-01-01 followed not by the date one week later but by, e.g., 1970-01-15, would that produce NaNs?

kdgutier commented 3 years ago

My suggestion is to run lines 572 to 591 in a notebook with your dataframes and check the rows for NaNs. My guess is that the problem comes from the pandas pivot function that transforms the data from long to wide and fills the y matrix with NaNs, for example when one series has weekly observations and another has biweekly observations.
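As an illustration (hypothetical toy data, not the repo's code), pivoting a weekly and a biweekly series together fills the missing weeks with NaN:

```python
import pandas as pd

# Series 'A' is weekly and series 'B' is biweekly: after the long-to-wide
# pivot, 'B' gets NaNs for the weeks it is missing.
long_df = pd.DataFrame({
    'unique_id': ['A'] * 4 + ['B'] * 2,
    'ds': (list(pd.date_range('1970-01-01', periods=4, freq='W')) +
           list(pd.date_range('1970-01-01', periods=2, freq='2W'))),
    'y': [1.0, 2.0, 3.0, 4.0, 10.0, 20.0],
})
wide_df = long_df.pivot(index='unique_id', columns='ds', values='y')
print(wide_df)  # row 'B' has NaNs on the skipped weeks
```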

If that is the case, balancing the dataset is a possible solution: https://stackoverflow.com/questions/45839316/pandas-balancing-data
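A minimal sketch of one such balancing step, reusing the toy long_df above (forward-fill is one choice among several):

```python
# Reindex every series onto the complete weekly grid, then fill the gaps
# (forward-fill here; interpolation is another option).
full_range = pd.date_range(long_df['ds'].min(), long_df['ds'].max(), freq='W')

balanced = (
    long_df.set_index('ds')
           .groupby('unique_id')['y']
           .apply(lambda s: s.reindex(full_range).ffill())
           .rename_axis(['unique_id', 'ds'])
           .reset_index()
)
print(balanced)
```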

murometz commented 3 years ago

Your guess is correct. There are NaNs in y. Thank you very much!

So these NaNs come from non-consecutive dates (not exactly 7 days apart)?

kdgutier commented 3 years ago

Yes, from the pandas pivot function. Glad to help.

murometz commented 3 years ago

Thank you very much for your help!!