Hi murometz, it seems that you may have NaNs in the unique_id column: https://stackoverflow.com/questions/26266362/how-to-count-the-nan-values-in-a-column-in-pandas-dataframe
We are currently working on a new repository with N-BEATS, N-BEATSx, and ES-RNN; this repository will soon point to it.
Hi Kin, thank you for the fast response. Unfortunately not: y_train_df.unique_id.isna().sum() and X_train_df.unique_id.isna().sum() are both 0.
There are no nans in either column of these dataframes.
Are there any other requirements on the IDs besides being unique?
Thank you!
In principle, the unique_idxs being unique is sufficient, though some corner cases could slip through.
It seems to me that what is giving you trouble are those lines from the Iterator in this file: https://github.com/kdgutier/esrnn_torch/blob/master/ESRNN/utils/data.py. First, just to confirm: do you get the error in those lines? Second, the assertion already prints the offending unique_idxs to help guide the data cleaning. Do you see the printed unique_idxs? If not, it means the list being printed is empty.
The iterator is a bit delicate: the unique_idxs are assumed to be column 0 of the X matrix (see line 71).
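Roughly, the assumption looks like this (an illustrative sketch with toy data, not the exact code from data.py):

```python
import numpy as np

# Toy wide matrix: column 0 holds the series id, the rest are features.
X = np.array([['series_1', 0.1, 0.2],
              ['series_1', 0.3, 0.4],
              ['series_2', 0.5, 0.6]], dtype=object)

# The Iterator is assumed to recover the series ids from column 0 like this:
unique_idxs = np.unique(X[:, 0])
assert len(unique_idxs) > 0, f'Check the unique_idxs: {unique_idxs}'
print(unique_idxs)  # ['series_1' 'series_2']
```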
I subset the dataframe on those IDs and saw nothing special. Then I excluded them all, and the next batch of IDs appeared...
These are weekly data. Maybe the exact one-week spacing between dates is violated somewhere. Could that be the reason? Or some setting in the model?
    model = ESRNN(max_epochs=5, freq_of_test=1, batch_size=32,
                  learning_rate=0.02, per_series_lr_multip=0.5,
                  lr_scheduler_step_size=7, lr_decay=0.5,
                  gradient_clipping_threshold=50, rnn_weight_decay=0.0,
                  noise_std=0.001, level_variability_penalty=30,
                  testing_percentile=50, training_percentile=50,
                  ensemble=True, max_periods=140, seasonality=[52],
                  input_size=24, output_size=2, cell_type='LSTM',
                  state_hsize=40, dilations=[[1, 4, 24, 52]],
                  add_nl_layer=False, random_seed=1, device='cpu')
It is more plausible that this is an issue with the dataframes. Another potential reason is that there are missing values in the 'y' column between weeks for some unique_ids; the model assumes the data to be dense between the start and end dates. The assertion comes from checking for NaNs in the 'y' variable.
It seems that the problem comes from this script https://github.com/kdgutier/esrnn_torch/blob/master/ESRNN/ESRNN.py, in the long_to_wide wrangling method (an extra guess is that it happens around the pandas pivot function).
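For illustration, a minimal sketch of how the pivot can introduce NaNs when one series skips weeks (made-up data; the unique_id/ds/y long format is assumed):

```python
import pandas as pd

# Series 'a' is weekly, series 'b' is biweekly (one week missing).
long_df = pd.DataFrame({
    'unique_id': ['a', 'a', 'a', 'b', 'b'],
    'ds': pd.to_datetime(['1970-01-01', '1970-01-08', '1970-01-15',
                          '1970-01-01', '1970-01-15']),
    'y': [1.0, 2.0, 3.0, 10.0, 30.0],
})

# Pivoting from long to wide leaves a NaN for series 'b' on 1970-01-08,
# which would later trip the assertion on the y matrix.
wide_y = long_df.pivot(index='unique_id', columns='ds', values='y')
print(wide_y)
```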
Hope it helps
If there are places in the data where consecutive timesteps are, e.g., 1970-01-01 and then the next row is not one week later but 1970-01-15, would that produce a NaN?
My suggestion is to run lines 572 to 591 in a notebook with your dataframes and check the rows for NaNs. My guess is that the problem comes from the pandas pivot function that transforms from long to wide and fills the y matrix with NaNs, for example when one series has weekly observations and another has biweekly observations.
If that is the case, balancing the dataset is a possible solution: https://stackoverflow.com/questions/45839316/pandas-balancing-data
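A hedged sketch of one way to balance the panel before fitting, along the lines of that StackOverflow answer (unique_id/ds/y column names assumed; the fill strategy for the new rows is up to you):

```python
import pandas as pd

def balance_weekly(long_df):
    """Reindex every series to a complete weekly date range so the
    long-to-wide pivot no longer produces NaNs. Illustrative sketch only."""
    balanced = []
    for uid, group in long_df.groupby('unique_id'):
        # Full weekly grid from the series' first to last observation.
        full_range = pd.date_range(group['ds'].min(), group['ds'].max(), freq='7D')
        g = group.set_index('ds').reindex(full_range)
        g['unique_id'] = uid
        # Pick a fill that makes sense for your data (forward-fill, interpolation, ...).
        g['y'] = g['y'].interpolate()
        balanced.append(g.rename_axis('ds').reset_index())
    return pd.concat(balanced, ignore_index=True)
```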
Your guess is correct: there are NaNs in y. Thank you very much!
So these NaNs are coming from non-consecutive dates (not exactly 7 days apart)?
Yes, from the pandas pivot function. Glad to help.
Thank you very much for your help!!
Hello, I am getting this assertion error. There are no NaNs in X_train_df and y_train_df.
Thank you in advance!
Regards