gzerveas / mvts_transformer

Multivariate Time Series Transformer, public version
MIT License
718 stars · 169 forks

Imputation with nan's produces loss to be nan #41

Open watersoup opened 1 year ago

watersoup commented 1 year ago

Hi George, imputation using this package is a bit confusing. I tried keeping NaNs for the values to be imputed, but then my loss is obviously NaN. I cannot use 0 as a filler because 0 can be a legitimate value, and when I tried -1 it threw a ZeroDivisionError after the first epoch. I can give you my data, i.e. the wastewater class. Here is my JSON config file:

```json
{
  "data_dir": "\\Hscpigdcapmdw05\sas\Use....\inputdata",
  "output_dir": "\\Hscpigdcapmdw05\sas\Use...\mvts_imputed",
  "model": "transformer",
  "data_class": "wastewater",
  "task": "imputation",
  "d_model": 64,
  "activation": "relu",
  "num_heads": 4,
  "num_layers": 8,
  "pos_encoding": "learnable",
  "epochs": 10,
  "normalization": "minmax",
  "test_ratio": 0.1,
  "val_ratio": 0.05,
  "mean_mask_length": 6,
  "mask_mode": "concurrent",
  "mask_distribution": "bernoulli",
  "exclude_feats": ["geoLat", "geoLong", "phureg", "sewershedPop"],
  "data_window_len": 15,
  "lr": 0.001,
  "batch_size": 5,
  "masking_ratio": 0.05
}
```
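For what it's worth, here is a minimal illustration of why the loss goes NaN (a toy sketch assuming PyTorch tensors, not project code): a single NaN left in the data propagates through the loss arithmetic, so the reported loss itself becomes NaN.

```python
import torch

# Toy demonstration (not project code): NaN inputs poison the loss.
pred = torch.tensor([1.0, 2.0, 3.0])
target = torch.tensor([1.0, float("nan"), 3.0])  # one missing value left as NaN

loss = torch.nn.functional.mse_loss(pred, target)
print(loss)  # tensor(nan) -- NaN propagates through (pred - target)**2
```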

watersoup commented 1 year ago

I think this has been fixed with this new JSON, and by making sure the missing values are stored as np.nan.

```json
{
  "data_dir": "../inputdata",
  "output_dir": "../mvts_imputed",
  "model": "transformer",
  "data_class": "wastewater",
  "records_file": "../mvts_imputed/ImputationRecords.xls",
  "print_interval": 10,

  "mean_mask_length": 6,
  "mask_mode": "concurrent",
  "mask_distribution": "bernoulli",

  "task": "imputation",
  "no_timestamp": 1,
  "d_model": 128,
  "activation": "relu",
  "num_heads": 8,
  "num_layers": 8,
  "pos_encoding": "learnable",
  "epochs": 200,
  "normalization": "minmax",

  "val_ratio": 0.2,
  "val_interval": 5,

  "exclude_feats": ["geoLat", "geoLong", "phureg", "sewershedPop"],
  "data_window_len": 15,
  "batch_size": 10
}
```

gzerveas commented 1 year ago

Great to see that you fixed your issue. Did you use np.nan as a masking value?

watersoup commented 1 year ago

> Great to see that you fixed your issue. Did you use np.nan as a masking value?

Hi George, yes, I had to use np.nan for the missing values in the dataset, and I also increased the validation ratio from 5% to 20%. Jag

watersoup commented 1 year ago

Hi George, it seems that np.nan is not actually working after all; I am not sure why it worked earlier. Is there some other way to make missing values work? Or can you tell me where and how the missing entries in the training data can be dropped during training? Thanks, Jag

gzerveas commented 1 year ago

Ok, let me clarify a couple of things: When training, you need to have complete data (no missing values), and you simply train with the self-supervised imputation objective. During inference, you must know the indices of the missing values (i.e. which values are actually missing), but what values you use as fillers is up to you. You only need to ensure that the filler values between training and inference are consistent.
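To make the consistency point concrete, here is a minimal sketch (assuming NumPy arrays; the names `fill_missing` and `FILLER` are hypothetical, not part of the repo): whatever filler you pick, apply it through one shared code path at both training and inference.

```python
import numpy as np

FILLER = 0.0  # hypothetical constant: the one filler used at training AND inference

def fill_missing(X: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Overwrite missing positions with the agreed filler value.

    X:    (seq_length, feat_dim) feature array
    mask: (seq_length, feat_dim) boolean array, False where a value is missing
    """
    X = X.copy()
    X[~mask] = FILLER
    return X
```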

You therefore don't need NaN values to represent missingness (and they can easily cause trouble). Missingness is represented by the boolean mask, a `(seq_length, feat_dim)` array for each sample, which corresponds to the missing indices. This mask is produced by `noise_mask` within the `__getitem__` of `ImputationDataset` during training, but during inference it must be provided by you, i.e. by your `data.py` class, and then passed on to a Dataset class that can be a subclass of `ImputationDataset`, almost identical to the original, which simply gets the masks directly from your `data.py` class instead of calling `noise_mask` to generate them on the fly. These masks are then used within the collate function to define the `target_masks` (you don't need to change these), and to transform the `X` features tensor. Currently, this transformation enters zeroes where the values are missing (these are the zeroes of the boolean mask). However, if you think zeroes won't work for you (I suggest you first give it a try), you can set the corresponding elements of `X` to an arbitrary value outside the range of your features, e.g. -77. Again, just make sure that you do this consistently both for training and for inference; this simply means changing this line within `collate_unsuperv`.
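For illustration, a rough sketch of such a subclass (not verbatim project code; the class name is hypothetical, `feature_df` follows the repo's data-class convention, and `mask_df` is a hypothetical attribute standing in for wherever your `data.py` class keeps the precomputed boolean masks):

```python
import torch
from torch.utils.data import Dataset

class PrecomputedMaskImputationDataset(Dataset):  # hypothetical name
    """Like ImputationDataset, but reads each sample's (seq_length, feat_dim)
    boolean mask (0 = missing) from the data class instead of generating it
    with noise_mask on the fly."""

    def __init__(self, data, indices):
        self.data = data        # your data.py class instance
        self.IDs = indices      # sample IDs used by this split
        self.feature_df = self.data.feature_df.loc[self.IDs]

    def __getitem__(self, ind):
        ID = self.IDs[ind]
        X = self.feature_df.loc[ID].values       # (seq_length, feat_dim)
        mask = self.data.mask_df.loc[ID].values  # precomputed, not noise_mask
        return torch.from_numpy(X), torch.from_numpy(mask), ID

    def __len__(self):
        return len(self.IDs)
```

The collate function then stays as-is for `target_masks`; only the line that currently writes zeroes into `X` would change if you pick a non-zero filler (hypothetically, something like `X[~mask] = -77` instead of multiplying by the mask), done identically at training and inference.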

watersoup commented 1 year ago

Hi George,

I greatly appreciate your elaborate reply; I will update my methodology and check it out today.

Thanks, Jag

