sktime / pytorch-forecasting

Time series forecasting with PyTorch
https://pytorch-forecasting.readthedocs.io/
MIT License

Decode groupIDs and targets #1370

Open dorienh opened 1 year ago

dorienh commented 1 year ago

I use the column filename to distinguish my different time series (see loader code at the bottom). During testing, I am trying to match predictions with the original dataset. All is going OK, except the filename is now an integer, not a filename. How do I decode the original filename from batch['x']['groups']?

Is groups not the right info to use? And how do I decode it? Would the mapping be in dataset.get_parameters()?

    def test_step(self, batch, batch_idx):
        x, y = batch
        for i in range(len(x["encoder_cont"])):
            # last decoder time index, encoded group id, true target, prediction
            # (preds is computed from y_hat earlier in the step)
            row = [x['decoder_time_idx'][i][-1].tolist(), x['groups'][i][0].tolist(),
                   x['decoder_target'][i][0].tolist(), preds[i].tolist()]
            self.datarows.append(row)

Later, after all steps are completed, I convert the collected rows to a dataframe:

df = pd.DataFrame(self.datarows, columns=['idx', 'group', 'y', 'y_hat'])

Then I can match each y_hat to a row in the original dataset by matching on idx and group.

Am I missing an easier way to do this?
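
For reference, this is the kind of lookup I would expect to work, though I haven't verified it (it assumes get_parameters() returns the fitted NaNLabelEncoder under categorical_encoders, and that its classes_ dict maps each original label to its integer code):

    # untested sketch: decode encoded group ids back to filenames
    params = self.training_dataset.get_parameters()
    encoder = params["categorical_encoders"]["filename"]  # fitted NaNLabelEncoder (assumption)
    # classes_ maps original label -> integer code; invert it to decode
    code_to_name = {code: name for name, code in encoder.classes_.items()}
    group_codes = x["groups"][:, 0].tolist()  # first (and only) group id column
    filenames = [code_to_name[code] for code in group_codes]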

This is my TimeSeriesDataSet init:

        self.training_dataset = TimeSeriesDataSet(
            dataset[lambda x: x['idx'] <= training_cutoff],
            time_idx='idx',
            target="y",
            group_ids=["filename"],  # groups different time series
            min_encoder_length=max_encoder_length,
            max_encoder_length=max_encoder_length,
            min_prediction_length=1,
            max_prediction_length=max_prediction_length,
            static_categoricals=[],
            static_reals=[],
            time_varying_known_categoricals=[],  # categoricals known in the future, e.g. calendar features
            time_varying_known_reals=['ATR', 'Open', 'High', 'Low', 'Close', 'Volume', 'CC', 'Close_pct', 'Volume_pct'],  ## add other variables later on
            time_varying_unknown_categoricals=['y'],
            time_varying_unknown_reals=[],  # continuous variables that change over time and are not known in the future
            categorical_encoders={'filename': pytorch_forecasting.data.encoders.NaNLabelEncoder(add_nan=True), 'y': pytorch_forecasting.data.encoders.NaNLabelEncoder(add_nan=False)},  ## how are NaNs processed? there should be none.
            # scalers={"Close": None, "idx": None, 'Volume_pct': None},  # defaults to sklearn's StandardScaler()
            target_normalizer=pytorch_forecasting.data.encoders.NaNLabelEncoder(),
            add_relative_time_idx=False,
            add_target_scales=False,  ## what is this?
            add_encoder_length=False,
            allow_missing_timesteps=False,  # do not allow missing idx values
            predict_mode=False,  # True would yield only the last sample per series
        )
dorienh commented 1 year ago

I found the magic function dataset.x_to_index(), which allows me to get the idx and group from x.
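
A minimal usage sketch (my own; assuming dataset is the fitted TimeSeriesDataSet and x is a batch dict from its dataloader):

    # x_to_index returns one row per sample with the decoded index columns,
    # here idx (time index) and filename (group id)
    x, y = next(iter(dataloader))
    index_df = dataset.x_to_index(x)
    print(index_df.head())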

I am still struggling to find a simple mapping for my target categorical variable. In the batch's y it is 0/1/2, whereas in the original data it's a string (Buy/Sell/None). It seems like there should be a simple mapping saved somewhere...

For the record, here is how I merge my original dataframe with the predictions:

    def test_step(self, batch, batch_idx):
        x, y = batch  # not sure if this should instead be batch[0], batch[1]
        y_hat = self(x)
        loss = self.criterion(y_hat, y[0].squeeze(1))
        self.log('test_loss', loss, batch_size=self.batch_size)

        # convert logits to class predictions
        preds = torch.argmax(y_hat, dim=1)
        self.test_step_y_hats.append(preds)
        self.test_step_ys.append(y[0].squeeze(1))
        self.test_step_xs.append(x["encoder_cont"])

        # save predictions for reconstruction
        for i in range(len(x["encoder_cont"])):
            row = [x['decoder_target'][i][0].tolist(), preds[i].tolist()]
            self.datarows.append(row)

        self.databatch_x.append(x)

Then in my main function I run the test phase and merge the results:

    self.trainer.test(model=self.model, dataloaders=dataloader)
    # get x, y, y_hat, mapping
    datarows = self.model.get_datarows()
    print('Number of datarows: ' + str(len(datarows)))
    df = pd.DataFrame(datarows, columns=['y_orig', 'y_hat'])

    # map each sample back to its idx and group (filename)
    databatch_x = self.model.get_databatch_x()
    id_file = []
    for xbatch in databatch_x:
        mapping = dataset.x_to_index(xbatch)
        id_file.append(mapping)
    index_map = pd.concat(id_file, ignore_index=True)  # one big map dataframe

    # add y and y_hat to the map (rows are in the same order as the batches)
    index_map['y_orig'] = df['y_orig']
    index_map['y_hat'] = df['y_hat']

    print(df.head())

    # merge with original data
    all_data_preds = pd.merge(index_map, orig_data, on=['idx', 'filename'], how='inner')
    print(all_data_preds.head(50))
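
One sanity check I would add here (my own suggestion, not something the library requires):

    # the column assignment above relies purely on row order, so the index
    # map and the prediction rows must line up one-to-one
    assert len(index_map) == len(df), "index map and prediction rows misaligned"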


I hope this helps someone save time. 

Still looking for a quick way to get a dictionary of how my target is encoded. Inverse transforming didn't seem to work. 
dorienh commented 1 year ago

Update: after I get this merged df, I can decode the y labels from their integer encoding back to the original strings by learning my own mapping. Still looking for where this dict is stored in the library.

    def get_target_mapping(self, df):
        # learn the mapping original label -> integer code from the merged dataframe
        mapping_dict = {}

        # iterate through unique original labels in 'y'
        for y_value in df['y'].unique():
            # find the corresponding encoded value in 'y_orig'
            corresponding_y_hat = df[df['y'] == y_value]['y_orig'].iloc[0]
            mapping_dict[y_value] = corresponding_y_hat

        return mapping_dict

    def apply_decode_target(self, df):
        # invert the mapping (integer code -> original label) and decode predictions
        mapping_dict = self.get_target_mapping(df)
        inverted_dict = {value: key for key, value in mapping_dict.items()}
        df['predictions'] = df['y_hat'].map(inverted_dict)

        return df
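
My current guess at where this dict actually lives (untested): since the target is encoded by the NaNLabelEncoder passed as target_normalizer, its fitted classes_ dict should map each original label to its integer code.

    # untested guess: read the mapping off the fitted target encoder
    label_to_code = dataset.target_normalizer.classes_  # e.g. {'Buy': 0, 'None': 1, 'Sell': 2} (illustrative)
    code_to_label = {code: label for label, code in label_to_code.items()}
    df['predictions'] = df['y_hat'].map(code_to_label)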
pcgm-team commented 11 months ago

you can't put those variables in time_varying_known_reals, by the way; that argument is for covariates whose future values are known ahead of time, like time encodings, afaik
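
Something like this instead, I'd guess (sketch, reusing the field names from the init above):

    time_varying_known_reals=[],  # reserve for covariates known ahead of time, e.g. time encodings
    # observed-only covariates belong here instead
    time_varying_unknown_reals=['ATR', 'Open', 'High', 'Low', 'Close', 'Volume', 'CC', 'Close_pct', 'Volume_pct'],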