dorienh opened 1 year ago
I found the magic function `dataset.x_to_index()`, which allows me to get the idx and group from x.
I am still struggling to find a very simple map for my target categorical variable. In `batch.y` it is 0/1/2, whereas in the original data it's a string (Buy/Sell/None). It seems like there should be a simple mapping saved somewhere...

For the record, here is how I merge my original dataframe with predictions:
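In case it helps: pytorch-forecasting encodes categorical columns (including a categorical target) with `NaNLabelEncoder` objects kept on the dataset, typically under `dataset.categorical_encoders` (and `dataset.target_normalizer` for the target); the encoder's `classes_` attribute should be exactly the label-to-code dict in question. Those attribute names are my reading of the library, so verify them on your version. The decode itself is just an inverted dict, sketched here with a stand-in:

```python
# Stand-in for encoder.classes_ (label -> integer code); in
# pytorch-forecasting this would come from something like
# dataset.categorical_encoders["target_col"].classes_ (hypothetical path,
# verify against your version).
classes_ = {"Buy": 0, "None": 1, "Sell": 2}

# Invert to get code -> label for decoding model outputs
decode = {code: label for label, code in classes_.items()}
print(decode[2])  # -> "Sell"
```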
```python
def test_step(self, batch, batch_idx):
    x, y = batch  # not sure if this should instead be batch[0], batch[1]
    y_hat = self(x)
    loss = self.criterion(y_hat, y[0].squeeze(1))
    self.log('test_loss', loss, batch_size=self.batch_size)
    # convert logits to class predictions
    preds = torch.argmax(y_hat, dim=1)
    self.test_step_y_hats.append(preds)
    self.test_step_ys.append(y[0].squeeze(1))
    self.test_step_xs.append(x["encoder_cont"])
    # save predictions for reconstruction
    for i in range(len(x["encoder_cont"])):
        row = [x['decoder_target'][i][0].tolist(), preds[i].tolist()]
        self.datarow.append(row)
    self.databatch_x.append(x)
```
Then in my main function I start the test phase, call this, and merge with:
```python
self.trainer.test(model=self.model, dataloaders=dataloader)

# get x, y, y_hat, mapping
datarows = self.model.get_datarows()
print('Number of datarows: ' + str(len(datarows)))
df = pd.DataFrame(datarows, columns=['y_orig', 'y_hat'])

# get mapping to groups for y
databatch_x = self.model.get_databatch_x()
id_file = []
for xbatch in databatch_x:
    mapping = dataset.x_to_index(xbatch)
    id_file.append(mapping)
index_map = pd.concat(id_file, ignore_index=True)  # one big map dataframe

# add y to map
index_map['y_orig'] = df['y_orig']
index_map['y_hat'] = df['y_hat']
print(df.head())

# merge with original data
all_data_preds = pd.merge(index_map, orig_data, on=['idx', 'filename'], how='inner')
print(all_data_preds.head(50))
```
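A side note on the merge step: pandas' `merge` accepts a `validate=` argument that raises if the join keys unexpectedly duplicate, which catches silent row multiplication or loss early. A minimal self-contained sketch with made-up stand-ins for `index_map` and `orig_data`:

```python
import pandas as pd

# Made-up stand-ins for index_map and orig_data
index_map = pd.DataFrame({"idx": [0, 1], "filename": ["a.csv", "a.csv"],
                          "y_hat": [2, 0]})
orig_data = pd.DataFrame({"idx": [0, 1, 2], "filename": ["a.csv"] * 3,
                          "y": ["Sell", "Buy", "None"]})

# validate raises MergeError if the keys are not one-to-one as expected
merged = pd.merge(index_map, orig_data, on=["idx", "filename"],
                  how="inner", validate="one_to_one")
print(len(merged))  # only the two rows present in index_map survive
```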
I hope this helps someone save time.
Still looking for a quick way to get a dictionary of how my target is encoded; inverse transforming didn't seem to work.

Update: after I get this merged df, I can decode the y labels back from int encoding to their original strings by learning my own mapping. Still looking for where this dict is actually stored.
```python
def get_target_mapping(self, df):
    # map each original label in 'y' to its encoded value in 'y_orig'
    mapping_dict = {}
    for y_value in df['y'].unique():
        # find the corresponding encoded value
        corresponding_y_hat = df[df['y'] == y_value]['y_orig'].iloc[0]
        mapping_dict[y_value] = corresponding_y_hat
    return mapping_dict

def apply_decode_target(self, df):
    mapping_dict = self.get_target_mapping(df)
    # invert to code -> label, then decode the integer predictions
    inverted_dict = {value: key for key, value in mapping_dict.items()}
    df['predictions'] = df['y_hat'].map(inverted_dict)
    return df
```
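For anyone reusing those helpers, here is a self-contained run of the same logic on a toy merged frame. Column roles follow the snippet: `y` holds the original string from the merge, `y_orig` the encoded target from the batch, and `y_hat` the predicted integer class; all values are made up for illustration:

```python
import pandas as pd

# Toy merged frame: 'y' is the original string label, 'y_orig' the encoded
# target from the batch, 'y_hat' the predicted integer class.
df = pd.DataFrame({"y": ["Buy", "Sell", "None", "Buy"],
                   "y_orig": [0, 2, 1, 0],
                   "y_hat": [0, 2, 2, 1]})

# Same logic as get_target_mapping / apply_decode_target
mapping = {y: df.loc[df["y"] == y, "y_orig"].iloc[0] for y in df["y"].unique()}
inverted = {v: k for k, v in mapping.items()}
df["predictions"] = df["y_hat"].map(inverted)
print(df["predictions"].tolist())  # -> ['Buy', 'Sell', 'Sell', 'None']
```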
You can't put those variables in known reals, by the way; that's for things like time encodings, AFAIK.
I use the column `filename` to indicate my different timeseries (see loader code at the bottom). During testing, I am trying to match predictions with the original dataset. All is going OK, except the filename is now an integer, not a filename. How do I decode the original filename from `batch['x']['groups']`?
Is `groups` not the info to get? And how do I decode it? Would that be in `dataset.get_parameters()`?

Later on, after all steps are completed, I convert to a dataframe:
```python
df = pd.DataFrame(self.datarows, columns=['idx', 'group', 'y', 'y_hat'])
```
Then I can match `y_hats` per row based on matching `idx` and `group` in the original dataset. Am I missing an easier way to do this?
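On the filename question specifically: the group ids in `batch['x']['groups']` are label-encoded, and the encoder for the group column should live in the dataset's `categorical_encoders` dict (also visible via `dataset.get_parameters()`); `NaNLabelEncoder` also exposes `inverse_transform` for this. The attribute names are my assumption from reading pytorch-forecasting, so verify on your version. The decode itself is a reverse lookup over the encoder's `classes_` dict, sketched with stand-ins:

```python
# Stand-in for dataset.categorical_encoders["filename"].classes_
# (hypothetical access path; verify against your pytorch-forecasting version)
classes_ = {"series_a.csv": 0, "series_b.csv": 1}
inverse = {code: name for name, code in classes_.items()}

# batch['x']['groups'] would be a tensor of codes; plain ints stand in here
group_codes = [1, 0, 1]
filenames = [inverse[c] for c in group_codes]
print(filenames)  # -> ['series_b.csv', 'series_a.csv', 'series_b.csv']
```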
This is my TimeSeries init: