himil48 opened this issue 1 year ago
This is hard to tell. What does data look like when you print it? Do you have a minimal example to reproduce?
import torch
from torch_geometric.data import HeteroData
from torch_geometric.transforms import ToUndirected

data = HeteroData()
# Add patient node features for message passing:
data['patient'].x = torch.eye(len(patient_mapping), device=device)
# Add condition node features:
data['condition'].x = condition_x
# Add edges (and their labels) between patients and conditions:
data['patient', 'has', 'condition'].edge_index = edge_index
data['patient', 'has', 'condition'].edge_label = edge_label
data.to(device, non_blocking=True)
# Add reverse edges so that messages can flow in both directions:
data = ToUndirected()(data)
del data['condition', 'rev_has', 'patient'].edge_label  # Remove "reverse" label.
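The train/test split then follows. I adapted it from the PyG heterogeneous link-prediction example, which uses RandomLinkSplit roughly like this (a sketch; my actual split parameters may differ):

from torch_geometric.transforms import RandomLinkSplit

# Split supervised edges into train/val/test; the reverse edge type is only
# used for message passing and never receives supervision labels:
transform = RandomLinkSplit(
    num_val=0.1,
    num_test=0.1,
    neg_sampling_ratio=0.0,
    edge_types=[('patient', 'has', 'condition')],
    rev_edge_types=[('condition', 'rev_has', 'patient')],
)
train_data, val_data, test_data = transform(data)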
Initially, it would output:

HeteroData(
  patient={ x=[1152, 1152] },
  condition={ x=[123, 384] }
)
But now it outputs:

HeteroData(
  patient={ x=[1152, 1152] },
  condition={ x=[123, 384] },
  (patient, has, condition)={
    edge_index=[2, 6990],
    edge_label=[6990]
  },
  (condition, rev_has, patient)={ edge_index=[2, 6990] }
)
and I no longer receive the AttributeError, so I'm not sure what changed. However, it still seems to show the reverse edge type despite me deleting its label.
Should I also explicitly call

del data['condition', 'rev_has', 'patient'].edge_index
The reverse edge_index is added via ToUndirected(), so I don't see a reason why you would want to remove it afterwards.
Okay, thank you very much
Do you have any suggestions on how I could best perform an analysis to understand why my model predicts links between certain nodes more frequently than others?
These are the normalised counts of the current conditions vs. the predicted conditions (number of predictions = 3). It seems to be over-predicting normal pregnancy, and the percentage associated with normal pregnancy also increases if I decrease the number of predictions.
This is interesting but hard to tell without further context. What do you mean by number of predictions = 3?
Number of predictions = k means that I only keep the top k predicted edges:

pred = model(data.x_dict, data.edge_index_dict, edge_label_index)
pred = pred.clamp(min=0, max=maxvalue)
patient_full_id = reverse_patient_mapping[patient_id]
# Keep all predictions whose score is at least the k-th largest one:
mask = (pred >= torch.topk(pred, number_of_predictions).values[number_of_predictions - 1]).nonzero(as_tuple=True)
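Incidentally, a slightly simpler way to get the same top-k edges is to use the indices that torch.topk already returns (a minimal sketch reusing pred and number_of_predictions from above):

top_scores, top_idx = torch.topk(pred, k=number_of_predictions)
# top_idx holds the positions of the k highest-scoring candidate edges;
# their condition nodes can be recovered via edge_label_index[1, top_idx].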
Sorry for my delayed response. Sadly, I don't have a good intuition for that. What does your encoder look like? Maybe it is not expressive enough to model the real data distribution?
No problem! Here is the encoder I used.
import torch
from torch_geometric.nn import SAGEConv

class GNNEncoder(torch.nn.Module):
    def __init__(self, hidden_channels, out_channels):
        super().__init__()
        # Lazy input sizes (-1, -1) let PyG infer the feature dimension of each node type:
        self.conv1 = SAGEConv((-1, -1), hidden_channels)
        self.conv2 = SAGEConv((-1, -1), out_channels)

    def forward(self, x, edge_index):
        x = self.conv1(x, edge_index).relu()
        x = self.conv2(x, edge_index)
        return x
Got it. And the decoder (that was a typo in my message above :( )?
from torch.nn import Linear

class EdgeDecoder(torch.nn.Module):
    def __init__(self, hidden_channels):
        super().__init__()
        self.lin1 = Linear(2 * hidden_channels, hidden_channels)
        self.lin2 = Linear(hidden_channels, 1)

    def forward(self, z_dict, edge_label_index):
        row, col = edge_label_index
        # Concatenate the patient and condition embeddings of each candidate edge:
        z = torch.cat([z_dict['patient'][row], z_dict['condition'][col]], dim=-1)
        z = self.lin1(z).relu()
        z = self.lin2(z)
        return z.view(-1)
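For completeness, I combine the two with to_hetero, which turns the homogeneous encoder into one that operates on all node and edge types of data (a sketch of the wiring; hidden sizes may differ in my actual code):

import torch
from torch_geometric.nn import to_hetero

class Model(torch.nn.Module):
    def __init__(self, hidden_channels):
        super().__init__()
        self.encoder = GNNEncoder(hidden_channels, hidden_channels)
        # Convert the homogeneous encoder into a heterogeneous one:
        self.encoder = to_hetero(self.encoder, data.metadata(), aggr='sum')
        self.decoder = EdgeDecoder(hidden_channels)

    def forward(self, x_dict, edge_index_dict, edge_label_index):
        z_dict = self.encoder(x_dict, edge_index_dict)
        return self.decoder(z_dict, edge_label_index)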
To provide a bit of context: I adapted the code from a Medium article in which the author tried to predict links between users and movies as part of a recommendation engine.
That actually looks pretty good; I thought your decoder might be underfitting. How do training and validation performance compare? Are you underfitting or overfitting?
Epoch: 299, Loss: 0.2675, Train: 0.3512, Val: 0.7475, Test: 0.7092
These are the scores from the last epoch. Based on the initial screenshots, which compare the current conditions to the predicted conditions, it seems to be underfitting. What would you say?
It is definitely underfitting :(
Should I perhaps try a different encoder/decoder approach?
First, it would be good to inspect whether there is a data distribution shift across training/validation/test. Having such low training performance while validation and test are high is pretty uncommon, and signals that there is some shift between the sets you created.
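For instance, comparing basic label statistics per split could look like this (a minimal sketch; the edge type and split variables are taken from your code):

for name, split in [('train', train_data), ('val', val_data), ('test', test_data)]:
    y = split['patient', 'has', 'condition'].edge_label.float()
    print(f'{name}: n={y.numel()}, mean={y.mean():.4f}, std={y.std():.4f}')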
My apologies, I forgot to include this:
import pandas as pd

# `df_cols` and `results` are assumed to be defined earlier, e.g.:
# df_cols = ['epoch', 'loss', 'train_rmse', 'val_rmse', 'test_rmse']
# results = pd.DataFrame(columns=df_cols)
for epoch in range(1, 300):
    loss = train()
    train_rmse = test(train_data)
    val_rmse = test(val_data)
    test_rmse = test(test_data)
    # val_auc = eval_link_predictor(model, val_data)
    series = pd.Series([epoch, loss, train_rmse, val_rmse, test_rmse], index=df_cols)
    results = pd.concat([results, series.to_frame().T], ignore_index=True)
    print(f'Epoch: {epoch:03d}, Loss: {loss:.4f}, Train: {train_rmse:.4f}, '
          f'Val: {val_rmse:.4f}, Test: {test_rmse:.4f}')
The values shared above are the RMSE scores.
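For reference, the test() function computes RMSE following the example I adapted, roughly like this sketch (assuming F is torch.nn.functional):

@torch.no_grad()
def test(data):
    model.eval()
    pred = model(data.x_dict, data.edge_index_dict,
                 data['patient', 'has', 'condition'].edge_label_index)
    target = data['patient', 'has', 'condition'].edge_label.float()
    # Root mean squared error between predicted and true edge labels:
    return float(F.mse_loss(pred, target).sqrt())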
Ah, it's RMSE :) In that case, it is currently overfitting. Regression values are always hard to interpret. You may find some luck in training against normalized targets. Does that help?
Could you kindly elaborate on the third sentence, about normalised targets?
Sure, how is your target distributed? I am not sure if you are already doing that, but usually you want to normalize the target variable, e.g., via

y = (y - y.mean()) / y.std()

where mean and std are computed over all labels of your training dataset.
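A minimal sketch of what that could look like here, assuming your regression targets live in edge_label (the statistics come from the training split only and are then applied to validation and test as well; de-normalize predictions via pred * std + mean when interpreting them):

# Compute normalization statistics on the training labels only:
y_train = train_data['patient', 'has', 'condition'].edge_label.float()
mean, std = y_train.mean(), y_train.std()
# Apply the same statistics to every split to avoid leakage:
for split in (train_data, val_data, test_data):
    store = split['patient', 'has', 'condition']
    store.edge_label = (store.edge_label.float() - mean) / std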
I have responded via the email provided on your GitHub profile :)
🐛 Describe the bug
Hi,
I am receiving the error

AttributeError: 'EdgeStorage' object has no attribute 'edge_index'

when running the code below. I have never received this error before, and I have not made any changes to my code between runs. Do you perhaps know what the problem might be?
Thank you!

Code

Error

Environment
- How you installed PyTorch and PyG (conda, pip, source):
- Any other relevant information (e.g., version of torch-scatter):