The code for your forward call in MB seems cut off. What are you actually outputting from the model?
I'm considering only the attention scores and ignoring the logits and the predicted probabilities.
Not really sure why that would make sense - if you have a ground-truth score for each patch, you don't need MIL/CLAM. You can just do supervised regression between each patch and its ground-truth score.
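(For illustration, the suggested per-patch supervised regression could look like the following minimal sketch; the regressor architecture, the 1024-dim features, and all variable names here are assumptions, not code from this repo:)

import torch
import torch.nn as nn

# hypothetical per-patch regressor: one score per 1024-dim patch feature
regressor = nn.Sequential(nn.Linear(1024, 256), nn.ReLU(), nn.Linear(256, 1))
optimizer = torch.optim.Adam(regressor.parameters(), lr=1e-4)
criterion = nn.MSELoss()

features = torch.randn(20000, 1024)   # [num_patches, feat_dim] for one slide
gt_scores = torch.rand(20000, 1)      # ground-truth score per patch

pred = regressor(features)            # [20000, 1] predicted scores
loss = criterion(pred, gt_scores)
optimizer.zero_grad()
loss.backward()
optimizer.step()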
In the training loop, I'm forcing the CLAM attention model to produce attention scores closer to the ground truth. Hence, the MSE loss.
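(A minimal sketch of that objective, with hypothetical names - the model is assumed to return raw attention scores A over the patches:)

import torch.nn.functional as F

# hypothetical: model returns attention scores A of shape [n_classes, num_patches]
A, _ = model(features)
l2_loss = F.mse_loss(A.view(-1), gt_scores.view(-1))  # push attention toward ground truth
l2_loss.backward()
optimizer.step()
optimizer.zero_grad()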
I tried overfitting the model on one gigapixel image with 30,000 patches. Upon closer investigation, I noticed that the weights aren't updated during training at all. Keeping the above reference code, I checked the model parameters as follows, and it's printing True for every epoch.
a = list(model.parameters())[0].clone()   # snapshot the first parameter before the update
l2_loss.backward()
optimizer.step()
b = list(model.parameters())[0].clone()   # snapshot after the update
print(torch.equal(a.data, b.data))        # True means the parameter did not change
print(list(model.parameters())[0].grad)   # None means no gradient reached it
optimizer.zero_grad()
Output:
True
None
It seems like the weights and gradients aren't updating at all. Do you have any advice, @fedshyvana?
Something seems off. Did you accidentally freeze the model weights? Can you check if requires_grad = True for your params? Can't really think of anything else.
a = list(model.parameters())[0].clone()
l2_loss.backward()
optimizer.step()
b = list(model.parameters())[0].clone()
print(torch.equal(a.data, b.data))
print(list(model.parameters())[0].grad)
for name, param in model.named_parameters():
    print(name, param.grad, param.requires_grad)   # grad and requires_grad for every parameter
optimizer.zero_grad()
Output:
True
None
attention_net.0.attention_a.0.weight None True
attention_net.0.attention_a.0.bias None True
attention_net.0.attention_a.1.weight None True
attention_net.0.attention_a.1.bias None True
attention_net.0.attention_b.0.weight None True
attention_net.0.attention_b.0.bias None True
attention_net.0.attention_b.1.weight None True
attention_net.0.attention_b.1.bias None True
attention_net.0.attention_c.weight None True
attention_net.0.attention_c.bias None True
@fedshyvana [UPDATE]: I caught the problem in the training process. The computational graph was broken in the training loop, which was blocking gradient propagation, and I have fixed that issue. Sometimes an unforeseen and seemingly trivial bug is the most annoying.
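(For anyone hitting the same symptom - backward() runs, but every param.grad stays None - a common way to break the graph is to leave torch mid-loop, e.g. by round-tripping through numpy and re-wrapping the result. An illustrative sketch, not necessarily the exact bug here:)

# BROKEN: re-wrapping the model output creates a new leaf tensor that is
# disconnected from the model, so backward() runs without error but never
# reaches the model parameters (param.grad stays None for all of them).
A, _ = model(features)
A_np = A.detach().cpu().numpy()                  # e.g. some numpy post-processing
scores = torch.tensor(A_np, requires_grad=True)  # new leaf tensor, graph is cut here
l2_loss = F.mse_loss(scores, gt_scores)
l2_loss.backward()                               # only scores.grad is populated

# FIXED: stay in torch ops on the tensor the model returned
A, _ = model(features)
l2_loss = F.mse_loss(A, gt_scores)
l2_loss.backward()                               # gradients now reach the model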
Hi, I am training a simple attention network on stored, pre-extracted ResNet features. Every gigapixel image is divided into approximately 20,000 patches of size 256x256, and each patch is associated with a feature vector from a custom ResNet50, so the data for each image has shape [20000, 1024]. The train data loader loads one gigapixel image at a time, making the batch size 1.
Model: [code snippet attached]
Utils: [code snippet attached]
Training: [code snippet attached]
The train loss oscillates, gets stuck within a fixed range of values, and does not decrease at all: [loss curve screenshot attached]
NOTE: I have tried the above experiments with learning rates ranging from 1e-2 to 1e-6, weight decay from 1e-3 to 1e-6, both Adam and SGD optimizers, and epochs from 50 to 200 (with and without early stopping). The loss curve for all the experiments conducted so far is similar to the above snapshot.
Any help is appreciated, @fedshyvana.
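(For context, the parameter names printed earlier in this thread - attention_a, attention_b, attention_c - match a gated attention module in the style of CLAM's Attn_Net_Gated. A minimal sketch of that pattern follows; the dimensions are assumed, and the custom variant used here may differ:)

import torch
import torch.nn as nn

class AttnNetGated(nn.Module):
    # gated attention in the style of CLAM's Attn_Net_Gated (dims assumed)
    def __init__(self, L=1024, D=256, n_classes=1):
        super().__init__()
        self.attention_a = nn.Sequential(nn.Linear(L, D), nn.Tanh())
        self.attention_b = nn.Sequential(nn.Linear(L, D), nn.Sigmoid())
        self.attention_c = nn.Linear(D, n_classes)

    def forward(self, x):                  # x: [num_patches, L]
        a = self.attention_a(x)
        b = self.attention_b(x)
        A = self.attention_c(a * b)        # [num_patches, n_classes] raw attention scores
        return A, x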