I'm learning to use the library by writing a DEQ layer inside GPT-2. After `loss.backward()`, some model parameters do not get gradients. I tracked the issue down, and it turns out that the fixed-point solution `z_out[-1]` returned by the DEQ layer (a tensor of shape `(batch_size, sequence_length, hidden_dim)`) has `requires_grad` set to `False`. Strangely, if I use a freshly-initialized GPT-2 model instead (without the pretrained weights), the issue is gone.
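For context, the DEQ layer computes a fixed point of a weight-tied transformer block instead of running a fixed stack of layers. The sketch below is only an illustration of that idea with a naive unrolled loop; it is not my actual setup and does not use the library's solver or implicit differentiation:

```python
import torch
import torch.nn as nn


class NaiveDEQBlock(nn.Module):
    """Illustration only: turns one transformer block into a fixed-point layer by
    iterating z <- block(z + x) until the update is small. Plain unrolled loop,
    not the library's solver."""

    def __init__(self, block: nn.Module, max_iter: int = 30, tol: float = 1e-4):
        super().__init__()
        self.block = block          # e.g. one GPT2Block taken from model.h
        self.max_iter = max_iter
        self.tol = tol

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = torch.zeros_like(x)
        for _ in range(self.max_iter):
            # a transformers GPT2Block returns a tuple; element 0 is the hidden states
            z_next = self.block(z + x)[0]
            if (z_next - z).norm() <= self.tol * (z.norm() + 1e-8):
                z = z_next
                break
            z = z_next
        return z


# usage, e.g. wrapping the first block of a transformers GPT2Model:
#   deq_layer = NaiveDEQBlock(model.h[0])
#   z_star = deq_layer(hidden_states)
```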
If I print the `requires_grad` of the solution at each iteration, before and after the implicit function, it seems that the solver does converge in 9 iterations, but the returned hidden states have `requires_grad=False`, so the model's parameters (except those after the DEQ layer) do not get gradients. I tried manually setting `z_out[-1].requires_grad = True`, but that doesn't help; after `loss.backward()` the `.grad` of those parameters is still `None`.
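For reference, this is roughly how I check which parameters are affected after the backward pass (simplified from my training script):

```python
# run after loss.backward(): list the parameters that received no gradient
no_grad_params = [name for name, p in model.named_parameters() if p.grad is None]
print(no_grad_params)
# with the pretrained weights, everything up to and including the DEQ layer shows up here;
# only the parameters after the DEQ layer actually get a .grad
```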
Intriguingly, if I use a freshly-initialized GPT-2 then the issue seems to go away:
```python
import torch
from transformers import GPT2Tokenizer, GPT2Model

model_name = 'gpt2'
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# load pretrained GPT-2 with all dropout disabled
model = GPT2Model.from_pretrained(model_name, attn_pdrop=0.0, embd_pdrop=0.0, resid_pdrop=0.0, summary_first_dropout=0.0)

# discard the pretrained weights: re-initialize the model randomly from the same config
config = model.config
model = GPT2Model(config)

batch = ["we", "we"]
inputs = tokenizer(batch, return_tensors='pt', padding=True, truncation=True)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
inputs = {key: value.to(device) for key, value in inputs.items()}
model.to(device)

outputs = model(**inputs, use_cache=False)
```
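Then I check the returned hidden states (the `last_hidden_state` of the `GPT2Model` output) like this:

```python
# check the DEQ output that comes back through the model
hidden = outputs.last_hidden_state
print(hidden.shape, hidden.requires_grad)
```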
where it can be seen that `requires_grad` is now `True`. Also, apart from `requires_grad`, the `sradius` in `info` is now -1 instead of 0; I'm not sure whether that is related.
I wonder if you have ideas about why this happens. Would appreciate it!
Thanks for the great library!