lucidrains / x-transformers

A simple but complete full-attention transformer with a set of promising experimental features from various papers
MIT License

Random lack of gradients #256

Closed Baran-phys closed 1 month ago

Baran-phys commented 1 month ago

@lucidrains While training a model, I monitored my gradients, and randomly I get no gradients. Can dropout cause this?

```
  0%|          | 44/41765 [00:06<1:29:57, 7.73it/s]
  0%|          | 45/41765 [00:06<1:31:31, 7.60it/s]
  0%|          | 46/41765 [00:06<1:32:11, 7.54it/s]
All modules and their parameters have gradients.
All modules and their parameters have gradients.
All modules and their parameters have gradients.
All modules and their parameters have gradients.
Modules with no gradients:
Module: trans_1.attn_layers.layers.0.1.to_out
  Parameter: weight
Module: trans_1.attn_layers.rel_pos.mlp.0.0
  Parameter: weight
  Parameter: bias
Module: trans_1.attn_layers.layers.0.1.to_q
  Parameter: weight
Module: trans_1.attn_layers.layers.0.1.to_k
  Parameter: weight
Module: trans_1.attn_layers.layers.0.1.to_v_gate
  Parameter: weight
  Parameter: bias
Module: trans_1.attn_layers.rel_pos.mlp.2
  Parameter: weight
  Parameter: bias
Module: trans_1.attn_layers.rel_pos.mlp.1.0
  Parameter: weight
  Parameter: bias
Module: trans_1.attn_layers.layers.0.1.to_v
  Parameter: weight
Module: trans_1.attn_layers.layers.0.0.0
  Parameter: g
All modules and their parameters have gradients.
Modules with no gradients:
Module: trans_1.attn_layers.layers.0.1.to_out
  Parameter: weight
Module: trans_1.attn_layers.rel_pos.mlp.0.0
  Parameter: weight
  Parameter: bias
```

For example, this is the x-transformers part of my code:

```python
self.trans_1 = ContinuousTransformerWrapper(
    dim_in = 64,
    dim_out = 64,
    max_seq_len = 1500,
    emb_dropout = 0.1,
    use_abs_pos_emb = False,
    num_memory_tokens = 1,
    attn_layers = Encoder(
        dim = 256,
        depth = 1,
        heads = 4,
        rel_pos_bias = True,
        attn_gate_values = True,
        use_rmsnorm = True,
        layer_dropout = 0.1,
        attn_dropout = 0.1,
        ff_glu = True,
        ff_dropout = 0.1,
    )
)
```

I monitored the loss at different stages; there are no NaNs or infs in it. This is the function that checks the gradients:

```python
from torch import nn

def find_no_grad_modules(m: nn.Module) -> None:
    # collect, per module, the parameters that currently have no gradient
    no_grad_params = {n: [] for n, _ in m.named_modules()}
    no_grad_modules = set()

    for module_name, module in m.named_modules():
        for param_name, param in module.named_parameters(recurse=False):
            if param.grad is None:
                no_grad_params[module_name].append(param_name)
                no_grad_modules.add(module_name)

    if no_grad_modules:
        print("Modules with no gradients:")
        for module_name in no_grad_modules:
            print(f"Module: {module_name}")
            for param_name in no_grad_params[module_name]:
                print(f"  Parameter: {param_name}")
    else:
        print("All modules and their parameters have gradients.")
```

Is this behaviour healthy? If not, what do you think is causing it?

lucidrains commented 1 month ago

yup, that is caused by the layer dropout, i.e. the stochastic depth technique
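In other words, with layer dropout (stochastic depth) a whole residual block is occasionally skipped for the entire batch, so its parameters never enter the forward graph on that step and their `.grad` stays `None` after `backward()`. A minimal sketch of the idea, not the actual x-transformers implementation:

```python
import torch
from torch import nn

class StochasticDepthBlock(nn.Module):
    # with probability p the whole residual block is bypassed for the batch,
    # so its parameters take no part in the forward pass and get no gradient
    def __init__(self, dim, p = 0.1):
        super().__init__()
        self.p = p
        self.fn = nn.Linear(dim, dim)

    def forward(self, x):
        if self.training and torch.rand(1).item() < self.p:
            return x                  # block skipped for this step
        return x + self.fn(x)         # normal residual path

block = StochasticDepthBlock(8, p = 0.5)
x = torch.randn(4, 8, requires_grad = True)
block(x).sum().backward()
print(block.fn.weight.grad)           # None whenever the block was skipped, a tensor otherwise
```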