microsoft / DeBERTa

The implementation of DeBERTa
MIT License

Issues loading 1.5B model in huggingface and in deberta package #29

Closed · chessgecko closed this issue 3 years ago

chessgecko commented 3 years ago

Hello,

It seems like some of the weights were renamed/reshaped in the v2 model releases, and I couldn't quite figure out how to map them to the old structure.

# it seemed like 
pos_q_proj => query_proj
v_bias => value_proj

but I couldn't match

deberta.encoder.layer.44.attention.self.key_proj.weight, deberta.encoder.layer.44.attention.self.key_proj.bias
=>
deberta.encoder.layer.44.attention.self.q_bias, deberta.encoder.layer.44.attention.self.value_proj, deberta.encoder.layer.44.attention.self.in_proj.weight, deberta.encoder.layer.44.attention.self.pos_proj.weight
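
For reference, I was just dumping parameter names and shapes from the downloaded checkpoint, roughly like this (the path is a placeholder for wherever the Hugging Face pytorch_model.bin ends up locally):

import torch

# Rough sketch of how I compared checkpoints: load the state dict from the
# downloaded file and print every parameter name with its shape.
# "pytorch_model.bin" is a placeholder for the local v2 checkpoint path.
state = torch.load("pytorch_model.bin", map_location="cpu")
for name, tensor in sorted(state.items()):
    print(f"{name}: {tuple(tensor.shape)}")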

That was for huggingface, but I couldn't figure it out in this repo either.

Could someone upload the v2 model file?

BigBird01 commented 3 years ago

v2 is slightly different from v1, and the latest v2 code hasn't been integrated into HF transformers yet. To try our v2 model you need to use our official package for now; we will integrate our latest code with HF transformers soon.
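
Roughly, following the usage pattern in the repo README (a sketch only; double-check the README for the exact API, and 'xxlarge-v2' as the pre-trained id is an assumption):

import torch
from DeBERTa import deberta

class MyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Construct the DeBERTa backbone from a pre-trained id listed in the README.
        self.deberta = deberta.DeBERTa(pre_trained='xxlarge-v2')
        # ... your task-specific layers here ...
        # apply_state() loads the pre-trained weights at the end of the constructor.
        self.deberta.apply_state()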

chessgecko commented 3 years ago

Really sorry to keep bugging you about this, but I couldn't quite get the weights to line up with the code currently in this repo either:

encoder.layer.0.attention.self.q_bias: torch.Size([1536])
encoder.layer.0.attention.self.v_bias: torch.Size([1536])
encoder.layer.0.attention.self.in_proj.weight: torch.Size([4608, 1536])
encoder.layer.0.attention.self.pos_proj.weight: torch.Size([1536, 1536])
encoder.layer.0.attention.self.pos_q_proj.weight: torch.Size([1536, 1536])
encoder.layer.0.attention.self.pos_q_proj.bias: torch.Size([1536])

vs

deberta.encoder.layer.0.attention.self.query_proj.weight: torch.Size([1536, 1536])
deberta.encoder.layer.0.attention.self.query_proj.bias: torch.Size([1536])
deberta.encoder.layer.0.attention.self.key_proj.weight: torch.Size([1536, 1536])
deberta.encoder.layer.0.attention.self.key_proj.bias: torch.Size([1536])
deberta.encoder.layer.0.attention.self.value_proj.weight: torch.Size([1536, 1536])
deberta.encoder.layer.0.attention.self.value_proj.bias: torch.Size([1536])

It seems like they should match, but I wasn't quite sure what went where.
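
If I'm reading the v1 disentangled_attention code right, in_proj packs per-head [query, key, value] blocks (4608 = 3 x 1536), so a remap from the separate v2 projections would look roughly like the sketch below. The 24-head count for xxlarge is my assumption, and this says nothing about the attention code itself being compatible:

import torch

# Rough sketch only: pack separate query/key/value projection weights into a
# v1-style in_proj layout, assuming it interleaves per-head [q, k, v] blocks.
hidden_size, num_heads = 1536, 24        # 24 heads is an assumption for xxlarge
head_size = hidden_size // num_heads     # 64

# Stand-ins for query_proj.weight / key_proj.weight / value_proj.weight
q_w = torch.randn(hidden_size, hidden_size)
k_w = torch.randn(hidden_size, hidden_size)
v_w = torch.randn(hidden_size, hidden_size)

blocks = []
for h in range(num_heads):
    rows = slice(h * head_size, (h + 1) * head_size)
    blocks += [q_w[rows], k_w[rows], v_w[rows]]

in_proj_w = torch.cat(blocks, dim=0)
print(in_proj_w.shape)                   # torch.Size([4608, 1536])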

Also, is the code to run the model all the same?

This is with the weights at https://huggingface.co/microsoft/deberta-xxlarge-v2/tree/main

BigBird01 commented 3 years ago

The code is different. Please check https://github.com/microsoft/DeBERTa/blob/penhe/debertav2/DeBERTa/deberta/disentangled_attention.py for the differences.

chessgecko commented 3 years ago

Working for me now, thanks!