google-research / bert

TensorFlow code and pre-trained models for BERT
https://arxiv.org/abs/1810.04805
Apache License 2.0

Best performance on concatenated layers: which dimension? #511

Closed BramVanroy closed 5 years ago

BramVanroy commented 5 years ago

In your paper (section 5.4, table 7) you indicate that concatenating the last four layers gave the best performance, but details are scarce. I am not sure how to interpret this, i.e. over which axis the concatenation takes place.

Assume the output of a layer is batch_size, seq_length, hidden. For a batch of one sentence with 8 tokens, that would be 1, 8, 1024. Concatenating over the default dim=0 would give batch_size*layers, seq_length, hidden, where layers is the number of layers being concatenated. But concatenating over the first dimension doesn't seem to make sense. So my question is: over which dimension do you concatenate? My guess would be the last one, leaving you with batch_size, seq_length, hidden*layers. Is this correct?
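For concreteness, here is a minimal shape check with dummy tensors of those sizes (plain PyTorch, not BERT itself), comparing the two candidate concatenation axes:

```python
import torch

# four dummy "layer outputs" of shape [batch_size=1, seq_length=8, hidden=1024]
layers = [torch.randn(1, 8, 1024) for _ in range(4)]

print(torch.cat(layers, dim=0).shape)   # torch.Size([4, 8, 1024]) -> layers*batch_size, seq_length, hidden
print(torch.cat(layers, dim=-1).shape)  # torch.Size([1, 8, 4096]) -> batch_size, seq_length, hidden*layers
```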

dsindex commented 5 years ago

I think you are correct. The relevant passage from the paper:

> In this section we evaluate how well BERT performs in the feature-based approach by generating ELMo-like pre-trained contextual representations on the CoNLL-2003 NER task. To do this, we use the same input representation as in Section 4.3, but use the activations from one or more layers without fine-tuning any parameters of BERT. These contextual embeddings are used as input to a randomly initialized two-layer 768-dimensional BiLSTM before the classification layer.

For the large model:

- BERT output: [batch_size, seq_length, 4*hidden_size]
- LSTM layer: 2*384 (bidirectional)
- softmax projection: 11 (9 tags plus the 'CLS' and 'X' tags)
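A rough PyTorch sketch of that feature-based setup, using the layer sizes from the excerpt and the list above; the class and argument names are illustrative, not taken from the paper's code:

```python
import torch
import torch.nn as nn

class FeatureBasedTagger(nn.Module):
    """Frozen BERT features -> two-layer BiLSTM -> per-token classification layer."""
    def __init__(self, feature_size=4 * 1024, lstm_size=384, num_tags=11):
        super().__init__()
        # bidirectional, so the LSTM output is 2 * lstm_size per token
        self.lstm = nn.LSTM(feature_size, lstm_size, num_layers=2,
                            bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(2 * lstm_size, num_tags)

    def forward(self, bert_features):
        # bert_features: [batch_size, seq_length, 4 * hidden_size];
        # BERT itself is frozen, only the BiLSTM and classifier are trained
        lstm_out, _ = self.lstm(bert_features)
        return self.classifier(lstm_out)  # [batch_size, seq_length, num_tags]
```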

BramVanroy commented 5 years ago

@dsindex I think so, too. Closing.

SlimenBouras commented 4 years ago

> In your paper (section 5.4, table 7) you indicate that concatenating the last four layers gave the best performance, but details are scarce. I am not sure how to interpret this, i.e. over which axis the concatenation takes place.
>
> Assume the output of a layer is batch_size, seq_length, hidden. For a batch of one sentence with 8 tokens, that would be 1, 8, 1024. Concatenating over the default dim=0 would give batch_size*layers, seq_length, hidden, where layers is the number of layers being concatenated. But concatenating over the first dimension doesn't seem to make sense. So my question is: over which dimension do you concatenate? My guess would be the last one, leaving you with batch_size, seq_length, hidden*layers. Is this correct?

Hi, can you tell me how you concatenated the layers? I tried Hanxiao's visualization file, but there seem to be problems with the servers. I would like to visualize the best layer. Thanks.

BramVanroy commented 4 years ago

I have no experience with TensorFlow and it will depend on your implementation, but using transformers with PyTorch you can do something like:

```python
out = model(...)  # the model must be configured/called with output_hidden_states=True
hidden_states = out[2]  # tuple: embedding output + one [batch, seq, hidden] tensor per layer
cat = torch.cat([hidden_states[i] for i in [-1, -2, -3, -4]], dim=-1)  # -> [batch, seq, 4*hidden]
```
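The resulting `cat` has shape [batch_size, seq_length, 4*hidden_size], i.e. concatenation over the last dimension, as discussed above. (In recent transformers versions the forward pass returns a ModelOutput object, so `out.hidden_states` is an equivalent, more readable accessor.)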