I think you are correct.
In this section we evaluate how well BERT performs in the feature-based approach by generating ELMo-like pre-trained contextual representations on the CoNLL-2003 NER task. To do this, we use the same input representation as in Section 4.3, but use the activations from one or more layers without fine-tuning any parameters of BERT. These contextual embeddings are used as input to a randomly initialized two-layer 768-dimensional BiLSTM before the classification layer.
For the large model:
- BERT output = [batch_size, seq_length, 4*hidden_size]
- LSTM layer: 2*384 (bidirectional)
- softmax projection: 11 (9 NER labels plus the 'CLS' and 'X' tags)
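For what it's worth, here is a rough PyTorch sketch (using the transformers library) of that setup: BERT frozen, the last four layers concatenated, a two-layer BiLSTM with 384 units per direction (2*384 = 768), and a linear softmax projection to 11 tags. The model name, tag count and sizes just follow the numbers above; this is an illustration, not the paper's actual implementation.

import torch
import torch.nn as nn
from transformers import BertModel

class FeatureBasedTagger(nn.Module):
    def __init__(self, num_tags=11, hidden_size=1024):
        super().__init__()
        # feature-based approach: BERT is used as a frozen feature extractor
        self.bert = BertModel.from_pretrained("bert-large-cased", output_hidden_states=True)
        for p in self.bert.parameters():
            p.requires_grad = False
        # input to the BiLSTM is the concatenation of the last four layers: 4 * hidden_size
        self.lstm = nn.LSTM(4 * hidden_size, 384, num_layers=2,
                            bidirectional=True, batch_first=True)  # 2 * 384 = 768 output dims
        self.classifier = nn.Linear(2 * 384, num_tags)  # softmax projection to the tag set

    def forward(self, input_ids, attention_mask=None):
        with torch.no_grad():
            out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        hidden_states = out[2]  # embeddings + one tensor per layer, each (batch, seq, hidden)
        feats = torch.cat([hidden_states[i] for i in (-1, -2, -3, -4)], dim=-1)  # (batch, seq, 4*hidden)
        lstm_out, _ = self.lstm(feats)    # (batch, seq, 2*384)
        return self.classifier(lstm_out)  # per-token tag logits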
@dsindex I think so, too. Closing.
In your paper (section 5.4, table 7) you indicate that concatenating the last four layers gave the best performance, but details are scarce. I am not sure how this works in practice, i.e. over which axis the concatenation takes place.
Assume the output of a layer is (batch_size, seq_length, hidden). For a batch of one sentence with 8 tokens, that would be (1, 8, 1024). When you concatenate the layers over the default dim=0, that would lead to (batch_size*layers, seq_length, hidden), where layers is the number of layers that you concatenate. But concatenating over the first dimension doesn't seem to make sense. So my question is: on which dimension do you concatenate? My guess would be the last one, leaving you with (batch_size, seq_length, hidden*layers). Is this correct? (A quick shape check is below.)
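To make the two options concrete, here is a quick shape check with dummy tensors (purely illustrative):

import torch
# four dummy "layer outputs" for a batch of one sentence with 8 tokens and hidden size 1024
layers = [torch.randn(1, 8, 1024) for _ in range(4)]
print(torch.cat(layers, dim=0).shape)   # torch.Size([4, 8, 1024])  -> (batch_size*layers, seq_length, hidden)
print(torch.cat(layers, dim=-1).shape)  # torch.Size([1, 8, 4096])  -> (batch_size, seq_length, hidden*layers)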
Hi, can you tell me how you concatenated the layers? I tried Hanxiao's visualization file, but it seems there are problems with the servers. I would like to visualize the best layer. Thanks.
I have no experience with Tensorflow and it will depend on your implementation, but using transformers with PyTorch you can do something like:

import torch
out = model(...)  # a BertModel loaded with output_hidden_states=True
hidden_states = out[2]  # tuple: embedding output plus one tensor per layer, each (batch_size, seq_length, hidden)
# concatenate the last four layers along the last (hidden) dimension
cat = torch.cat([hidden_states[i] for i in [-1, -2, -3, -4]], dim=-1)
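With BERT-large (hidden size 1024), cat then has shape (batch_size, seq_length, 4096), i.e. the concatenation happens over the last (hidden) dimension, as guessed above.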