google-research / bert

TensorFlow code and pre-trained models for BERT
https://arxiv.org/abs/1810.04805
Apache License 2.0

where is the classifier after fine-tuning bert? #253

Open rxy1212 opened 5 years ago

rxy1212 commented 5 years ago

I just fine-tuned BERT on a classification task, and I noticed that a classifier is appended after BERT's output during fine-tuning.

create_model(...) in run_classifier.py

...
with tf.variable_scope("loss"):
  if is_training:
    # I.e., 0.1 dropout
    output_layer = tf.nn.dropout(output_layer, keep_prob=0.9)

  # output_weights / output_bias form the task-specific classification layer
  # added on top of BERT's pooled [CLS] output for fine-tuning.
  logits = tf.matmul(output_layer, output_weights, transpose_b=True)
  logits = tf.nn.bias_add(logits, output_bias)
  probabilities = tf.nn.softmax(logits, axis=-1)
  log_probs = tf.nn.log_softmax(logits, axis=-1)

  one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32)

  # Softmax cross-entropy over the task labels.
  per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)
  loss = tf.reduce_mean(per_example_loss)
...

The fine-tuned model can predict on my test set, and the result looks like "probability of 0, probability of 1" (it's a sentence-similarity task). What confuses me is that when I feed a sentence to the fine-tuned model (not through run_classifier.py, but by serving the model with bert-as-service, https://github.com/hanxiao/bert-as-service), I get a vector of 768 values. Does that mean we only use the classifier to fine-tune BERT, and once fine-tuning is done the classifier is discarded? So do we have to train another classifier for our own task, such as sentence similarity?

astariul commented 5 years ago

Between BERT's bidirectional Transformer and the output, there is a classification layer. You need this layer for fine-tuning, because the output should be a class, not embeddings (the Transformer outputs embeddings).

BERT-as-service gives you a sentence embedding, so of course they removed the classification layer in order to expose the embeddings. That's why you get a 768-dimensional vector: that is the sentence embedding.
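For what it's worth, a minimal sketch of retrieving those embeddings with the bert-as-service client looks roughly like this (assuming the bert-serving-start server is already running against your checkpoint; the sentences are placeholders):

from bert_serving.client import BertClient

bc = BertClient()
vecs = bc.encode(['how old are you', 'what is your age'])
print(vecs.shape)  # (2, 768): one 768-dimensional sentence embedding per input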

once fine-tuning is done the classifier is discarded?

Yes.

So do we have to train another classifier for our own task, such as sentence similarity?

Yes, because what BERT-as-service gives you is an embedding. On top of this, you need to add a classification layer fitted to your own data.
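As a rough sketch (this is not part of run_classifier.py; the names and shapes below are my own assumptions), a small classifier trained on top of the 768-dimensional embeddings could look like this:

import tensorflow as tf

num_labels = 2  # e.g. similar / not similar

# Sentence embeddings from bert-as-service (768-dimensional) and integer labels.
embeddings = tf.placeholder(tf.float32, [None, 768])
labels = tf.placeholder(tf.int32, [None])

# The new task-specific classification layer trained on top of the embeddings.
logits = tf.layers.dense(embeddings, num_labels)
per_example_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=labels, logits=logits)
loss = tf.reduce_mean(per_example_loss)
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)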

rxy1212 commented 5 years ago

That makes sense, thank you!

sangbraj commented 5 years ago

Could you please let me know the steps you followed and the parameters you passed for fine-tuning? Do I have to call run_classifier.py's create_model() directly, or just run run_classifier.py?
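For reference, is the usual approach just to run run_classifier.py with the README-style flags, roughly like below (all paths are placeholders for my own data and the downloaded checkpoint)?

python run_classifier.py \
  --task_name=MRPC \
  --do_train=true \
  --do_eval=true \
  --data_dir=$GLUE_DIR/MRPC \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --max_seq_length=128 \
  --train_batch_size=32 \
  --learning_rate=2e-5 \
  --num_train_epochs=3.0 \
  --output_dir=/tmp/mrpc_output/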

bayou3 commented 5 years ago

Hello, I see that the cross-entropy loss is computed from the two variables log_probs and one_hot_labels, i.e. per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1). May I ask how to change this to a hinge loss?
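For example, would something like the following one-vs-all hinge be a reasonable drop-in replacement for that block? (Just a sketch; signed_labels is a name I introduce here, and one_hot_labels / logits come from create_model.)

# Map the {0, 1} one-hot labels to {-1, +1} and apply a per-class hinge.
signed_labels = 2.0 * one_hot_labels - 1.0
per_class_hinge = tf.nn.relu(1.0 - signed_labels * logits)
per_example_loss = tf.reduce_sum(per_class_hinge, axis=-1)
loss = tf.reduce_mean(per_example_loss)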