kyzhouhzau / BERT-NER

Use Google's BERT for named entity recognition (CoNLL-2003 as the dataset).
MIT License

Question: loss in padding sequence #3

Open · hyli666 opened this issue 5 years ago

hyli666 commented 5 years ago

Hi! Thank you so much for releasing the BERT-NER code. I have a question: the sentences are padded to max_sequence_length and the labels in the padded part are set to 0. But when the total loss is computed in your code, I found that both the real tokens and the padded positions are included:

```python
one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32)
per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)
loss = tf.reduce_sum(per_example_loss)
```

I am wondering whether this is reasonable. Should we mask out the padded part?

kyzhouhzau commented 5 years ago

@hyli666 Thank you. You are right, it is necessary to do masking; without it the total loss is affected and the calculation speed drops. But I think the influence is limited and won't push the results in a worse direction. Besides, I filter out the padded positions when I evaluate the F-score. I will change the code when I am free. Thank you so much!
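
For anyone landing on this thread: below is a minimal sketch of what such masking could look like, assuming the standard BERT input_mask tensor (1 for real tokens, 0 for padding) is available alongside labels and log_probs and shares their [batch, seq_len] layout. This is my own illustration, not the repository's current code.

```python
# Hypothetical masked variant of the loss above (assumes input_mask is available):
# zero out the contribution of padded positions, then normalise by the number
# of real tokens so batches with different padding amounts are comparable.
mask = tf.cast(input_mask, tf.float32)        # [batch, seq_len], 1.0 = real token
one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32)
per_token_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)
masked_loss = per_token_loss * mask           # padded positions contribute 0
loss = tf.reduce_sum(masked_loss) / (tf.reduce_sum(mask) + 1e-12)
```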

owen-chenhn commented 5 years ago

@kyzhouhzau Hello! Thanks for providing the code to use BERT for the NER task. It really helps me a lot! I have some similar questions about both the loss calculation and the size of labels, as mentioned in issue #4. Could you please explain the masking method in more detail? I have no idea how to calculate a loss that eliminates the influence of the padded part (though I did read the commented-out part of your code about masked loss calculation). Thank you very much!
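
In case it helps, here is a small self-contained toy example (my own illustration, not code from this repository) contrasting the unmasked and masked losses on a padded batch. It runs under TF 2 eager execution for simplicity, while the repository itself targets TF 1; the masking idea is the same.

```python
import tensorflow as tf

num_labels = 3
# Toy batch: one sentence of length 5, of which only the first 3 tokens are real.
labels = tf.constant([[1, 2, 1, 0, 0]])                        # 0 = padding label
input_mask = tf.constant([[1, 1, 1, 0, 0]], dtype=tf.float32)  # 1 = real token
logits = tf.random.normal([1, 5, num_labels])
log_probs = tf.nn.log_softmax(logits, axis=-1)

one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32)
per_token_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)

# Unmasked: padded positions add spurious loss terms.
unmasked_loss = tf.reduce_sum(per_token_loss)
# Masked: only real tokens contribute, averaged over their count.
masked_loss = tf.reduce_sum(per_token_loss * input_mask) / tf.reduce_sum(input_mask)

print("unmasked:", float(unmasked_loss), "masked:", float(masked_loss))
```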