huggingface / transformers

πŸ€— Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers

Hidden states of BertForPreTraining (load from tf ckpt of google original bert) not exactly equal to the output of extract_features.py in google original bert #14221

Closed · bengshaoye closed this issue 2 years ago

bengshaoye commented 3 years ago

Environment info

Who can help

@LysandreJik

Information

Model I am using (Bert, XLNet ...):

The problem arises when using:

The task I am working on is:

To reproduce

Steps to reproduce the behavior:

  1. Load the official bert-base-chinese ckpt into PyTorch's BertForPreTraining.
  2. Get the hidden states of the -2 (or -1) layer (see the sketch after this list).
  3. Run extract_features.py to get the -2 (or -1) layer hidden states of the original ckpt model with TensorFlow.
  4. The values on the two sides are not exactly equal; there is some bias.
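
A minimal sketch of steps 1–2, assuming a local copy of the Google checkpoint (the directory path and input text below are placeholders):

```python
from transformers import BertConfig, BertForPreTraining, BertTokenizerFast

ckpt_dir = 'chinese_L-12_H-768_A-12'  # placeholder: local Google checkpoint directory

config = BertConfig.from_json_file(f'{ckpt_dir}/bert_config.json')
model = BertForPreTraining.from_pretrained(f'{ckpt_dir}/bert_model.ckpt', from_tf=True, config=config)
tokenizer = BertTokenizerFast.from_pretrained(ckpt_dir)

inputs = tokenizer('some input text', return_tensors='pt')
outputs = model(**inputs, output_hidden_states=True)
penultimate = outputs.hidden_states[-2]  # layer -2 activations, shape (batch, seq_len, hidden)
```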

Expected behavior

A PyTorch BERT loaded from a Google ckpt should produce the same outputs as the original TensorFlow BERT.

qqaatw commented 3 years ago

Hi,

Would you mind elaborating on the approximate range of the bias? Also, is the hidden-states output of BertForPreTraining deterministic?

If the output is deterministic but slightly different from TensorFlow's output (differences smaller than roughly 1e-5), this is probably normal behavior due to different BLAS implementations across platforms, frameworks, etc.
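
Roughly, the two checks could look like this (a sketch only; `model` and `inputs` are assumed to be set up as in your report):

```python
import torch

# Determinism check: the same input twice should give bit-identical outputs in eval mode.
model.eval()
with torch.no_grad():
    run_a = model(**inputs, output_hidden_states=True).hidden_states[-1]
    run_b = model(**inputs, output_hidden_states=True).hidden_states[-1]
print(torch.equal(run_a, run_b))

# Cross-framework check: with tf_values loaded from the extract_features.py output,
# a failure at atol=1e-5 would point to a real conversion issue rather than BLAS noise.
# print(torch.allclose(run_a[0, 0], tf_values, atol=1e-5))
```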

See also: https://github.com/pytorch/pytorch/issues/9146#issuecomment-409331986

bengshaoye commented 2 years ago

@qqaatw thank you for answering. Yes, the hidden-states outputs of both BertForPreTraining and google-research BERT are deterministic. Sometimes the bias is small, maybe 1e-3 to 1e-4, when using chinese_L-12_H-768_A-12, as shown below:

google-research:    [0.41173, 0.086385, 0.705549, 0.224586, 0.751009, -1.071174, -0.455632, -0.390582, -0.523216, 0.520333, ...]
BertForPreTraining: [0.411758, 0.0876196, 0.705667, 0.224652, 0.75167, -1., -0.45543, -0.391009, -0.524803, 0.518317, ...]

Sometimes the bias is large, maybe 1e-2 to 1e-1, when using a model fine-tuned from chinese_L-12_H-768_A-12, as shown below:

google-research:    [0.000858, 0.355273, -0.711266, 0.258692, 1.342211, -0.072978, -0.238096, 0.288613, -0.121792, -0.37079, ...]
BertForPreTraining: [0.017701, 0.348385, -0.742679, 0.240423, 1.337542, -0.0840113, -0.23040, 0.281977, -0.1528175, -0.3525075, ...]
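
A quick sketch quantifying those gaps, using the first five well-formed values of each list above:

```python
import numpy as np

tf_pretrained = np.array([0.41173, 0.086385, 0.705549, 0.224586, 0.751009])
pt_pretrained = np.array([0.411758, 0.0876196, 0.705667, 0.224652, 0.75167])
print(np.abs(tf_pretrained - pt_pretrained).max())  # ~1e-3, already above BLAS-level noise

tf_finetuned = np.array([0.000858, 0.355273, -0.711266, 0.258692, 1.342211])
pt_finetuned = np.array([0.017701, 0.348385, -0.742679, 0.240423, 1.337542])
print(np.abs(tf_finetuned - pt_finetuned).max())    # ~3e-2, far too large for numeric noise
```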

bengshaoye commented 2 years ago

@qqaatw PS: the hidden_states from BertForPreTraining were fetched as follows:

```python
from transformers import BertConfig, BertForPreTraining, BertTokenizerFast

config = BertConfig.from_json_file('d:/workspace/bert-google/chinese_L-12_H-768_A-12/bert_config.json')
model = BertForPreTraining.from_pretrained(
    'd:/workspace/bert-google/chinese_L-12_H-768_A-12/bert_model.ckpt',
    from_tf=True,
    config=config,
)
tokenizer = BertTokenizerFast.from_pretrained('d:/workspace/bert-google/chinese_L-12_H-768_A-12/')

inputs = tokenizer('ηœ‹θ§ζˆ‘η«ιΎ™ζžœδΊ†ε—', return_tensors='pt')  # Chinese: "Have you seen my dragon fruit?"
outputs = model(**inputs, output_hidden_states=True)
print(outputs.hidden_states[-1][0, 0, :10].tolist())
```

And the features of each output layer on the TensorFlow side were fetched with the extract_features.py script from the google-research BERT repo.
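
For reference, the TensorFlow-side values can be read back from that script's JSON output roughly like this (field names as I recall them from google-research/bert's extract_features.py; the file name is a placeholder):

```python
import json

# Each line of the output file is one JSON record per input line.
with open('output.jsonl') as f:
    record = json.loads(f.readline())

cls_feature = record['features'][0]          # features for the [CLS] token
layer = cls_feature['layers'][0]             # first layer requested via --layers
print(layer['index'], layer['values'][:10])  # compare with the PyTorch print above
```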

qqaatw commented 2 years ago

Hey @bengshaoye,

```python
config = BertConfig.from_json_file('d:/workspace/bert-google/chinese_L-12_H-768_A-12/bert_config.json')
```

Could you change the hidden_act from gelu to gelu_new in bert_config.json and try again?
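
For context: the original google-research BERT implements its gelu as the tanh approximation, which Transformers exposes as gelu_new, while Transformers' gelu is the exact erf-based form. A minimal sketch of the per-activation difference, with the two formulas written out directly:

```python
import math
import torch

def gelu_erf(x):
    # Exact GELU; what Transformers uses for hidden_act="gelu".
    return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # Tanh approximation used by the original google-research BERT;
    # selected in Transformers via hidden_act="gelu_new".
    return 0.5 * x * (1.0 + torch.tanh(
        math.sqrt(2.0 / math.pi) * (x + 0.044715 * torch.pow(x, 3.0))))

x = torch.linspace(-6.0, 6.0, steps=1201)
# The per-element gap is tiny, but it compounds across 12 transformer layers.
print((gelu_erf(x) - gelu_tanh(x)).abs().max())
```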

bengshaoye commented 2 years ago

Thanks a lot. gelu_new works fine for BertForPreTraining; the outputs now match between the original BERT (gelu) and the Transformers BERT (gelu_new).
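
A minimal sketch of applying the same fix in code, rather than editing bert_config.json by hand (paths as in the report above):

```python
from transformers import BertConfig, BertForPreTraining

config = BertConfig.from_json_file('d:/workspace/bert-google/chinese_L-12_H-768_A-12/bert_config.json')
config.hidden_act = 'gelu_new'  # match the original TF BERT's tanh-approximated GELU

model = BertForPreTraining.from_pretrained(
    'd:/workspace/bert-google/chinese_L-12_H-768_A-12/bert_model.ckpt',
    from_tf=True,
    config=config,
)
```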