huggingface / transformers

πŸ€— Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers

Hidden states of BertForPreTraining (load from tf ckpt of google original bert) not exactly equal to the output of extract_features.py in google original bert #14221

Closed · bengshaoye closed this issue 2 years ago

bengshaoye commented 3 years ago

Environment info

Who can help

@LysandreJik

Information

Model I am using (Bert, XLNet ...):

The problem arises when using:

The task I am working on is:

To reproduce

Steps to reproduce the behavior:

  1. Load the official bert-base-chinese ckpt into PyTorch's BertForPreTraining.
  2. Get the hidden states of the -2 (or -1) layer (see the sketch after this list).
  3. Run extract_features.py to get the -2 (or -1) layer hidden states of the original ckpt model with TensorFlow.
  4. The values on the two sides are not exactly equal; there is some bias.
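
A minimal sketch of steps 1–2, assuming a local copy of the Google checkpoint (the directory path and input text below are placeholders):

```python
from transformers import BertConfig, BertForPreTraining, BertTokenizerFast

ckpt_dir = 'chinese_L-12_H-768_A-12'  # placeholder: local Google checkpoint directory

config = BertConfig.from_json_file(f'{ckpt_dir}/bert_config.json')
model = BertForPreTraining.from_pretrained(f'{ckpt_dir}/bert_model.ckpt', from_tf=True, config=config)
tokenizer = BertTokenizerFast.from_pretrained(ckpt_dir)

inputs = tokenizer('some input text', return_tensors='pt')
outputs = model(**inputs, output_hidden_states=True)
penultimate = outputs.hidden_states[-2]  # layer -2 activations, shape (batch, seq_len, hidden)
```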

Expected behavior

A PyTorch BERT loaded from a Google ckpt should produce the same outputs as the original TensorFlow BERT.

qqaatw commented 3 years ago

Hi,

Would you mind elaborating on the approximate range of the bias? Also, is the hidden-states output of BertForPreTraining deterministic?

If the output is deterministic but slightly different from TensorFlow's output (differences smaller than roughly 1e-5), this is probably normal behavior due to different BLAS implementations across platforms, frameworks, etc.
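
Roughly, the two checks could look like this (a sketch only; `model` and `inputs` are assumed to be set up as in your report):

```python
import torch

# Determinism check: the same input twice should give bit-identical outputs in eval mode.
model.eval()
with torch.no_grad():
    run_a = model(**inputs, output_hidden_states=True).hidden_states[-1]
    run_b = model(**inputs, output_hidden_states=True).hidden_states[-1]
print(torch.equal(run_a, run_b))

# Cross-framework check: with tf_values loaded from the extract_features.py output,
# a failure at atol=1e-5 would point to a real conversion issue rather than BLAS noise.
# print(torch.allclose(run_a[0, 0], tf_values, atol=1e-5))
```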

See also: https://github.com/pytorch/pytorch/issues/9146#issuecomment-409331986

bengshaoye commented 2 years ago

@qqaatw thank you for answering. Yes, the hidden-states outputs of both BertForPreTraining and google-research BERT are deterministic. Sometimes the bias is small, maybe 1e-3 to 1e-4, when using chinese_L-12_H-768_A-12, as shown below:

google-research:    [0.41173, 0.086385, 0.705549, 0.224586, 0.751009, -1.071174, -0.455632, -0.390582, -0.523216, 0.520333, ...]
BertForPreTraining: [0.411758, 0.0876196, 0.705667, 0.224652, 0.75167, -1., -0.45543, -0.391009, -0.524803, 0.518317, ...]

Sometimes the bias is large, maybe 1e-2 to 1e-1, when using a model fine-tuned from chinese_L-12_H-768_A-12, as shown below:

google-research:    [0.000858, 0.355273, -0.711266, 0.258692, 1.342211, -0.072978, -0.238096, 0.288613, -0.121792, -0.37079, ...]
BertForPreTraining: [0.017701, 0.348385, -0.742679, 0.240423, 1.337542, -0.0840113, -0.23040, 0.281977, -0.1528175, -0.3525075, ...]
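
A quick sketch quantifying those gaps, using the first five well-formed values of each list above:

```python
import numpy as np

tf_pretrained = np.array([0.41173, 0.086385, 0.705549, 0.224586, 0.751009])
pt_pretrained = np.array([0.411758, 0.0876196, 0.705667, 0.224652, 0.75167])
print(np.abs(tf_pretrained - pt_pretrained).max())  # ~1e-3, already above BLAS-level noise

tf_finetuned = np.array([0.000858, 0.355273, -0.711266, 0.258692, 1.342211])
pt_finetuned = np.array([0.017701, 0.348385, -0.742679, 0.240423, 1.337542])
print(np.abs(tf_finetuned - pt_finetuned).max())    # ~3e-2, far too large for numeric noise
```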

bengshaoye commented 2 years ago

@qqaatw PS: the hidden_states from BertForPreTraining were fetched as follows:

```python
from transformers import BertConfig, BertForPreTraining, BertTokenizerFast

config = BertConfig.from_json_file('d:/workspace/bert-google/chinese_L-12_H-768_A-12/bert_config.json')
model = BertForPreTraining.from_pretrained(
    'd:/workspace/bert-google/chinese_L-12_H-768_A-12/bert_model.ckpt',
    from_tf=True,
    config=config,
)
tokenizer = BertTokenizerFast.from_pretrained('d:/workspace/bert-google/chinese_L-12_H-768_A-12/')

inputs = tokenizer('ηœ‹θ§ζˆ‘η«ιΎ™ζžœδΊ†ε—', return_tensors='pt')  # Chinese: "Have you seen my dragon fruit?"
outputs = model(**inputs, output_hidden_states=True)
print(outputs.hidden_states[-1][0, 0, :10].tolist())
```

And the features of each output layer on the TensorFlow side were fetched with the extract_features.py script from the google-research BERT repo.
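
For reference, the TensorFlow-side values can be read back from that script's JSON output roughly like this (field names as I recall them from google-research/bert's extract_features.py; the file name is a placeholder):

```python
import json

# Each line of the output file is one JSON record per input line.
with open('output.jsonl') as f:
    record = json.loads(f.readline())

cls_feature = record['features'][0]          # features for the [CLS] token
layer = cls_feature['layers'][0]             # first layer requested via --layers
print(layer['index'], layer['values'][:10])  # compare with the PyTorch print above
```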

qqaatw commented 2 years ago

Hey @bengshaoye,

```python
config = BertConfig.from_json_file('d:/workspace/bert-google/chinese_L-12_H-768_A-12/bert_config.json')
```

Could you change the hidden_act from gelu to gelu_new in bert_config.json and try again?
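
For context: the original google-research BERT implements its gelu as the tanh approximation, which Transformers exposes as gelu_new, while Transformers' gelu is the exact erf-based form. A minimal sketch of the per-activation difference, with the two formulas written out directly:

```python
import math
import torch

def gelu_erf(x):
    # Exact GELU; what Transformers uses for hidden_act="gelu".
    return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # Tanh approximation used by the original google-research BERT;
    # selected in Transformers via hidden_act="gelu_new".
    return 0.5 * x * (1.0 + torch.tanh(
        math.sqrt(2.0 / math.pi) * (x + 0.044715 * torch.pow(x, 3.0))))

x = torch.linspace(-6.0, 6.0, steps=1201)
# The per-element gap is tiny, but it compounds across 12 transformer layers.
print((gelu_erf(x) - gelu_tanh(x)).abs().max())
```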

bengshaoye commented 2 years ago

Thanks a lot. gelu_new works fine for BertForPreTraining; the outputs now match between the original BERT (gelu) and the Transformers BERT (gelu_new).
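
A minimal sketch of applying the same fix in code, rather than editing bert_config.json by hand (paths as in the report above):

```python
from transformers import BertConfig, BertForPreTraining

config = BertConfig.from_json_file('d:/workspace/bert-google/chinese_L-12_H-768_A-12/bert_config.json')
config.hidden_act = 'gelu_new'  # match the original TF BERT's tanh-approximated GELU

model = BertForPreTraining.from_pretrained(
    'd:/workspace/bert-google/chinese_L-12_H-768_A-12/bert_model.ckpt',
    from_tf=True,
    config=config,
)
```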