Closed marwage closed 3 years ago
Hi there,
The pre-trained BERT models work with any model whose sequence length is no greater than max_position_embeddings, which is 512 by default.
When fine-tuning on a different dataset, you need to change the seq_length option in finetune_eval_config.py.
https://github.com/mindspore-ai/mindspore/blob/master/model_zoo/official/nlp/bert/src/finetune_eval_config.py#L48
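The constraint described in this reply can be checked before a run is launched. The following is a minimal sketch in plain Python; check_seq_length is a hypothetical helper for illustration and is not part of MindSpore, and 512 is simply the default mentioned above.

```python
def check_seq_length(seq_length, max_position_embeddings=512):
    """Return seq_length if the pre-trained position table can cover it.

    Hypothetical helper for illustration; not part of MindSpore.
    """
    if seq_length <= 0:
        raise ValueError("seq_length must be positive, got %d" % seq_length)
    if seq_length > max_position_embeddings:
        raise ValueError(
            "seq_length %d exceeds max_position_embeddings %d"
            % (seq_length, max_position_embeddings))
    return seq_length


# The SQuAD v1.1 run in this thread uses 384 tokens, which fits in 512.
squad_seq_length = check_seq_length(384)
```

A check like this only validates the number; the seq_length option in the config file still has to be set to the value the dataset was preprocessed with.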
Thank you for your response! It resolved the error.
We have a new error though.
RuntimeError: mindspore/ccsrc/runtime/device/gpu/kernel_info_setter.cc:83 SupportedTypeList] Unsupported op [NPUAllocFloatStatus]
We assume that we get the error because we are trying to run the fine-tuning on a GPU and the pre-training was done with Ascend.
Is this correct and we need a pre-trained model that was trained on a GPU?
Additionally, there is still the issue that the model we have access to is trained in Chinese and Squad is in English.
The pre-trained model from MindSpore works with all device targets, including GPU, NPU, and CPU. The checkpoint contains only the parameter weights; nothing in it is tied to a device.
NPUAllocFloatStatus is an operation that checks the overflow status on Ascend and should not be used when training on a GPU.
https://github.com/mindspore-ai/mindspore/blob/68f49c5ff1920ae30680c185d983ba1962cae9b0/model_zoo/official/nlp/bert/src/bert_for_finetune.py#L86
However, I found that run_squad.py uses BertSquadCell for training instead of BertFinetuneCell, and BertSquadCell does not handle the GPU case yet.
Could you help complete it by following the corresponding logic in BertFinetuneCell?
We really appreciate your contribution.
The latest version of MindSpore, 1.2.0, provides the general methods start_overflow_check and get_overflow_status to check for overflow; they are already used in BERT pre-training.
https://github.com/mindspore-ai/mindspore/blob/68f49c5ff1920ae30680c185d983ba1962cae9b0/model_zoo/official/nlp/bert/src/bert_for_pre_training.py#L336
You may have a try using this method as well.
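For readers unfamiliar with overflow checks, the idea behind them is independent of the device: inspect the gradients for inf or NaN and skip the weight update when any are found. The sketch below shows that idea in plain Python only. It does not use the MindSpore operators or the start_overflow_check and get_overflow_status methods, and has_overflow and apply_step are hypothetical names.

```python
import math


def has_overflow(gradients):
    """Return True if any gradient value is inf or NaN."""
    return any(not math.isfinite(g) for grad in gradients for g in grad)


def apply_step(weights, gradients, loss_scale, lr=0.1):
    """Unscale the gradients and update the weights.

    The update is skipped when an overflow is detected, which is the usual
    behaviour of loss-scaled training.
    """
    if has_overflow(gradients):
        return weights, True  # weights unchanged, overflow reported
    updated = [
        [w - lr * g / loss_scale for w, g in zip(w_row, g_row)]
        for w_row, g_row in zip(weights, gradients)
    ]
    return updated, False


weights = [[1.0, 2.0], [3.0]]
ok_weights, ok_flag = apply_step(weights, [[2.0, 4.0], [8.0]], loss_scale=2.0)
bad_weights, bad_flag = apply_step(
    weights, [[float("inf"), 0.0], [1.0]], loss_scale=2.0)
```

In the real cells the check runs inside the compiled graph, which is why the choice of operator depends on the device target.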
We solved the error by adjusting BertSquadCell according to BertFinetuneCell. If you would like to see the difference, head over to https://github.com/kungfu-ml/mindspore-bert/blob/bert/model_zoo/official/nlp/bert/src/bert_for_finetune.py
We are making progress but encountered a new error.
[ERROR] KERNEL(989,python):2021-05-31-08:57:03.687.960 [mindspore/ccsrc/backend/kernel_compiler/gpu/nn/softmax_gpu_kernel.h:222] InitSizeByAxisLastDim] Input is 3-D, but axis(1) is invalid.
Traceback (most recent call last):
File "/home/marcel/Mindspore/bert_mindspore/scripts/../run_squad.py", line 218, in <module>
run_squad()
File "/home/marcel/Mindspore/bert_mindspore/scripts/../run_squad.py", line 183, in run_squad
do_train(ds, netwithloss, load_pretrain_checkpoint_path, save_finetune_checkpoint_path, epoch_num)
File "/home/marcel/Mindspore/bert_mindspore/scripts/../run_squad.py", line 82, in do_train
model.train(epoch_num, dataset, callbacks=callbacks)
File "/home/marcel/Mindspore/p3venv/lib/python3.7/site-packages/mindspore/train/model.py", line 592, in train
sink_size=sink_size)
File "/home/marcel/Mindspore/p3venv/lib/python3.7/site-packages/mindspore/train/model.py", line 391, in _train
self._train_dataset_sink_process(epoch, train_dataset, list_callback, cb_params, sink_size)
File "/home/marcel/Mindspore/p3venv/lib/python3.7/site-packages/mindspore/train/model.py", line 452, in _train_dataset_sink_process
outputs = self._train_network(*inputs)
File "/home/marcel/Mindspore/p3venv/lib/python3.7/site-packages/mindspore/nn/cell.py", line 331, in __call__
out = self.compile_and_run(*inputs)
File "/home/marcel/Mindspore/p3venv/lib/python3.7/site-packages/mindspore/nn/cell.py", line 588, in compile_and_run
self.compile(*inputs)
File "/home/marcel/Mindspore/p3venv/lib/python3.7/site-packages/mindspore/nn/cell.py", line 575, in compile
_executor.compile(self, *inputs, phase=self.phase, auto_parallel_mode=self._auto_parallel_mode)
File "/home/marcel/Mindspore/p3venv/lib/python3.7/site-packages/mindspore/common/api.py", line 502, in compile
result = self._executor.compile(obj, args_list, phase, use_vm)
RuntimeError: mindspore/ccsrc/backend/kernel_compiler/gpu/nn/softmax_gpu_kernel.h:222 InitSizeByAxisLastDim] Input is 3-D, but axis(1) is invalid.
Thanks a lot for helping us!
This seems to be an internal error in the GPU Softmax kernel. As far as I know, support for an axis other than -1 in the GPU Softmax operator is still under development.
You can apply Transpose before and after Softmax in the network, like this:
self.perm = (0, 2, 1)
self.transpose = P.Transpose()
self.softmax = P.Softmax(axis=-1)
...
...
x_transpose = self.transpose(x, self.perm)
y_transpose = self.softmax(x_transpose)
y = self.transpose(y_transpose, self.perm)
In this case, I think the softmax op in question may be the activation in BertSquadModel:
https://github.com/mindspore-ai/mindspore/blob/aeaf2d14b35345d79df4fe92ca7d76e79457d5a5/model_zoo/official/nlp/bert/src/finetune_eval_model.py#L74
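The Transpose workaround relies on the fact that a softmax over axis 1 gives the same result as swapping the last two axes, taking the softmax over the last axis, and swapping back. The sketch below checks that equivalence on nested Python lists; it is an illustration only and does not use MindSpore, and all function names are hypothetical.

```python
import math


def softmax_last(rows):
    """Softmax over the last axis of a 2-D nested list."""
    out = []
    for row in rows:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def transpose_021(x):
    """Swap the last two axes of a 3-D nested list, i.e. perm = (0, 2, 1)."""
    return [[list(col) for col in zip(*mat)] for mat in x]


def softmax_axis1(x):
    """Reference: softmax over axis 1, computed column by column."""
    out = []
    for mat in x:
        n_rows, n_cols = len(mat), len(mat[0])
        res = [[0.0] * n_cols for _ in range(n_rows)]
        for k in range(n_cols):
            col = [mat[j][k] for j in range(n_rows)]
            peak = max(col)
            exps = [math.exp(v - peak) for v in col]
            total = sum(exps)
            for j in range(n_rows):
                res[j][k] = exps[j] / total
        out.append(res)
    return out


def softmax_via_transpose(x):
    """Workaround: transpose, softmax over the last axis, transpose back."""
    y_transposed = [softmax_last(mat) for mat in transpose_021(x)]
    return transpose_021(y_transposed)


x = [[[1.0, 2.0], [3.0, 0.5], [0.0, -1.0]]]  # shape (1, 3, 2)
direct = softmax_axis1(x)
via_transpose = softmax_via_transpose(x)
```

Both results have the input's shape, and each column along axis 1 sums to 1.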
We modified the code according to your template. It looks like this now: https://github.com/kungfu-ml/mindspore-bert/commit/c35d3327d7f92ce194c9849fd76a1202b30651ce
The error changed slightly.
RuntimeError: mindspore/ccsrc/backend/kernel_compiler/gpu/nn/softmax_grad_gpu_kernel.h:128 Init] Input is 3-D, but softmax grad only supports 2-D inputs.
:joy:
Then you may have to reshape the 3-D input into a 2-D input using Reshape, or Squeeze and ExpandDims. For example:
self.perm = (0, 2, 1)
self.transpose = P.Transpose()
self.softmax = P.Softmax(axis=-1)
...
...
x_transpose = self.transpose(x, self.perm)
x_shape = F.shape(x_transpose)
x_reshape = F.reshape(x_transpose, (x_shape[0] * x_shape[1], -1))
y_reshape = self.softmax(x_reshape)
y_transpose = F.reshape(y_reshape, x_shape)
y = self.transpose(y_transpose, self.perm)
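The reshape step works because a softmax over the last axis treats every row independently, so merging the leading axes into one does not change any value. The following plain-Python sketch illustrates this on nested lists; it is not MindSpore code and the function names are hypothetical.

```python
import math


def softmax_2d(rows):
    """Softmax over the last axis of a 2-D nested list."""
    out = []
    for row in rows:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def softmax_last_axis_3d(x):
    """Softmax over the last axis of a 3-D nested list through a 2-D view."""
    batch, inner = len(x), len(x[0])
    flat = [row for mat in x for row in mat]   # shape (batch * inner, last)
    flat_out = softmax_2d(flat)                # the kernel only sees 2-D data
    # Restore the original (batch, inner, last) shape.
    return [flat_out[b * inner:(b + 1) * inner] for b in range(batch)]


x = [[[0.0, 0.0], [1.0, 3.0]], [[2.0, 2.0], [5.0, -5.0]]]  # shape (2, 2, 2)
y = softmax_last_axis_3d(x)
```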
We get the same error when doing the transpose. Note, though, that BERT SQuAD uses log_softmax instead of softmax. Maybe there is an implementation difference. Is there any way you could fix the BertSquadCell and send us code that works?
class BertSquadModel(nn.Cell):
    '''
    This class is responsible for SQuAD
    '''
    def __init__(self, config, is_training, num_labels=2, dropout_prob=0.0, use_one_hot_embeddings=False):
        super(BertSquadModel, self).__init__()
        if not is_training:
            config.hidden_dropout_prob = 0.0
            config.hidden_probs_dropout_prob = 0.0
        self.bert = BertModel(config, is_training, use_one_hot_embeddings)
        self.weight_init = TruncatedNormal(config.initializer_range)
        self.dense1 = nn.Dense(config.hidden_size, num_labels, weight_init=self.weight_init,
                               has_bias=True).to_float(config.compute_type)
        self.num_labels = num_labels
        self.dtype = config.dtype
        self.log_softmax = P.LogSoftmax(axis=-1)
        self.is_training = is_training

    def construct(self, input_ids, input_mask, token_type_id):
        sequence_output, _, _ = self.bert(input_ids, token_type_id, input_mask)
        batch_size, seq_length, hidden_size = P.Shape()(sequence_output)
        sequence = P.Reshape()(sequence_output, (-1, hidden_size))
        logits = self.dense1(sequence)
        logits = P.Cast()(logits, self.dtype)
        logits = P.Reshape()(logits, (batch_size, seq_length, self.num_labels))
        logits = P.Transpose()(logits, (0, 2, 1))
        logits = self.log_softmax(logits)
        logits = P.Transpose()(logits, (0, 2, 1))
        return logits
I tried this BertSquadModel on MindSpore 1.2 and got a loss successfully.
We get the same error with Mindspore v1.2.0.
We needed to modify the file bert_for_finetune.py so that it runs on a GPU. Did you use a GPU?
Could you provide us with the data needed by the run_squad.sh script? That would rule out the possibility that we are using the wrong files.
We use the same data as https://deepai.org/dataset/squad1-1-dev
$ md5sum *
3e85deb501d4e538b6bc56f786231552 *dev-v1.1.json
981b29407e0affa3b1b156f72073b945 *train-v1.1.json
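To compare local copies against the checksums listed above without the md5sum tool, something like the following sketch could be used. It relies only on the Python standard library; md5_of_file and verify_files are hypothetical helper names, and the expected values are copied from the md5sum output in this comment.

```python
import hashlib
import os

# Checksums as posted in this thread.
EXPECTED_MD5 = {
    "dev-v1.1.json": "3e85deb501d4e538b6bc56f786231552",
    "train-v1.1.json": "981b29407e0affa3b1b156f72073b945",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_files(directory, expected=EXPECTED_MD5):
    """Map each expected file name to True or False; missing files are False."""
    result = {}
    for name, checksum in expected.items():
        path = os.path.join(directory, name)
        result[name] = os.path.isfile(path) and md5_of_file(path) == checksum
    return result
```

Calling verify_files on the directory that holds the two JSON files returns a dictionary with one boolean per file.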
@CaitinZhao Could you please provide a fixed branch ready to finetune with GPU?
@Vincent34 Are you using https://github.com/google-research/bert/blob/eedf5716ce1268e56f0a50264a88cafad334ac61/run_squad.py to create the tfrecord file from it? If so, we use the same data, which would also mean that we are out of ideas as to what the issue could be.
@CaitinZhao It would be very nice if you could do that.
I pushed my code to Gitee; you can download it and give it a try. https://gitee.com/zhao_ting_v/mindspore/tree/bert/ @marwage
You can check this for the modification. https://gitee.com/zhao_ting_v/mindspore/commit/d17b439b2e3dd33c262b5e4fea3ee0131b0163b5
Yes, I use run_squad.py from Google to create the tfrecord.
Good news! It is working for us. It's not working with version 1.1.0 but with version 1.2.0. We will try to build our project against version 1.2.0. Thank you for your help!
One outstanding issue though is that the pre-trained model is in Chinese. Do you have a pre-trained model in English?
Update
Substituting your finetune_eval_model.py lets us run the fine-tuning with version 1.1.0.
Sorry, we have not provided a pre-trained model in English yet.
But I have written a checkpoint converter myself that can transfer weights from a Google pre-trained model to a MindSpore one. Would this help you?
https://gist.github.com/Vincent34/b1300463453d7433f1dfe9494d5cdf7e
That would work just fine.
ms2tf_config.py is basically just a dictionary. Do you have the corresponding script that translates the tf model to a ms model?
You can just pass the argument transfer_option with the value tf2ms when running ms_and_tf_checkpoint_transfer_tools.py.
Where can we find the ms_and_tf_checkpoint_transfer_tools.py script? It does not seem to be part of Mindspore's repository.
It's part of my gist. There are two scripts in
Oh, sorry! I did not see the second file...
Unfortunately, the checkpoint from https://storage.googleapis.com/cloud-tpu-checkpoints/bert/keras_bert/uncased_L-12_H-768_A-12.tar.gz does not work with your script. The exact error is:
Traceback (most recent call last):
File "ms_and_tf_checkpoint_transfer_tools.py", line 129, in <module>
main()
File "ms_and_tf_checkpoint_transfer_tools.py", line 123, in main
convert_tf_2_ms(tf_ckpt_path, ms_ckpt_path, new_ckpt_path)
File "ms_and_tf_checkpoint_transfer_tools.py", line 76, in convert_tf_2_ms
data = tf.train.load_variable(tf_ckpt_path, tf_name)
File "/home/marcel/Mindspore/kf-ms-venv/lib/python3.7/site-packages/tensorflow/python/training/checkpoint_utils.py", line 85, in load_variable
reader = load_checkpoint(ckpt_dir_or_file)
File "/home/marcel/Mindspore/kf-ms-venv/lib/python3.7/site-packages/tensorflow/python/training/checkpoint_utils.py", line 67, in load_checkpoint
"given directory %s" % ckpt_dir_or_file)
ValueError: Couldn't find 'checkpoint' file or checkpoints in given directory /home/marcel/Mindspore/convert_model/uncased_L-24_H-1024_A-16
We looked at the Tensorflow BERT code. We suspect that we need to do something like
with strategy.scope():
    # Prediction always uses float32, even if training uses mixed precision.
    tf.keras.mixed_precision.set_global_policy('float32')
    squad_model, _ = bert_models.squad_model(
        bert_config,
        input_meta_data['max_seq_length'],
        hub_module_url=FLAGS.hub_module_url)
if checkpoint_path is None:
    checkpoint_path = tf.train.latest_checkpoint(FLAGS.model_dir)
logging.info('Restoring checkpoints from %s', checkpoint_path)
checkpoint = tf.train.Checkpoint(model=squad_model)
checkpoint.restore(checkpoint_path).expect_partial()
return squad_model
We cannot make it work though. Could you have a look?
It should work like this:
python ms_and_tf_checkpoint_transfer_tools.py \
--tf_ckpt_path=uncased_L-24_H-1024_A-16/bert_model.ckpt \
--ms_ckpt_path=bert_ms.ckpt \
--new_ckpt_path=bert_new.ckpt \
--transfer_option=tf2ms
I guess you passed only the directory path, without the checkpoint name. A TF checkpoint should contain three files:
bert_model.ckpt.index
bert_model.ckpt.data-00000-of-00001
bert_model.ckpt.meta
The checkpoint name is bert_model.ckpt, the part these files share before their extra suffixes.
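The difference between a checkpoint directory and a checkpoint prefix can be checked up front. The sketch below uses only the standard library and assumes that the .index file and at least one .data shard are what a reader needs; looks_like_checkpoint_prefix is a hypothetical helper, not a TensorFlow function.

```python
import glob
import os


def checkpoint_files(prefix):
    """List the files that share a checkpoint prefix such as bert_model.ckpt."""
    return sorted(glob.glob(glob.escape(prefix) + ".*"))


def looks_like_checkpoint_prefix(prefix):
    """Return True if prefix has an .index file and at least one .data shard.

    Assumption for this sketch: these two are sufficient to read variables;
    the .meta file is not checked.
    """
    base = os.path.basename(prefix)
    names = [os.path.basename(path) for path in checkpoint_files(prefix)]
    has_index = base + ".index" in names
    has_data = any(name.startswith(base + ".data-") for name in names)
    return has_index and has_data
```

Passing the bare directory returns False, while passing the directory joined with bert_model.ckpt returns True when the files listed above are present.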
Thank you!
Now the error is
Traceback (most recent call last):
File "ms_and_tf_checkpoint_transfer_tools.py", line 129, in <module>
main()
File "ms_and_tf_checkpoint_transfer_tools.py", line 123, in main
convert_tf_2_ms(tf_ckpt_path, ms_ckpt_path, new_ckpt_path)
File "ms_and_tf_checkpoint_transfer_tools.py", line 77, in convert_tf_2_ms
ms_shape = ms_param_dict[ms_name].data.shape
KeyError: 'bert.bert.bert_embedding_postprocessor.token_type_embedding.embedding_table'
Is the following Mindspore model wrong? https://download.mindspore.cn/model_zoo/r1.1/bertbase_ascend_v111_zhwiki_offical_nlp_bs256_loss3/bertbase_ascend_v111_zhwiki_offical_nlp_bs256_loss3.7.ckpt
My script converts checkpoints of version v1.2, whose weight names differ slightly from those of v1.1.
We switched to the built-in Embedding in version 1.2, so the embedding weight names changed.
Weight names, 1.1 vs. 1.2:
"bert.bert.bert_embedding_postprocessor.embedding_table": "bert.bert.bert_embedding_postprocessor.token_type_embedding.embedding_table"
"bert.bert.bert_embedding_postprocessor.full_position_embeddings": "bert.bert.bert_embedding_postprocessor.full_position_embedding.embedding_table"
You could modify the names in ms2tf_config.py directly, or just run the pre-training job to get a checkpoint from version 1.2. The saved weights do not matter, since they will be overwritten anyway.
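The first option, renaming the entries in the dictionary, could be scripted along the following lines. This is a sketch under assumptions: it presumes the dictionary in ms2tf_config.py maps MindSpore names to TensorFlow names, the TensorFlow names in the example are placeholders for illustration, and use_v1_1_names is a hypothetical helper.

```python
# Weight names that changed between MindSpore 1.1 and 1.2, as listed above.
V1_1_TO_V1_2 = {
    "bert.bert.bert_embedding_postprocessor.embedding_table":
        "bert.bert.bert_embedding_postprocessor.token_type_embedding.embedding_table",
    "bert.bert.bert_embedding_postprocessor.full_position_embeddings":
        "bert.bert.bert_embedding_postprocessor.full_position_embedding.embedding_table",
}


def use_v1_1_names(ms2tf):
    """Return a copy of the mapping whose keys use the v1.1 weight names."""
    v1_2_to_v1_1 = {new: old for old, new in V1_1_TO_V1_2.items()}
    return {
        v1_2_to_v1_1.get(ms_name, ms_name): tf_name
        for ms_name, tf_name in ms2tf.items()
    }


# Illustrative mapping; the values stand in for whatever the real file holds.
example = {
    "bert.bert.bert_embedding_postprocessor.token_type_embedding.embedding_table":
        "tf_token_type_embeddings_placeholder",
    "bert.bert.some_unchanged_weight": "tf_unchanged_placeholder",
}
converted = use_v1_1_names(example)
```

Entries whose names did not change between the two versions pass through untouched.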
Thanks! The conversion worked after also deleting the layers in the dictionary that are not part of BERT-base.
Unfortunately, it seems that a transpose needs to happen at some point.
RuntimeError: Net parameters bert.bert.bert_embedding_lookup.embedding_table shape((30522, 768)) different from parameter_dict's((768, 30522))
That's weird.
Maybe you can fix it by applying a transpose directly after load_checkpoint in run_squad.py. You can then use save_checkpoint to save a fixed checkpoint.
I have figured it out.
Did you just modify the weight names in ms2tf_config.py and still pass the pre-trained zh-wiki checkpoint from MindSpore as the ms_ckpt_path?
The embedding shapes for zh and en are different, because the vocab_size of zh is 21128 while the vocab_size of en is 30522. That makes the following shape check fail, which results in a superfluous transpose. https://gist.github.com/Vincent34/b1300463453d7433f1dfe9494d5cdf7e#file-ms_and_tf_checkpoint_transfer_tools-py-L81
Maybe you can solve this by adding more conditions to that shape check.
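One way to tighten such a shape check is to transpose only when the reversed TensorFlow shape matches the MindSpore shape exactly, and to fail loudly otherwise. The following is a sketch of that idea, not the code from the gist; decide_layout is a hypothetical name.

```python
def decide_layout(tf_shape, ms_shape):
    """Decide how a TF weight maps onto a MindSpore parameter.

    Returns "same" when the shapes already agree, "transpose" when a 2-D
    weight matches after swapping its axes, and raises otherwise, so that a
    vocabulary-size mismatch is reported instead of silently transposed.
    """
    tf_shape, ms_shape = tuple(tf_shape), tuple(ms_shape)
    if tf_shape == ms_shape:
        return "same"
    if len(tf_shape) == 2 and tf_shape[::-1] == ms_shape:
        return "transpose"
    raise ValueError(
        "incompatible shapes: TF %s vs MindSpore %s" % (tf_shape, ms_shape))
```

With the shapes from this thread, an English embedding table of (30522, 768) against a Chinese one of (21128, 768) raises an error rather than producing a transposed (768, 30522) weight.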
Thank you! You gave us the right hint.
We were using a pre-trained Mindspore model.
The trick was to adjust the seq_length in the file src/config.py.
Thank you for all your help! We can close this issue now :)
Hi @marwage, I am working on BERT for MindSpore as well and am interested in your findings. Were you able to reach the SQuAD 1.1 accuracy reported in the BERT paper? Do you think you could share your code?
Hi @amanwalia123, you can find our code at https://github.com/kungfu-ml/mindspore-bert . So far, we have run the experiments for only one epoch, so I cannot say anything about the accuracy.
It is really helpful that you shared the code. I really appreciate it. If possible, could you share your findings once it is finished?
Hi Mindspore team,
We would like to fine-tune BERT with the Squad v1.1 dataset. Therefore, we run the following script
model_zoo/official/nlp/bert/scripts/run_squad.sh
Unfortunately, we get the following error
ValueError: For 'TensorAdd', the x_shape [32, 384, 768] and y_shape [1, 128, 768] can not broadcast.
To us, it looks like the sequence length of Squad is 384 and the pre-trained model was trained with a sequence length 128. We used the pre-trained model from https://download.mindspore.cn/model_zoo/r1.1/bertbase_ascend_v111_zhwiki_offical_nlp_bs256_loss3/bertbase_ascend_v111_zhwiki_offical_nlp_bs256_loss3.7.ckpt Additionally, the pre-trained model was trained in Chinese whereas Squad is in English.
Could you provide us with a pre-trained BERT model that works with the Squad v1.1 dataset?
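The TensorAdd error quoted in this post follows from the usual broadcasting rule: shapes are aligned from the right, and each pair of dimensions must be equal or one of them must be 1. The sketch below applies that rule to the two shapes from the error message; can_broadcast is a hypothetical helper and assumes NumPy-style broadcasting semantics.

```python
def can_broadcast(x_shape, y_shape):
    """Check two shapes against the NumPy-style broadcasting rule."""
    for x_dim, y_dim in zip(reversed(x_shape), reversed(y_shape)):
        if x_dim != y_dim and x_dim != 1 and y_dim != 1:
            return False
    return True


# Shapes from the error: 384 tokens of input against 128 position embeddings.
mismatch = can_broadcast([32, 384, 768], [1, 128, 768])
# Once both sides use the same sequence length, the addition is valid.
matched = can_broadcast([32, 384, 768], [1, 384, 768])
```

Here the middle dimensions 384 and 128 differ and neither is 1, so the first check fails while the second passes.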