mlcommons / training_results_v1.0

This repository contains the results and code for the MLPerf™ Training v1.0 benchmark.
https://mlcommons.org/en/training-normal-10/
Apache License 2.0

Checkpoint to ".pt" conversion fails #7

Open nikhildurgam95 opened 2 years ago

nikhildurgam95 commented 2 years ago

Hello all,

I want to convert a TensorFlow checkpoint model to a PyTorch model in .pt format. I ran:

python convert_tf_checkpoint.py --tf_checkpoint /path/to/model.ckpt-xxxxx.index --bert_config_path /path/to/bert_config.json --output_checkpoint /path/to/model_out.pt

I am getting an undefined symbol error that prevents the conversion:

ImportError: /home/user/.virtualenvs/ai/lib/python3.6/site-packages/fast_self_multihead_attn.cpython-36m-x86_64-linux-gnu.so: undefined symbol: _ZN2at4cuda4blas19_cublasGetErrorEnumE14cublasStatus_t

Is there any solution for this?

Full stack traceback:

Traceback (most recent call last):
  File "convert_tf_checkpoint.py", line 17, in <module>
    from modeling import BertForPretraining, BertConfig
  File "/desktop/user/bert/dell_pytorch_BERT/pytorch/modeling.py", line 37, in <module>
    from apex.contrib.multihead_attn import SelfMultiheadAttn
  File "/home/user/.virtualenvs/ai/lib/python3.6/site-packages/apex/contrib/multihead_attn/__init__.py", line 1, in <module>
    from .self_multihead_attn import SelfMultiheadAttn
  File "/home/user/.virtualenvs/ai/lib/python3.6/site-packages/apex/contrib/multihead_attn/self_multihead_attn.py", line 9, in <module>
    from .fast_self_multihead_attn_func import fast_self_attn_func
  File "/home/user/.virtualenvs/ai/lib/python3.6/site-packages/apex/contrib/multihead_attn/fast_self_multihead_attn_func.py", line 2, in <module>
    import fast_self_multihead_attn
ImportError: /home/user/.virtualenvs/ai/lib/python3.6/site-packages/fast_self_multihead_attn.cpython-36m-x86_64-linux-gnu.so: undefined symbol: _ZN2at4cuda4blas19_cublasGetErrorEnumE14cublasStatus_t
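Editor's note on diagnosing this: the missing symbol can be demangled to see what the compiled apex extension expects from libtorch. An `at::cuda::blas` symbol that is absent at import time usually indicates the apex CUDA extensions were built against a different PyTorch version than the one currently installed, so rebuilding apex in the active environment is the usual remedy. A sketch (assumes `c++filt` from binutils is available; the rebuild commands follow NVIDIA apex's documented source install and are shown as comments, to be adapted to your setup):

```shell
# Demangle the missing symbol to see what the extension expects.
c++filt _ZN2at4cuda4blas19_cublasGetErrorEnumE14cublasStatus_t
# -> at::cuda::blas::_cublasGetErrorEnum(cublasStatus_t)

# A libtorch symbol missing at import time usually means apex's compiled
# extensions no longer match the installed PyTorch. Rebuilding apex
# against the current PyTorch is the usual fix (illustrative commands):
#
# pip uninstall -y apex
# git clone https://github.com/NVIDIA/apex && cd apex
# pip install -v --no-cache-dir \
#     --global-option="--cpp_ext" --global-option="--cuda_ext" \
#     --global-option="--fast_multihead_attn" .
```

The `--fast_multihead_attn` build option is what produces the `fast_self_multihead_attn` extension that fails to import above, so it must be rebuilt along with the core extensions.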

Please share details if anyone has encountered a similar issue and was able to resolve it.

Thank you.