Open cahya-wirawan opened 4 years ago
I have some other problems running the notebook CLS-DE.ipynb. If I use conda and install the default PyTorch (1.3.1), then after the command
exp.finetune_lm.train_(cls_dataset, num_epochs=20)
I get the following error message:
ImportError: /tmp/torch_extensions/forget_mult_cuda/forget_mult_cuda.so: undefined symbol: _ZN3c106Symbol14fromQualStringERKSs
Then I installed PyTorch from the pytorch channel as follows:
conda install pytorch=1.3.1 torchvision cudatoolkit=10.0 -c pytorch
The "undefined symbol" error is gone, but the kernel restarted during the first epoch of exp.finetune_lm.train_(cls_dataset, num_epochs=20)
Is this a known problem? The following Python packages may be relevant:
$ conda list | egrep 'torch|^fastai|cuda|nvid'
_pytorch_select    0.2        gpu_0
cudatoolkit        10.0.130   0
cudnn              7.6.5      cuda10.0_0
fastai             1.0.61     1                        fastai
nvidia-ml-py3      7.352.0    py_0                     fastai
pytorch            1.3.1      cuda100py37h53c1284_0
torchvision        0.4.2      cuda100py37hecfc37a_0
Thanks.
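One possible reading of the ImportError above: the trailing `Ss` in the mangled name is the Itanium C++ ABI abbreviation for the old (pre-cxx11) `std::string`, so the cached `forget_mult_cuda.so` may have been compiled against a PyTorch build with a different C++ ABI than the one now installed. The sketch below is only a heuristic illustration, not a confirmed diagnosis. The helper names `describe_string_abi` and `clear_cached_extension` are hypothetical; the second one removes a cached extension directory so that it gets recompiled on the next run.

```python
import shutil
from pathlib import Path


def describe_string_abi(mangled: str) -> str:
    """Heuristic guess at the libstdc++ std::string ABI used by a mangled symbol.

    Assumption: "__cxx11" marks the new ABI, and the "Ss" abbreviation marks
    the old one. This is a rough substring check, not a real demangler.
    """
    if "__cxx11" in mangled:
        return "new (cxx11) ABI"
    if "Ss" in mangled:
        return "old (pre-cxx11) ABI"
    return "no std::string found"


def clear_cached_extension(cache_dir) -> bool:
    """Delete a cached extension build directory; return True if it existed."""
    path = Path(cache_dir)
    if not path.is_dir():
        return False
    shutil.rmtree(path)
    return True


# Symbol copied from the error message in this thread.
symbol = "_ZN3c106Symbol14fromQualStringERKSs"
abi_guess = describe_string_abi(symbol)
print(abi_guess)

# To force a rebuild, one could call (path taken from the error message):
# clear_cached_extension("/tmp/torch_extensions/forget_mult_cuda")
```

If the guess is right, deleting the cached build and rerunning the notebook should trigger a fresh compile against the currently installed PyTorch.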
The kernel restarts stopped after I switched from CUDA 10.0 to CUDA 9.2. It seems the model does not work with the newer CUDA version in this setup. Now the notebook runs properly to the end.
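For anyone comparing setups, a minimal sketch for printing which PyTorch build and CUDA toolkit version the notebook kernel actually uses. The function name `torch_build_report` is hypothetical; it returns a placeholder if PyTorch is not importable in the current environment.

```python
def torch_build_report() -> dict:
    """Collect basic facts about the installed PyTorch build, if any."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment.
        return {"torch": None}
    return {
        "torch": torch.__version__,                   # e.g. the 1.3.1 from this thread
        "cuda_build": torch.version.cuda,             # CUDA version PyTorch was built with
        "cuda_available": torch.cuda.is_available(),  # whether a usable GPU is visible
        "cxx11_abi": torch.compiled_with_cxx11_abi(), # C++ ABI flag of this build
    }


report = torch_build_report()
print(report)
```

Running this inside the same kernel as the notebook shows whether the environment really picked up the intended `cudatoolkit` build.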