Closed nashid closed 2 years ago
@jiang719 we get the following error while training:
/XXXXXXX/python3.8/site-packages/numpy/core/_methods.py:190: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
Traceback (most recent call last):
  File "src/trainer/gpt_conut_trainer.py", line 247, in <module>
    trainer.train(model_id, epochs, hyper_parameter, save_dir=os.path.abspath(os.path.join(GPT_CONUT_TRAINER_DIR, '..', '..', 'data/models/')))
  File "src/trainer/gpt_conut_trainer.py", line 213, in train
    self.validate_and_save(model_id, save_dir)
  File "src/trainer/gpt_conut_trainer.py", line 125, in validate_and_save
    torch.save(checkpoint, save_dir + '/' + 'gpt_conut_' + str(model_id) + '.pt')
  File "/XXXXXXX/lib/python3.8/site-packages/torch/serialization.py", line 379, in save
    _save(obj, opened_zipfile, pickle_module, pickle_protocol)
  File "/XXXXXXX/lib/python3.8/site-packages/torch/serialization.py", line 601, in _save
    storage = storage.cpu()
  File "/XXXXXXX/lib/python3.8/site-packages/torch/storage.py", line 112, in cpu
    return torch._UntypedStorage(self.size()).copy_(self, False)
RuntimeError: CUDA error: device-side assert triggered
Which CUDA version did you use?
We have tried both CUDA 10.0 and CUDA 11.3, and both work.
The listed dependencies do not specify a CUDA version. Which CUDA version did you use?
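For anyone hitting the same traceback, here is a minimal debugging sketch. It is not from the repository. CUDA errors are reported asynchronously, so a device-side assert raised by an earlier kernel can surface at a later call such as `torch.save`. Setting `CUDA_LAUNCH_BLOCKING=1` before importing torch usually makes the traceback point at the real call site. The helper names `cuda_build_info` and `find_out_of_range_ids` are hypothetical. The out-of-range index check covers only one common cause of this assert, and nothing in this thread confirms it is the cause here.

```python
import os

# Make CUDA kernel launches synchronous so the Python traceback points at the
# failing operation. This must be set before torch initializes CUDA.
os.environ.setdefault("CUDA_LAUNCH_BLOCKING", "1")

try:
    import torch
except ImportError:  # allows the pure-Python helper below to run without torch
    torch = None


def cuda_build_info():
    """Report the torch version and the CUDA version torch was built against."""
    if torch is None:
        return None
    return {
        "torch": torch.__version__,
        "cuda": torch.version.cuda,  # None for CPU-only builds
        "cuda_available": torch.cuda.is_available(),
    }


def find_out_of_range_ids(token_ids, vocab_size):
    """Return (position, id) pairs for ids outside [0, vocab_size).

    Assumption: an index past the end of an embedding table is one frequent
    trigger of 'device-side assert triggered'; it may not be the cause here.
    """
    return [(i, t) for i, t in enumerate(token_ids) if not 0 <= t < vocab_size]


if __name__ == "__main__":
    print(cuda_build_info())
    print(find_out_of_range_ids([0, 5, 10, -1], vocab_size=10))
```

Running the check on each batch before it is moved to the GPU, or rerunning the failing step on CPU, tends to give a clearer error message than the device-side assert.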