Open codybai opened 4 years ago
I ran into the same error. I wonder if this has been solved? Thanks.
You can downgrade PyTorch to 1.4.0, which works fine for me (hint: you may have to downgrade your CUDA toolkit as well).
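If you go this route, a quick sanity check after reinstalling; this is just a hedged sketch, and the values in the comments are what I would expect for the 1.4.0 / CUDA 10.1 combination mentioned here, not guaranteed output:
import torch

# Confirm the downgraded build is the one actually being imported.
print(torch.__version__)          # expect something like '1.4.0'
print(torch.version.cuda)         # CUDA toolkit the binary was built against, e.g. '10.1'
print(torch.cuda.is_available())  # make sure the GPU is still visible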
I can confirm the downgrade works with the following environment:
name: pt1.4
channels:
- pytorch
- salilab
- conda-forge
- bioconda
- defaults
dependencies:
- _libgcc_mutex=0.1=conda_forge
- _openmp_mutex=4.5=1_llvm
- ca-certificates=2020.12.5=ha878542_0
- certifi=2020.12.5=py38h578d9bd_1
- cudatoolkit=10.1.243=h036e899_8
- freetype=2.10.4=h0708190_1
- jpeg=9d=h36c2ea0_0
- lcms2=2.12=hddcbb42_0
- ld_impl_linux-64=2.35.1=hea4e1c9_2
- libblas=3.9.0=8_openblas
- libcblas=3.9.0=8_openblas
- libffi=3.3=h58526e2_2
- libgcc-ng=9.3.0=h2828fa1_18
- libgfortran-ng=9.3.0=hff62375_18
- libgfortran5=9.3.0=hff62375_18
- liblapack=3.9.0=8_openblas
- libopenblas=0.3.12=pthreads_h4812303_1
- libpng=1.6.37=h21135ba_2
- libstdcxx-ng=9.3.0=h6de172a_18
- libtiff=4.2.0=hdc55705_0
- libwebp-base=1.2.0=h7f98852_2
- llvm-openmp=11.1.0=h4bd325d_0
- lz4-c=1.9.3=h9c3ff4c_0
- mkl=2020.4=h726a3e6_304
- ncurses=6.2=h58526e2_4
- ninja=1.10.2=h4bd325d_0
- numpy=1.20.1=py38h18fd61f_0
- olefile=0.46=pyh9f0ad1d_1
- openssl=1.1.1j=h7f98852_0
- pillow=8.1.2=py38ha0e1e83_0
- pip=21.0.1=pyhd8ed1ab_0
- python=3.8.8=hffdb5ce_0_cpython
- python_abi=3.8=1_cp38
- pytorch=1.4.0=py3.8_cuda10.1.243_cudnn7.6.3_0
- readline=8.0=he28a2e2_2
- setuptools=49.6.0=py38h578d9bd_3
- six=1.15.0=pyh9f0ad1d_0
- sqlite=3.35.2=h74cdb3f_0
- tk=8.6.10=h21135ba_1
- torchvision=0.5.0=py38_cu101
- wheel=0.36.2=pyhd3deb0d_0
- xz=5.2.5=h516909a_1
- zlib=1.2.11=h516909a_1010
- zstd=1.4.9=ha95c52a_0
With more than one GPU (otherwise you will get a division-by-zero error), and with line 754 of mem_transformer.py changed to
loss = self.crit(pred_hid.reshape(-1, pred_hid.size(-1)), target.reshape(-1))
(i.e. using reshape instead of view), the command
bash run_wt103_base.sh train --work_dir TRAIN_wt103
runs through.
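For context, a small standalone sketch of why reshape succeeds where view fails; the tensor here is just an illustration, not the actual pred_hid:
import torch

# .view() only works when the memory layout is compatible;
# .reshape() falls back to a copy when necessary.
x = torch.arange(12).reshape(3, 4).t()   # transposing makes x non-contiguous
try:
    x.view(-1)
except RuntimeError as e:
    print("view failed:", e)
print(x.reshape(-1))                     # reshape copies the data and succeeds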
When running bash run_wt103_base.sh train --work_dir TRAIN_wt103, the same problem happens to me as well.
The PyTorch version is 1.12, the GPU is a 3090, and the CUDA version is 11.3.
One solution that works for me is as follows:
(1) Define a dummy parameter in the __init__ method of the MemTransformerLM class at line 495:
self.null = nn.Parameter(torch.tensor(0.0))
(2) replace the init_mems(self) function with
def init_mems(self):
    if self.mem_len > 0:
        mems = []
        for i in range(self.n_layer + 1):
            # empty = torch.empty(0, dtype=param.dtype, device=param.device)
            empty = torch.empty(0, dtype=self.null.dtype, device=self.null.device)
            mems.append(empty)
        return mems
    else:
        return None
(3) Change line 754 from loss = self.crit(pred_hid.view(-1, pred_hid.size(-1)), target.view(-1)) to loss = self.crit(pred_hid.reshape(-1, pred_hid.size(-1)), target.reshape(-1)).
Note that all the changes are made in mem_transformer.py.
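Putting steps (1) and (2) together, a minimal self-contained sketch; TinyMemModel and its hyper-parameters are placeholders for illustration, not the actual MemTransformerLM:
import torch
import torch.nn as nn

class TinyMemModel(nn.Module):
    """Placeholder module illustrating the workaround, not the real MemTransformerLM."""
    def __init__(self, n_layer=3, mem_len=4):
        super().__init__()
        self.n_layer = n_layer
        self.mem_len = mem_len
        # Step (1): dummy parameter that provides a dtype/device reference
        # without relying on next(self.parameters()).
        self.null = nn.Parameter(torch.tensor(0.0))

    # Step (2): build the empty memory tensors from self.null.
    def init_mems(self):
        if self.mem_len > 0:
            mems = []
            for _ in range(self.n_layer + 1):
                empty = torch.empty(0, dtype=self.null.dtype, device=self.null.device)
                mems.append(empty)
            return mems
        else:
            return None

mems = TinyMemModel().init_mems()
print(len(mems), mems[0].dtype, mems[0].device)   # 4 torch.float32 cpu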
Thanks a lot for this solution. I also ran into the same bug with pytorch=2.0.0, and this solution works well for me.
nn.ParameterList is used in the code, which seems to be incompatible with nn.DataParallel: the parameter list appears empty on the replicas. I think this is the problem.
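A hedged probe one could run to check this hypothesis; ListProbe is a made-up toy module (not from the repo), and it needs at least two GPUs for DataParallel to actually replicate anything:
import torch
import torch.nn as nn

class ListProbe(nn.Module):
    """Toy module to check how nn.ParameterList is handled under nn.DataParallel."""
    def __init__(self):
        super().__init__()
        self.plist = nn.ParameterList([nn.Parameter(torch.zeros(3)) for _ in range(2)])

    def forward(self, x):
        # Report the list length seen by whichever replica processes this chunk;
        # on PyTorch >= 1.5 the replicas may report 0 entries.
        return torch.full((x.size(0), 1), float(len(self.plist)), device=x.device)

if torch.cuda.device_count() >= 2:
    model = nn.DataParallel(ListProbe().cuda())
    print(model(torch.zeros(4, 1).cuda()))
else:
    print("Need at least 2 GPUs to observe the DataParallel replication behaviour.")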