k2-fsa / icefall

https://k2-fsa.github.io/icefall/
Apache License 2.0

Error while decoding #1288

Closed SSwethaSel0609 closed 1 year ago

SSwethaSel0609 commented 1 year ago

2023-10-02 04:44:42,617 INFO [zipformer.py:178] At encoder stack 4, which has downsampling_factor=2, we will combine the outputs of layers 1 and 3, with downsampling_factors=2 and 8.
2023-10-02 04:44:42,626 INFO [decode.py:917] Calculating the averaged model over epoch range from 5 (excluded) to 30
Traceback (most recent call last):
  File "./pruned_transducer_stateless7/decode.py", line 1015, in <module>
    main()
  File "/mnt/efs/swetha/icefall_env_swe/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "./pruned_transducer_stateless7/decode.py", line 922, in main
    model.load_state_dict(
  File "/mnt/efs/swetha/icefall_env_swe/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1667, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for Transducer:
    size mismatch for decoder.embedding.weight: copying a param with shape torch.Size([500, 512]) from checkpoint, the shape in current model is torch.Size([250, 512]).
    size mismatch for joiner.output_linear.weight: copying a param with shape torch.Size([500, 512]) from checkpoint, the shape in current model is torch.Size([250, 512]).
    size mismatch for joiner.output_linear.bias: copying a param with shape torch.Size([500]) from checkpoint, the shape in current model is torch.Size([250]).
    size mismatch for simple_am_proj.weight: copying a param with shape torch.Size([500, 384]) from checkpoint, the shape in current model is torch.Size([250, 384]).
    size mismatch for simple_am_proj.bias: copying a param with shape torch.Size([500]) from checkpoint, the shape in current model is torch.Size([250]).
    size mismatch for simple_lm_proj.weight: copying a param with shape torch.Size([500, 512]) from checkpoint, the shape in current model is torch.Size([250, 512]).
    size mismatch for simple_lm_proj.bias: copying a param with shape torch.Size([500]) from checkpoint, the shape in current model is torch.Size([250]).

csukuangfj commented 1 year ago

Please show your complete command.

The error indicates that you used different model arguments for decode.py and train.py. Please check the command-line arguments carefully.
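To make the mismatch easier to see, here is a minimal sketch that compares parameter shapes from a checkpoint against those of the freshly built model. The helper name `find_shape_mismatches` and the two dictionaries are hypothetical and only for illustration; the shapes are copied from the error message above, and the code is not part of icefall.

```python
from typing import Dict, List, Tuple

Shape = Tuple[int, ...]


def find_shape_mismatches(
    checkpoint_shapes: Dict[str, Shape],
    model_shapes: Dict[str, Shape],
) -> List[Tuple[str, Shape, Shape]]:
    """Return (name, checkpoint_shape, model_shape) for every parameter
    that appears in both mappings with a different shape."""
    mismatches = []
    for name, ckpt_shape in sorted(checkpoint_shapes.items()):
        model_shape = model_shapes.get(name)
        if model_shape is not None and model_shape != ckpt_shape:
            mismatches.append((name, ckpt_shape, model_shape))
    return mismatches


# Hypothetical inputs: a few shapes copied from the error message above.
checkpoint_shapes = {
    "decoder.embedding.weight": (500, 512),
    "joiner.output_linear.bias": (500,),
    "simple_am_proj.weight": (500, 384),
}
model_shapes = {
    "decoder.embedding.weight": (250, 512),
    "joiner.output_linear.bias": (250,),
    "simple_am_proj.weight": (250, 384),
}

for name, ckpt_shape, model_shape in find_shape_mismatches(
    checkpoint_shapes, model_shapes
):
    print(f"{name}: checkpoint {ckpt_shape} vs model {model_shape}")
```

With PyTorch, such a mapping can be built from a state dict with `{k: tuple(v.shape) for k, v in state_dict.items()}`. In the log above every mismatched dimension is 500 in the checkpoint and 250 in the current model, which is consistent with a different vocabulary size at training and decoding time (for example, a different BPE model); that reading is an inference from the shapes, not something the thread confirms.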

SSwethaSel0609 commented 1 year ago

Yeah, I resolved that error. Now I'm getting a very high word error rate. What should I do to reduce it?

csukuangfj commented 1 year ago

Are you using your own model or our pre-trained model? If you are using your own model, has it converged?

SSwethaSel0609 commented 1 year ago

No, I'm using my own model. I'm training the model again. It was running before, but after I changed the GPU it shows an error like this:
  File "/usr/lib/python3.8/shutil.py", line 675, in _rmtree_safe_fd
    onerror(os.unlink, fullname, sys.exc_info())
  File "/usr/lib/python3.8/shutil.py", line 673, in _rmtree_safe_fd
    os.unlink(entry.name, dir_fd=topfd)
OSError: [Errno 16] Device or resource busy: '.nfsf8aa902a3d86d53e00000bf6'
Traceback (most recent call last):
  File "./pruned_transducer_stateless7/train.py", line 1275, in <module>
    main()
  File "./pruned_transducer_stateless7/train.py", line 1266, in main
    mp.spawn(run, args=(world_size, args), nprocs=world_size, join=True)
  File "/mnt/efs/swetha/icefall_env_swe/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/mnt/efs/swetha/icefall_env_swe/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/mnt/efs/swetha/icefall_env_swe/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/mnt/efs/swetha/icefall_env_swe/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/mnt/efs/swetha/marathi/ds-icefall-scripts/pruned_transducer_stateless7/train.py", line 1145, in run
    train_one_epoch(
  File "/mnt/efs/swetha/marathi/ds-icefall-scripts/pruned_transducer_stateless7/train.py", line 940, in train_one_epoch
    valid_info = compute_validation_loss(
  File "/mnt/efs/swetha/marathi/ds-icefall-scripts/pruned_transducer_stateless7/train.py", line 753, in compute_validation_loss
    loss, loss_info = compute_loss(
  File "/mnt/efs/swetha/marathi/ds-icefall-scripts/pruned_transducer_stateless7/train.py", line 704, in compute_loss
    raise ValueError(
ValueError: There are too many utterances in this batch leading to inf or nan losses.

desh2608 commented 1 year ago

Duplicate of https://github.com/k2-fsa/icefall/issues/1289