Open trangtv57 opened 2 years ago
Would need more details, e.g. error messages.
I'm attaching the log from the file log/compute_sentence_scores.1.log. Note: I have changed the original compute_sentence_scores.py to a version that computes scores with CUDA. Some additional info I print after this line: https://github.com/kaldi-asr/kaldi/blob/12a2092c887c49fce04360dbf48e43067992e770/egs/wsj/s5/steps/pytorchnn/compute_sentence_scores.py#L192
```
data shape: torch.Size([38, 1153]) target shape: torch.Size([43814]) seq lens shape: torch.Size([1153])

Traceback (most recent call last):
  File "steps/pytorchnn/compute_sentence_scores_cuda.py", line 345, in
```
As the error log shows, I think the problem is that the utterance has too many arc paths (1153), so my GPU doesn't have enough memory to compute this batch size. So I think the solution is to reduce the number of arc paths in lattice-expand to some limit. I know we can change this via the epsilon parameter, but when I tried changing it, it didn't completely fix the problem. Thanks.
It looks to me like the sentence length must be 1153. That is a very long sentence; that might be the issue. Maybe you should segment your data into smaller pieces?
I don't think so. Here is the tensor data for one sample:
```
data shape: torch.Size([3, 2])
data value:
tensor([[30893, 30893],
        [18171, 27414],
        [27414,     0]])
target shape: torch.Size([6])
target value:
tensor([18171, 27414, 27414, 30893, 30893, 0])
seq lens shape: torch.Size([2])
seq lens value:
tensor([3, 2])
```
So I think 1153 is the number of arc paths (the number of hypotheses in the full lattice), and 38 is the max length of those 1153 arc paths. I printed the raw text data and confirmed this. Please correct me if I'm wrong. Thanks, Dan.
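To illustrate how the batch dimensions arise, here is a plain-Python sketch (the real script builds torch tensors) of padding a lattice's n-best paths column-wise into a `[max_len, num_paths]` matrix, using the two paths from the sample above:

```python
# Hypothetical n-best paths from one lattice (word-id sequences of varying length).
paths = [[30893, 18171, 27414],   # length 3
         [30893, 27414]]          # length 2

seq_lens = [len(p) for p in paths]   # true path lengths
max_len = max(seq_lens)

# Pad each path with 0 and stack column-wise: shape [max_len, num_paths],
# the same layout as the [38, 1153] batch seen in the log.
data = [[p[t] if t < len(p) else 0 for p in paths] for t in range(max_len)]

print(data)       # [[30893, 30893], [18171, 27414], [27414, 0]]
print(seq_lens)   # [3, 2]
```

With 1153 paths, every path is padded out to the longest one (38 tokens), so the batch grows with the number of lattice hypotheses, not the utterance length.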
Oh sorry, you're right. You may have to either change the code to split into batches of a smaller size if num-paths is above some threshold, or put some limit on num-paths at the path-generating stage (there might be an option to whatever Kaldi program was used).
Thanks @danpovey. I can do option 1, but I think the best solution is to limit num-paths at the path-generation stage. However, I'm not familiar with the Kaldi code base. Can you suggest what I should change in kaldi/src/latbin/lattice-expand.cc or wherever is appropriate? I imagine it only needs a minor modification to add a variable limiting num-paths per utterance, but I don't know where to make the change.
Thanks
You'd have to change lattice-path-cover.cc, adding an option like max-paths. You'd have to make sure the paths were sorted from best to worst [actually they do seem to be; there is a std::sort in there] before truncating. There also seems to be another problem that needs to be addressed in that program: https://github.com/kaldi-asr/kaldi/issues/4719, which you also created.
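The suggested change to ComputePathCover in lattice-path-cover.cc can be sketched in Python like this (the function name `truncate_path_cover` and the `max_paths` option are illustrative, not existing Kaldi code):

```python
def truncate_path_cover(paths_and_costs, max_paths):
    """Keep only the max_paths best (lowest-cost) paths.

    Mirrors the suggested C++ change: sort best-to-worst first
    (as the std::sort in lattice-path-cover.cc already does),
    then truncate to the hypothetical max_paths limit.

    paths_and_costs: list of (path, total_cost) pairs
    """
    ordered = sorted(paths_and_costs, key=lambda pc: pc[1])
    return ordered[:max_paths]
```

For example, `truncate_path_cover([(pa, 2.0), (pb, 0.5), (pc, 1.0)], 2)` keeps only the two cheapest paths, `pb` then `pc`. Note that, as discussed further down the thread, truncating here can break the rescoring logic, which expects every lattice arc to be covered.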
In my issue #4719, I removed the assertion KALDI_ASSERT(clat.NumStates() > 1); as you suggested, and it works. I understand your idea. I will try some fixes and report later. Once all the issues are fixed, I will make a PR. Thanks.
Hi @danpovey, I have added some code to limit the size of paths_and_costs in the function ComputePathCover (https://github.com/kaldi-asr/kaldi/blob/aefbd096ec0c7f1136f669c99be66ac393afe29c/src/latbin/lattice-path-cover.cc#L174), and the problem with the arc-path size is resolved. But I have another error, in the file nnlmrescore.1.log (error log attached). I know it's not caused by my fix in lattice-path-cover.cc, because an experiment run before this fix also had this error; I just hadn't looked at it closely, so I only realized the problem now. Do you have any suggestions? Thanks.
It's complaining that the key (the utterance-id, or path-id, which is utterance-id-N or something like that) is not present in an input archive. That should not really be an assertion; it should be either an error or a warning, as it's a problem with the input. You could perhaps change the code to print a warning and just output the compact lattice unchanged when that happens.
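The warn-and-pass-through behavior Dan describes could look like this sketch, where plain dicts stand in for the Kaldi archive readers and writers (the real code is C++ using Table readers; all names here are illustrative):

```python
import logging

def rescore_all(lattices, nn_scores, rescore_fn):
    """Rescore each lattice with its neural LM score.

    If a key is absent from the score archive, warn and output the
    compact lattice unchanged instead of hitting an assertion.

    lattices:   dict key -> lattice (stand-in for an input archive)
    nn_scores:  dict key -> neural LM score
    rescore_fn: combines a lattice with its score
    """
    out = {}
    for key, lat in lattices.items():
        if key not in nn_scores:
            logging.warning("no score for key %s; output lattice unchanged", key)
            out[key] = lat  # pass through untouched
        else:
            out[key] = rescore_fn(lat, nn_scores[key])
    return out
```

This turns a missing-key data problem into a recoverable warning, though, as the next comment notes, a missing key usually points to an earlier script not producing all its expected output.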
... but it likely indicates some kind of problem in a previous script, e.g. did not have all the output it should have had.
So I don't really know where to start debugging. When I run rescoring with n-best, everything is OK. I will try removing the assert to see what error I get, if any, and will give you the details of the log. Thanks.
I attach the error message here: @danpovey
I think I see the issue. It's not actually OK to limit the number of paths in lattice-path-cover, because the rescoring logic relies on all arcs being covered by at least one path.
It might be necessary to include a 'lattice-limit-depth' pruning command at an earlier stage in the script, i.e. when dumping lattices, to limit the number of paths in the lattice.
But, as I said, the experiment I ran before adding the code limiting the number of paths in lattice-path-cover also had this error with lattice rescoring. Anyway, I am not sure how to add the lattice-limit-depth option as you suggest. Can you give me a detailed fix? Thanks.
Sorry I don't have time for such detailed help.
Yep, thank you. So can we discuss the lattice-limit-depth idea again? I understand that I need to change some code in lattice-expand, but I don't think this is the root cause, because even when I don't change anything related to limiting paths, I still get this assertion error when running lattice rescoring. So could you think about it some more and give me another idea?
Finding the root cause of your problem would require some debugging, looking at files, etc. You need to do that yourself as best you can. The lattice-limit-depth thing would be a script-level change, adding a new command in a pipe.
Thank you. I will try it myself.
@danpovey I've been looking into adding some pruning in the iterative rescore; do you think that lattice-limit-depth is a better fit than lattice-determinize-pruned?
It would make sense, as the number of paths to cover is more tied to the depth than to the posteriors of the arcs.
If you are trying to avoid OOM at a later stage, lattice-limit-depth will tend to be a better tradeoff, I think.
I'm rather trying to reach the same WERs as my previous rescoring method without making batches too big. I'm still a bit off in terms of WERs, despite getting batches of size 10,000 for some segments, which isn't going to be great for RTF.
@rikrd can you share your solution for fixing the CUDA OOM?
Hi all, I have a problem when running lattice rescoring with a Transformer, following the script kaldi/egs/wsj/s5/local/pytorchnn/run_nnlm.sh. The script lmrescore_lattice_pytorchnn.sh fails when computing the neural LM rescore for some utterances because the lattice has so many arcs; for example, an utterance of shape [29, 1100] (max length 29, and 1100 arc paths). My CUDA memory isn't large enough to compute this batch. I am trying to reduce the number of arc paths, but I don't think it's easy because it interacts with other components. So please, can you give me an idea how to fix it? Thanks.
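As a rough, back-of-the-envelope illustration of why a [29, 1100] batch can exhaust GPU memory: the output logits tensor alone scales with max_len × num_paths × vocab. The vocabulary size below is an assumption (word ids around 30,000 appear earlier in the thread), not a number from the actual model:

```python
# Hypothetical sizing of the logits tensor for one over-large batch.
max_len, num_paths = 29, 1100
vocab = 31000          # assumed vocab size, consistent with word ids ~30k above
bytes_fp32 = 4         # float32

logits_bytes = max_len * num_paths * vocab * bytes_fp32
print(logits_bytes)                      # 3955600000
print(round(logits_bytes / 2**30, 2))    # 3.68 (GiB)
```

Under these assumptions, nearly 4 GiB is needed for the logits alone, before hidden activations and attention buffers, which is why capping num_paths (or splitting the batch) is the natural lever.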