k2-fsa / icefall

https://k2-fsa.github.io/icefall/
Apache License 2.0

Hyps and sclite etc. and the role of Lhotse #209

Open danpovey opened 2 years ago

danpovey commented 2 years ago

Creating this issue from an email thread initiated by @ngoel17. Nagendra:

Hi Piotr and Daniel,

I notice that icefall stores transcripts using this: https://github.com/k2-fsa/icefall/blob/be1c86b06cbaa3b3d63f0b129ee54293176d95b5/icefall/utils.py#L311

It's kind of useless for later processing (such as with sclite) because the audio segment ID is missing from this output.

If the output is in CTM format, it is more usable for sclite and similar tools. I am also wondering how I would (or could) use this output to create special supervisions in Lhotse? Sometimes I may want it to serve as the supervision for future training, and sometimes I would use it to filter the training data.

Is there a standard way of doing all this, which I haven't noticed yet? Nagendra

@pzelasko :

I think it’d be straightforward to add a method for SupervisionSet <=> CTM mapping (both ways), but I don’t believe anybody has implemented it yet. In the decoding script you’d have to read out segment IDs from `cuts[i].supervisions[0].id` for the i-th example in the mini-batch.
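Since the thread notes that no SupervisionSet => CTM method exists in Lhotse yet, here is a minimal sketch of what such an export could look like. The helper name and the plain-tuple input are hypothetical; only the CTM line layout (recording id, channel, start time, duration, word) follows the standard format. Segment-level supervisions carry no per-word alignments, so word timings are approximated by splitting the segment duration evenly:

```python
# Hypothetical sketch of a SupervisionSet -> CTM export; NOT part of Lhotse.
# A CTM line is: <recording-id> <channel> <start-time> <duration> <word>

def supervisions_to_ctm(supervisions):
    """Format (recording_id, channel, start, duration, text) records as CTM lines.

    Word-level timings are approximated by dividing the segment duration
    evenly across the words, since segment-level supervisions do not carry
    per-word alignments.
    """
    lines = []
    for rec_id, channel, start, duration, text in supervisions:
        words = text.split()
        if not words:
            continue
        step = duration / len(words)
        for i, word in enumerate(words):
            lines.append(f"{rec_id} {channel} {start + i * step:.2f} {step:.2f} {word}")
    return lines
```

A real implementation would iterate over an actual `SupervisionSet` and pull these fields from each `SupervisionSegment`, but the formatting logic would be the same.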

@ngoel17:

Thanks, Piotr, and also for the help earlier. I notice the order is preserved; right now I just rely on that order, but we are somewhat at the mercy of the PyTorch dataloader, which could technically be allowed to re-arrange things. I like the idea that there could be multiple supervisions in a cut, but then I will always be looking at supervisions[0] and ignoring supervisions[1] even if it exists. I am just thinking aloud; no need to reply unless I got it all wrong.

@pzelasko:

You got it all right actually :)

Re the PyTorch dataloader: you can use lhotse.dataset.SimpleCutSampler(test_cuts, shuffle=False) as your sampler, and it will preserve the input order (i.e. the same one you have in the CutSet). It will be a bit less efficient to decode because there will be more padding, though. You’d also likely need to set num_workers=0 or num_workers=1 in the DataLoader, otherwise the cuts might get re-ordered by competing parallel workers.

If you work with multi-supervision cuts, you can have nested for loops to handle cuts + their supervisions; again if you care about the order, you can use non-reordering samplers to ensure it’s not changed. Or you can re-sort the data later (e.g. dump the CutSet with recognized texts, then use cuts.sort_like(other_cuts) and dump again).

@pzelasko:

Hmm, I just remembered I had a function for sclite export in snowfall’s GigaSpeech recipe (snowfall is now deprecated).

https://github.com/k2-fsa/snowfall/blob/911198817edc7b306265f32447ef8a7dc5cfa8f2/snowfall/common.py#L370-L378

used here:

https://github.com/k2-fsa/snowfall/blob/master/egs/gigaspeech/asr/simple_v1/mmi_att_transformer_decode.py#L604
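For reference, one sclite-friendly output is the "trn" transcript format, where each line is the hypothesis text followed by the utterance id in parentheses. The writer below is a sketch in the spirit of the snowfall helper linked above, not its actual code:

```python
# Sketch of a writer for sclite's trn transcript format; not the actual
# snowfall implementation. Each line: "WORD WORD ... (utt_id)"

def write_trn(path, results):
    """Write (utt_id, hyp_words) pairs in sclite trn format, one per line."""
    with open(path, "w") as f:
        for utt_id, words in results:
            f.write(f"{' '.join(words)} ({utt_id})\n")
```

With both a reference and a hypothesis file in this format, scoring is a matter of `sclite -r ref.trn trn -h hyp.trn trn -i rm -o all`.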

ngoel17 commented 2 years ago

Just for the sake of generality: sometimes we like to decode without a reference transcript (supervisions unavailable), because we want to get an idea of the output even if we cannot compute the WER. It would be good if that case were also covered, i.e. cases where only a wav.scp is present, or wav.scp plus a segments file (in Kaldi format). The C++ executables in k2 will probably handle such things with proper wrappers, but it would be good if the Lhotse-based Python environment also had a prototype.
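For the supervision-free case, the parsing side of Kaldi's wav.scp + segments is straightforward; the sketch below (hypothetical helper, plain dicts instead of Lhotse manifest objects) shows the two cases Nagendra mentions. Lhotse's own Kaldi importer (the `lhotse kaldi import` CLI, as I understand it) handles the full format including recordings:

```python
# Hypothetical sketch: build minimal segment records from Kaldi-style
# wav.scp and (optionally) segments lines, with no reference text --
# just enough identity/timing info to drive decoding and later CTM output.

def parse_kaldi_segments(wav_scp_lines, segments_lines=None):
    # wav.scp line: <recording-id> <wav-path-or-command>
    wavs = dict(line.split(maxsplit=1) for line in wav_scp_lines)
    if segments_lines is None:
        # Only wav.scp: one whole-recording segment per entry;
        # the duration would come from reading the audio header.
        return [{"id": rid, "recording": rid, "start": 0.0, "end": None}
                for rid in wavs]
    # segments line: <segment-id> <recording-id> <start> <end>
    records = []
    for line in segments_lines:
        seg_id, rec_id, start, end = line.split()
        records.append({"id": seg_id, "recording": rec_id,
                        "start": float(start), "end": float(end)})
    return records
```

Records like these could then be turned into supervision-less cuts for decoding, with the ids flowing through to the CTM/trn output discussed above.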