rwth-i6 / i6_core

Sisyphus recipies for ASR
Mozilla Public License 2.0
16 stars 23 forks source link

SearchWordsDummyTimesToCTMJob #503

Closed albertz closed 6 months ago

albertz commented 6 months ago

While testing this, I noticed, I think sclite requires that the sequences are in the same order as in the STM file. The code here just uses the same order as the search output, which is often a different order (e.g. reversely ordered by seq len). So, this is not useful in this form. There needs to be another option which defines the seq order. E.g. via an existing reference text-dict file (e.g. via CorpusToTextDictJob) or maybe the STM itself.

JackTemaki commented 6 months ago

While testing this, I noticed, I think sclite requires that the sequences are in the same order as in the STM file.

Doesn't the sort_files flag of ScliteJob solve this?

albertz commented 6 months ago

Doesn't the sort_files flag of ScliteJob solve this?

Ah maybe it does, I did not knew about this option. Will try soon. But I think this option seq_order_file might anyway be useful. But then I make it optional here.

sarahberanek commented 6 months ago

Doesn't the sort_files flag of ScliteJob solve this?

Ah maybe it does, I did not knew about this option. Will try soon. But I think this option seq_order_file might anyway be useful. But then I make it optional here.

If you are using hubscr.pl from sclite as part of the pipeline, the sorting will be done automatically. Thats at least how it is done in the ApptekScorer from apptek_asr to score.

And if you are using "dummy time stamps" here I just recommend to use the FullFileApptekScorer so you are good to score over the whole audio file without any time stamps taken into account. :thinking:

albertz commented 6 months ago

Ok good to know.

So, what about the PR here now? Code is good now? Can be merged?

Atticus1806 commented 6 months ago

Code looks good to me, I am just wondering: Do we maybe want to have an if check for the assert isinstance(d, dict), "only search output file with dict format is supported" and raise NotImplementedError? Then we can maybe extend this job more freely in the future. But I dont have a strong opinion on this.

albertz commented 6 months ago

Code looks good to me, I am just wondering: Do we maybe want to have an if check for the assert isinstance(d, dict), "only search output file with dict format is supported" and raise NotImplementedError? Then we can maybe extend this job more freely in the future. But I dont have a strong opinion on this.

I don't think the type of exception (AssertionError vs NotImplementedError) is really critical here, and you save one line of code with assert.