Closed albertz closed 6 months ago
While testing this, I noticed, I think sclite requires that the sequences are in the same order as in the STM file.
Doesn't the sort_files
flag of ScliteJob solve this?
Doesn't the
sort_files
flag of ScliteJob solve this?
Ah maybe it does, I did not knew about this option. Will try soon. But I think this option seq_order_file
might anyway be useful. But then I make it optional here.
Doesn't the
sort_files
flag of ScliteJob solve this?Ah maybe it does, I did not knew about this option. Will try soon. But I think this option
seq_order_file
might anyway be useful. But then I make it optional here.
If you are using hubscr.pl
from sclite as part of the pipeline, the sorting will be done automatically. Thats at least how it is done in the ApptekScorer
from apptek_asr
to score.
And if you are using "dummy time stamps" here I just recommend to use the FullFileApptekScorer
so you are good to score over the whole audio file without any time stamps taken into account. :thinking:
Ok good to know.
So, what about the PR here now? Code is good now? Can be merged?
Code looks good to me, I am just wondering: Do we maybe want to have an if check for the assert isinstance(d, dict), "only search output file with dict format is supported"
and raise NotImplementedError
? Then we can maybe extend this job more freely in the future. But I dont have a strong opinion on this.
Code looks good to me, I am just wondering: Do we maybe want to have an if check for the
assert isinstance(d, dict), "only search output file with dict format is supported"
andraise NotImplementedError
? Then we can maybe extend this job more freely in the future. But I dont have a strong opinion on this.
I don't think the type of exception (AssertionError
vs NotImplementedError
) is really critical here, and you save one line of code with assert
.
While testing this, I noticed, I think sclite requires that the sequences are in the same order as in the STM file. The code here just uses the same order as the search output, which is often a different order (e.g. reversely ordered by seq len). So, this is not useful in this form. There needs to be another option which defines the seq order. E.g. via an existing reference text-dict file (e.g. via
CorpusToTextDictJob
) or maybe the STM itself.