Open Jeronymous opened 1 year ago
As I wrote on #125 :
The paper only mentions how the "true positives" are counted, which is kind of the easy part. But what is done when a target word corresponds to several predicted words (for instance)? How exactly are silence segments taken into account? I'm discussing this with a colleague and we are imagining several implementations that make sense. Clarifying this would be super helpful for reproducing the benchmark.
It would be awesome to have an update of the paper regarding this.
Interested to hear about this issue as well. As currently explained, the metric seems interesting, but there are several ways (with different results) to implement it.
@m-bain Any chance to have an answer on this?
Even an isolated, non-working piece of code would help to understand how recall/precision are computed exactly.
Thanks
Hi,
Sorry for the delay.
It is simply calculated as so:
Precision:
For each predicted word, check whether an exact string match (after normalization) occurs in the list of ground-truth words, within the collar margin. If yes, count it as a TP.
Recall:
For each ground-truth word, do the same as above, but matched against the predicted words.
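A minimal sketch of how that description could be implemented. The word dictionary format, the collar value, and the matching on start times are all assumptions on my part (the answer above doesn't specify them), not the authors' actual code:

```python
# Hypothetical sketch of the precision/recall computation described above.
# Assumptions: each word is a dict with normalized "text" and a "start" time
# in seconds, and "within collar margin" means the start times differ by at
# most `collar`. Neither detail is confirmed by the paper or the reply above.

def word_match(word_a, word_b, collar=0.2):
    """Two words match if their normalized strings are equal and their
    timestamps fall within the collar margin of each other."""
    return (word_a["text"] == word_b["text"]
            and abs(word_a["start"] - word_b["start"]) <= collar)

def precision_recall(predicted, ground_truth, collar=0.2):
    # Precision: fraction of predicted words that have a match in the ground truth.
    tp_pred = sum(any(word_match(p, g, collar) for g in ground_truth)
                  for p in predicted)
    # Recall: fraction of ground-truth words that have a match in the predictions.
    tp_gt = sum(any(word_match(g, p, collar) for p in predicted)
                for g in ground_truth)
    precision = tp_pred / len(predicted) if predicted else 0.0
    recall = tp_gt / len(ground_truth) if ground_truth else 0.0
    return precision, recall
```

This is exactly where the ambiguity raised in the original question shows up: with `any(...)`, one ground-truth word can match several predicted words (and vice versa), whereas a one-to-one assignment would give different numbers.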
Is it possible to see the code used to compute the evaluation metrics as in the paper ("WhisperX: Time-Accurate Speech Transcription of Long-Form Audio")? (as the devil's in the details...)