jitsi / jiwer

Evaluate your speech-to-text system with similarity measures such as word error rate (WER)
Apache License 2.0

Get indices of I, D, S #90

Open hallelhel opened 1 week ago

hallelhel commented 1 week ago

I am trying to get the indices of the words that were substituted, inserted and deleted. I found alignments.alignment_chunk and it seems to work well, but the indices it gives always seem to depend on the hypothesis text. For example, if the reference is: "I trying to get indecies for words subtituted, inserted and deleted I found the alignments.alignment_chunk and seems not bad but it" and the hypothesis is: "for words"

The number of deleted words is right, but the deletion indices depend on the length of the hypothesis. I mean that the word at index 7 in the reference was deleted, and I did not get that index from alignment_chunk. Is there a way to get all the indices in the sentence that were deleted?

nikvaessen commented 1 week ago

With the following code

import jiwer

ref = "I trying to get indecies for words subtituted, inserted and deleted I found the alignments.alignment_chunk and seems not bad but it"
hyp = "for words"

r = jiwer.process_words(ref, hyp)

for a in r.alignments[0]:
    print(a)

you get these alignment chunks:

AlignmentChunk(type='delete', ref_start_idx=0, ref_end_idx=5, hyp_start_idx=0, hyp_end_idx=0)
AlignmentChunk(type='equal', ref_start_idx=5, ref_end_idx=7, hyp_start_idx=0, hyp_end_idx=2)
AlignmentChunk(type='delete', ref_start_idx=7, ref_end_idx=21, hyp_start_idx=2, hyp_end_idx=2)

meaning that in the reference, indices 0, 1, 2, 3 and 4 are deleted, as well as indices 7, ..., 20. Note that ref_end_idx is excluded from the range.
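As a minimal sketch of how the half-open ranges expand into word indices, the snippet below rebuilds the three chunks by hand and collects the deleted positions. The Chunk namedtuple is a hypothetical stand-in for jiwer's AlignmentChunk (used so the sketch runs without jiwer installed), and its values are copied from the output above.

```python
from collections import namedtuple

# Hypothetical stand-in for jiwer's AlignmentChunk; only the field names are mirrored.
Chunk = namedtuple("Chunk", "type ref_start_idx ref_end_idx hyp_start_idx hyp_end_idx")

ref = "I trying to get indecies for words subtituted, inserted and deleted I found the alignments.alignment_chunk and seems not bad but it"
ref_words = ref.split()

# Values copied from the alignment chunks printed above.
chunks = [
    Chunk("delete", 0, 5, 0, 0),
    Chunk("equal", 5, 7, 0, 2),
    Chunk("delete", 7, 21, 2, 2),
]

deleted_idx = []
for c in chunks:
    if c.type == "delete":
        # ref_end_idx is exclusive, which matches Python's range() convention
        deleted_idx.extend(range(c.ref_start_idx, c.ref_end_idx))

# Look the indices up in the reference to see which words they refer to.
deleted_words = [ref_words[i] for i in deleted_idx]
kept_words = [w for i, w in enumerate(ref_words) if i not in deleted_idx]

print(deleted_idx)
print(kept_words)
```

All deletion indices are positions in the reference, independent of how short the hypothesis is; only "for" and "words" (indices 5 and 6) remain.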

This can also be observed with a call to jiwer.visualize_alignment:

import jiwer

ref = "I trying to get indecies for words subtituted, inserted and deleted I found the alignments.alignment_chunk and seems not bad but it"
hyp = "for words"

r = jiwer.process_words(ref, hyp)
print(jiwer.visualize_alignment(r, show_measures=False))

which returns

sentence 1
REF: I trying to get indecies for words subtituted, inserted and deleted I found the alignments.alignment_chunk and seems not bad but it
HYP: * ****** ** *** ******** for words *********** ******** *** ******* * ***** *** ************************** *** ***** *** *** *** **
     D      D  D   D        D                     D        D   D       D D     D   D                          D   D     D   D   D   D  D

hallelhel commented 1 week ago

Thanks :) It looks like when the two sentences share only a minor match (only specific words appear in both sentences), the first word is always substituted. For example, I have these two sentences: ref = 'On Monday, the French newspaper Le Parisien reported that a couple who arrived with their three-year-old daughter at a hotel in Paris 15th arrondissement encountered a receptionist who refused to confirm their reservation, and even threw them out into the street while telling them: "You will not get a room in this . The family filed a complaint with the Paris police.'

hyp = 'why couple apple hotel banana with the Paris police'

The result:
substitution array - [0, 11, 20, 61]
deletion array - [1, 2, 3, 4, 5, 6, 7, 8, 9, 12, 13, 14, 15, 16, 17, 18, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57]

I want to get the indices of the words for every type (S, D, I), and it looks a bit tricky.

nikvaessen commented 1 week ago

You can get the index arrays as follows:

import jiwer

ref = "On Monday, the French newspaper Le Parisien reported that a couple who arrived with their three-year-old daughter at a hotel in Paris 15th arrondissement encountered a receptionist who refused to confirm their reservation, and even threw them out into the street while telling them: You will not get a room in this. The family filed a complaint with the Paris police."
hyp = "why couple apple hotel banana with the Paris police"

r = jiwer.process_words(ref, hyp)

sub_idx = []
del_idx = []
ins_idx = []

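for a in r.alignments[0]:
    # end indices are exclusive, so range() yields exactly the words in the chunk
    ref_idx = range(a.ref_start_idx, a.ref_end_idx)
    if a.type == "substitute":
        sub_idx.extend(ref_idx)
    elif a.type == "delete":
        del_idx.extend(ref_idx)
    elif a.type == "insert":
        # an insert chunk covers no reference words (ref_start_idx == ref_end_idx),
        # so inserted words are indexed in the hypothesis instead
        ins_idx.extend(range(a.hyp_start_idx, a.hyp_end_idx))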

"It looks like when the two sentences share only a minor match (only specific words appear in both sentences), the first word is always substituted."

Yes, this seems expected, and I don't see the issue.
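
One way to see why a substitution is expected here: WER-style alignments minimise the total number of edits, and turning an unmatched reference word into an unmatched hypothesis word costs one substitution, whereas deleting one and inserting the other costs two. The sketch below is a plain unit-cost word-level edit distance written for illustration. It is not jiwer's implementation, and the function name word_edit_distance and the shortened example sentences are hypothetical. It also does not show which unmatched reference word gets paired with the substitution, since that depends on how the aligner breaks ties.

```python
def word_edit_distance(ref_words, hyp_words, allow_substitution=True):
    """Unit-cost edit distance between two word lists (dynamic programming)."""
    n, m = len(ref_words), len(hyp_words)
    # dist[i][j] = cost of turning the first i reference words
    # into the first j hypothesis words
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i  # delete all i reference words
    for j in range(1, m + 1):
        dist[0][j] = j  # insert all j hypothesis words
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            options = [
                dist[i - 1][j] + 1,  # delete a reference word
                dist[i][j - 1] + 1,  # insert a hypothesis word
            ]
            if ref_words[i - 1] == hyp_words[j - 1]:
                options.append(dist[i - 1][j - 1])  # words are equal, no cost
            elif allow_substitution:
                options.append(dist[i - 1][j - 1] + 1)  # substitute
            dist[i][j] = min(options)
    return dist[n][m]


# Shortened, hypothetical example in the spirit of the sentences above.
ref_words = "On Monday a couple".split()
hyp_words = "why couple".split()

with_sub = word_edit_distance(ref_words, hyp_words)
without_sub = word_edit_distance(ref_words, hyp_words, allow_substitution=False)
print(with_sub, without_sub)
```

With substitutions allowed the cheapest alignment costs 3 edits (one substitution and two deletions); without them it costs 4 (one insertion and three deletions), so a minimum-cost alignment will pair "why" with one of the unmatched reference words rather than report it as an insertion.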