getting times per phone in n-best list

danpovey commented 3 years ago

Guys, Does anyone want to take on the task of getting expected times per phone in n-best lists?

FINALLY submitting this to the right repo.

Would be something like:

top-level objective function is phone-bigram LF-MMI type.
decode to get lattice, we can decide whether to use the phone-bigram or word-unigram (or any other) decoding graph, not important for now.
get n-best unique phone sequences per lattices, keep multiplicity.
Do numerator alignment for each of the n paths. (will require the a_to_b_map option of intersect_dense).
Using the seqphone_idx attribute, created when the n-best paths were generated by adding a range expression as an attribute (this counts just phones, not epsilons) to get a sparse matrix of posteriors with row-index == frame_idx and col-index == seqphone_idx.
Multiply the posteriors by the frame_idx and then sum over the columns of the sparse matrix of posteriors to get the total (frame_idx * occupation prob) for each seqphone_idx, and divide by the total occupation_prob for each seqphone_idx (This normalization is required because each phone may appear multiple times, unless we filter by a 'phone' attribute).

csukuangfj commented 3 years ago

I can start to implement it tomorrow if no one else is going to take it .

Wednesday, 10 February 2021, 15:59 +0800 from notifications@github.com notifications@github.com:

Guys, Does anyone want to take on the task of getting expected times per phone in n-best lists? FINALLY submitting this to the right repo. Would be something like:

top-level objective function is phone-bigram LF-MMI type.

decode to get lattice, we can decide whether to use the phone-bigram or word-unigram (or any other) decoding graph, not important for now.

get n-best unique phone sequences per lattices, keep multiplicity.

Do numerator alignment for each of the n paths. (will require the a_to_b_map option of intersect_dense).

Using the seqphone_idx attribute, created when the n-best paths were generated by adding a range expression as an attribute (this counts just phones, not epsilons) to get a sparse matrix of posteriors with row-index == frame_idx and col-index == seqphone_idx.

Multiply the posteriors by the frame_idx and then sum over the columns of the sparse matrix of posteriors to get the total (frame_idx * occupation prob) for each seqphone_idx, and divide by the total occupation_prob for each seqphone_idx (This normalization is required because each phone may appear multiple times, unless we filter by a 'phone' attribute). — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub , or unsubscribe .

danpovey commented 3 years ago

Great!

On Wed, Feb 10, 2021 at 4:07 PM Fangjun Kuang notifications@github.com wrote:

I can start to implement it tomorrow if no one else is going to take it .

Wednesday, 10 February 2021, 15:59 +0800 from notifications@github.com < notifications@github.com>:

Guys, Does anyone want to take on the task of getting expected times per phone in n-best lists? FINALLY submitting this to the right repo. Would be something like:

top-level objective function is phone-bigram LF-MMI type.

decode to get lattice, we can decide whether to use the phone-bigram or word-unigram (or any other) decoding graph, not important for now.

get n-best unique phone sequences per lattices, keep multiplicity.

Do numerator alignment for each of the n paths. (will require the a_to_b_map option of intersect_dense).

Using the seqphone_idx attribute, created when the n-best paths were generated by adding a range expression as an attribute (this counts just phones, not epsilons) to get a sparse matrix of posteriors with row-index == frame_idx and col-index == seqphone_idx.

Multiply the posteriors by the frame_idx and then sum over the columns of the sparse matrix of posteriors to get the total (frame_idx * occupation prob) for each seqphone_idx, and divide by the total occupation_prob for each seqphone_idx (This normalization is required because each phone may appear multiple times, unless we filter by a 'phone' attribute). — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub , or unsubscribe .

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/k2-fsa/snowfall/issues/96#issuecomment-776522819, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAZFLOY4KKRDDTPAP52K3H3S6I5D5ANCNFSM4XMPCITA .

csukuangfj commented 3 years ago

Sorry, I am busy these days. Will have time after the festival (the day after tomorrow).

danpovey commented 3 years ago

cool

On Monday, February 15, 2021, Fangjun Kuang notifications@github.com wrote:

Sorry, I am busy these days. Will have time after the festival (the day after tomorrow).

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/k2-fsa/snowfall/issues/96#issuecomment-778869197, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAZFLOZZGCJMRTULND63RTTS7BSQLANCNFSM4XMPCITA .

csukuangfj commented 3 years ago

@danpovey I still have a few doubts about the comments.

(1)

Do numerator alignment for each of the n paths. (will require the a_to_b_map option of intersect_dense).

https://github.com/k2-fsa/k2/issues/641 says

      paths_lats = k2.intersect_device(mbr_lats, paths_decoding_graphs, path_to_seq_map)

As you are using numerator alignment in this issue, are the n paths generated from the num_graph and are we going to use

      paths_lats = k2.intersect_dense(paths_decoding_graphs, dense_fsa_vec, path_to_seq_map)

(2)

Using the seqphone_idx attribute, created when the n-best paths were generated by adding a range expression as an attribute (this counts just phones, not epsilons) to get a sparse matrix of posteriors with row-index == frame_idx and col-index == seqphone_idx.

Multiply the posteriors by the frame_idx and then sum over the columns of the sparse matrix of posteriors to get the total (frame_idx * occupation prob) for each seqphone_idx, and divide by the total occupation_prob for each seqphone_idx (This normalization is required because each phone may appear multiple times, unless we filter by a 'phone' attribute).

Just want to make sure that frame_idx is not a typo here. Do you mean frame_idx, seqframe_idx or pathframe_idx?

(3)

get n-best unique phone sequences per lattices, keep multiplicity.

How is the multiplicity information going to be used?

danpovey commented 3 years ago

@danpovey I still have a few doubts about the comments.

(1)

Do numerator alignment for each of the n paths. (will require the a_to_b_map option of intersect_dense).

k2-fsa/k2#641 says
      paths_lats = k2.intersect_device(mbr_lats, paths_decoding_graphs, path_to_seq_map)
As you are using numerator alignment in this issue, are the n paths generated from the num_graph and are we going to use
      paths_lats = k2.intersect_dense(paths_decoding_graphs, dense_fsa_vec, path_to_seq_map)
Oh yes, dense not device.

(2)

Using the seqphone_idx attribute, created when the n-best paths were generated by adding a range expression as an attribute (this counts just phones, not epsilons) to get a sparse matrix of posteriors with row-index == frame_idx and col-index == seqphone_idx.

Multiply the posteriors by the frame_idx and then sum over the columns of the sparse matrix of posteriors to get the total (frame_idx * occupation prob) for each seqphone_idx, and divide by the total occupation_prob for each seqphone_idx (This normalization is required because each phone may appear multiple times, unless we filter by a 'phone' attribute).

Just want to make sure that frame_idx is not a typo here. Do you mean frame_idx, seqframe_idx or pathframe_idx?

I meant frame_idx. The idea is that we are getting the weighted-averaged times. Doing this on the frame_idx will be better for roundoff reasons, since it's much smaller, even if we later decide for some reason to map back to the seqframe_idx or pathframe_idx when we actually use it.

(3)

get n-best unique phone sequences per lattices, keep multiplicity.

How is the multiplicity information going to be used?

It will be used after we get the n-best list from the 2nd-pass lattices and count which sequence had the most counts. The idea is that we're finding which sequence has the most probability mass after the 2nd pass, and we view the 1st-pass sequences as pseudo-random samples (actually samples taken at fixed intervals) from a distribution.

csukuangfj commented 3 years ago

Thanks.

csukuangfj commented 3 years ago

I find that

      paths_lats = k2.intersect_dense(paths_decoding_graphs, dense_fsa_vec, 10.0, path_to_seq_map)

may produce empty FSAs.

Assume the number of input frames is 5 and the transcript is c b a.

get n-best unique phone sequences per lattices, keep multiplicity.

We may get a path with phone sequence c b b b a, which means it must have at least 7 input frames in order to get a non-empty intersection result when invoking k2.intersect_dense, because the phone sequence c b b b a contains repeated phones and it needs at least one blank to separate the repeated phones.

@danpovey Will the empty FSAs be a problem?

danpovey commented 3 years ago

Mm, the original paths should not have the phones repeated unless they were repeated in the original phone sequence, assuming we originally got them from the 'phones' property (instead of the labels). So there should be no paths that have "too many" phones, since they survived the first decoding.

Also regarding the blanks separating the phones: these should all be optional.

csukuangfj commented 3 years ago

Turns out there is a bug in k2.compose when setting the inner_labels. Fixing it.

csukuangfj commented 3 years ago

I meant frame_idx.

Since a single frame_idx may correspond to multiple pathframe_idx, are the posteriors of these pathframe_idx combined with addition?

danpovey commented 3 years ago

The frame_idx only participates as if it were a floating-point scale, in a weighted sum, to find the average frame (average time within the sequence). The actual indexes are still seqframe_idx.

csukuangfj commented 3 years ago

The actual indexes are still seqframe_idx.

Ah, I see. So in the folliwng, is row_index actually seqframe_idx, not frame_idx ?

Using the seqphone_idx attribute, created when the n-best paths were generated by adding a range expression as an attribute (this counts just phones, not epsilons) to get a sparse matrix of posteriors with row-index == frame_idx and col-index == seqphone_idx.

danpovey commented 3 years ago

yes you are right, sorry.

csukuangfj commented 3 years ago

https://github.com/k2-fsa/k2/issues/641 says:

This will get propagated from mbr_lats into paths_lats, but that's not actually the seqframe_idx we want to create the posteriors (since it collapses the posteriors from different paths of the same FSA).

and https://github.com/k2-fsa/k2/issues/641 uses pathframe_idx_to_seqframe_idx to build num_phone_post_paths from num_phone_post:

    num_phone_post_paths = num_phone_post[pathframe_idx_to_seqframe_idx]

In this issue, we want a sparse matrix of posteriors with col_index == seqphone_idx but the num_lats does not have the attribute seqphone_idx. @danpovey Would you mind describing the creation of the sparse matrix of posteriors in this issue?

csukuangfj commented 3 years ago

@danpovey Would you mind describing the creation of the sparse matrix of posteriors in this issue?

I know that the posteriors are contained in num_lats, but num_lats does not contain seqphone_idx. path_lats contains seqphone_idx but it does not contain posteriors.

Cannot figure out how to create the sparse matrix in this case.

danpovey commented 3 years ago

Deleted my previous reply as it was wrong. We resolved this offline. The plan right now may be a little different from things I sketched out. Our main task is to get the times of the phones in the paths that we use for the 2nd-pass alignment, so we can create embeddings. We plan to attach an attribute to the 2nd-pass alignment graphs, called pathphone_idx, and later on, when we have the 2nd-pass lattices, create posteriors that are a sparse matrix indexed (pathframe_idx, pathphone_idx). We can manipulate the columns of this sparse matrix to get average times.

In fact, we could perhaps convert pathframe_idx to frame_idx and compute weighted averages or weighted sums per column to get the average frame-index for each pathphone_idx; this will cause less roundoff than averaging on pathframe_idx, but perhaps with double precision it would be OK regardless. We'll probably later use the times derived from this process to create embeddings by interpolating between frames.

k2-fsa / snowfall

getting times per phone in n-best list #96