k2-fsa / text_search

Some fast-ish algorithms for batch text search in moderate-sized collections, intended for data cleanup
https://k2-fsa.github.io/text_search/

Associate multiple versions of reference #54

Open pkufool opened 1 year ago

pkufool commented 1 year ago

Some audio might have multiple versions of the reference (for example, YouTube has both automatic and manual subtitles). To associate both references with the audio segments, I think we can first align the audio with one of the references, which gives us the "begin_byte" and "end_byte" of the chosen reference for each segment. We can then associate the second reference by doing a Levenshtein alignment between the segment's text in the first reference and the substring of the second reference determined by "begin_byte" and "end_byte" (extended somewhat on both sides, of course). This assumes the two references are very close. Done this way, we don't have to change the core part of our code: we just store the second reference in a custom field in the cut and associate it when writing out the results. The segments are short (from 2 to 30 seconds), so the Levenshtein alignment would be very fast, and we can run it in parallel on multiple CPU cores.
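To make the idea concrete, here is a minimal sketch in pure Python. It is not code from this project: the function name `locate_in_second_reference` and the `pad` parameter are hypothetical, and it uses a semi-global variant of Levenshtein alignment (the segment may start and end anywhere inside the padded window at no cost), which is one possible reading of "some extending on both sides". It also assumes both references are available as bytes.

```python
from typing import Tuple


def locate_in_second_reference(
    ref1: bytes, ref2: bytes, begin_byte: int, end_byte: int, pad: int = 50
) -> Tuple[int, int, int]:
    """Return (begin, end, distance) of the span in ref2 that best matches
    ref1[begin_byte:end_byte]. `pad` (hypothetical) is how far the search
    window in ref2 is extended on each side."""
    query = ref1[begin_byte:end_byte]
    lo = max(0, begin_byte - pad)
    hi = min(len(ref2), end_byte + pad)
    window = ref2[lo:hi]
    m = len(window)

    # Row 0: an empty query matches at any window position with cost 0.
    prev_dist = [0] * (m + 1)
    prev_start = list(range(m + 1))  # where the matched substring starts

    for i in range(1, len(query) + 1):
        cur_dist = [i] + [0] * m
        cur_start = [0] * (m + 1)
        for j in range(1, m + 1):
            sub = prev_dist[j - 1] + (query[i - 1] != window[j - 1])
            skip_query = prev_dist[j] + 1      # query byte has no counterpart
            skip_window = cur_dist[j - 1] + 1  # extra byte in the window
            best = min(sub, skip_query, skip_window)
            cur_dist[j] = best
            if best == sub:
                cur_start[j] = prev_start[j - 1]
            elif best == skip_query:
                cur_start[j] = prev_start[j]
            else:
                cur_start[j] = cur_start[j - 1]
        prev_dist, prev_start = cur_dist, cur_start

    # The end of the match is free too: take the cheapest end position.
    j_end = min(range(m + 1), key=prev_dist.__getitem__)
    return lo + prev_start[j_end], lo + j_end, prev_dist[j_end]


ref1 = b"the quick brown fox jumps over the lazy dog"
ref2 = b"Well, the quick brown-fox jumps over the lazy dog."
begin = ref1.index(b"brown")
end = begin + len(b"brown fox jumps")
b2, e2, dist = locate_in_second_reference(ref1, ref2, begin, end, pad=8)
print(ref2[b2:e2], dist)
```

The cost is proportional to the segment length times the window length, so segments of a few hundred bytes are cheap, and because each segment is independent the calls could be distributed over worker processes (for example with `multiprocessing.Pool`). Byte offsets can fall inside a multi-byte UTF-8 character, so a real implementation would likely need to snap the returned span to character boundaries.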

@danpovey @npovey If you need help with this, it would be good if you could share a small subset of the data with me so I can add an example recipe to the project.