usnistgov / SCTK

Other
208 stars 52 forks source link

Fix sclite by enabling larger backtrace indices. #34

Closed arlofaria closed 2 years ago

arlofaria commented 2 years ago

Problem: the sclite tool's dynamic programming data structure assumes that a short ought to be sufficient for storing the backtrace indices for an optimal alignment. However, when dealing with exceptionally large lists of alternatives, this assumption can be violated, resulting in overflow that leads to a segmentation fault as negative indices are accessed.

Solution: use an int which can represent a much greater range of indices. Alternatively, use size_t; an advantage of using int is that it makes it easier to debug when something goes wrong, since you'll spot the negative indices.

Drawbacks: This comes at the expense of increased memory usage. Also, the sclite tool may take a rather long time to compute alignments for these large inputs. Arguably, the current behavior of crashing is desirable since it ensures that sclite terminates quickly without using lots of memory. In my experiments with the Hub5 Switchboard evaluation, I can get worst-case scenarios where it runs for an hour or more and uses upwards of 10GiB of memory.

Context: this bug was exposed when leveraging the relatively uncommon functionality of representing hypothesis alternatives in the CTM format via the <ALT> tags. This can be useful for representing word-level alternatives (or even utterance-level, i.e. N-best lists) to compute a so-called "oracle" scoring metric in which the best-matching alternative is chosen for the alignment. This is probably not how sclite was intended to be used, but it can be quite interesting. For other tools that do this kind of oracle scoring, see:

There is a problem, however, when processing such CTM files through a script such as csrfilt.sh (which calls rfilter1) to apply a GLM to produce a "filtered" CTM that also includes such alternatives, e.g. for representing contractions or variant spellings. The problem is that sclite does not seem to properly handle alternative tags that are nested (e.g. you can't have <ALT_BEGIN>\n<ALT_BEGIN>...). It's not clear if that's a bug or the intended specification.

A workaround is to "expand" the nested alternatives as the Cartesian product of alternatives at a singly-nested level of alternatives. For example, I can create this original CTM to indicate word-level alternatives.

sw_4390 A * * <ALT_BEGIN>
sw_4390 A 4.49 0.66 um
sw_4390 A * * <ALT>
sw_4390 A 4.49 0.66 i'm
sw_4390 A * * <ALT_END> 

But filtering it with a GLM can produce these doubly-nested alternatives:

sw_4390 A * * <ALT_BEGIN>
sw_4390 A 4.490 0.660 %HESITATION
sw_4390 A * * <ALT>
sw_4390 A * * <ALT_BEGIN>
sw_4390 A 4.490 0.660 I'M
sw_4390 A * * <ALT>
sw_4390 A 4.490 0.330 I
sw_4390 A 4.820 0.330 AM
sw_4390 A * * <ALT_END>
sw_4390 A * * <ALT_END>

... which won't work correctly in sclite … though we can "expand" that to equivalent singly-nested alternatives:

sw_4390 A * * <ALT_BEGIN>
sw_4390 A 4.490 0.660 %HESITATION 0.8535
sw_4390 A * * <ALT>
sw_4390 A 4.490 0.660 I'M 0.8535
sw_4390 A * * <ALT>
sw_4390 A 4.490 0.330 I 0.8535
sw_4390 A 4.820 0.330 AM 0.8535
sw_4390 A * * <ALT_END>

This approach works fine for word-level alternatives ... but for utterance-level alternatives the expansions can become relatively large. An N-best list of, say, 100 alternatives could be expanded to several thousand alternatives if there are enough words that can expanded by the GLM. Multiply that for a relatively long utterance of a few dozen words and you're now dealing with several tens of thousands of potential backtrace indices ... once you get above 2**15, the short will overflow and you'll encounter a segfault.

jfiscus commented 2 years ago

Thanks @arlofaria for the detailed writeup. Indeed, the CTM alternation mechanism does not support recursive definitions. STM files do however. I always expected that I'd write a lattice importer to deal with the problems you noted but never had the chance/need.

With regard to the int vs. short back pointers, I don't think it's a good idea to make this an int because it will change the RAM usage a lot. I am going to make these changes instead: (1) fail before attempting an alignment that exceeds the matrix and point to the typedef to change if they need the extra search space and (2) change this to an unsigned short which will double the default size without increasing RAM. This will extend the matrix to 65535^2 which would consume 50GB of RAM.

jfiscus commented 2 years ago

Closing this MR. It was merged by hand.

arlofaria commented 2 years ago

Hi @jfiscus, glad to see that you're back!

Indeed, the CTM alternation mechanism does not support recursive definitions.

Do you think this would be a relatively easy issue (for someone else) to fix?

With regard to the int vs. short back pointers, I don't think it's a good idea to make this an int because it will change the RAM usage a lot.

That's fair. It would double the usage for most existing use cases and that might be considered a major backwards-incompatible change in functionality.

I am going to make these changes instead: (1) fail before attempting an alignment that exceeds the matrix and point to the typedef to change if they need the extra search space

Sounds good! That's much nicer than a segfault :-)

and (2) change this to an unsigned short which will double the default size without increasing RAM. This will extend the matrix to 65535^2 which would consume 50GB of RAM.

... that helps too, thanks!

jfiscus commented 2 years ago

@arlofaria good to be back. I finally have some breathing room to work on SCTK. I think it's be possible to implement recursive CTM alternations but if I had my druthers I'd like to see a lattice input instead. With a lattice you could avoid a lot of replication that a recursive definition forces you to use.