revdotcom / fstalign

An efficient OpenFST-based tool for calculating WER and aligning two transcript sequences.
Apache License 2.0

Rules containing wildcards #36

Closed (huangruizhe closed this issue 2 years ago)

huangruizhe commented 2 years ago

Hello, do you plan to support wildcards in the custom synonyms file?

For example:

[w]'s    |   [w] is
[w]'s    |   [w] has

where [w] matches any single word. This pattern is quite common, since [w] is often a person's or company's name.
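To make the request concrete, here is a minimal sketch of pre-expanding such a wildcard rule into plain synonym lines outside of fstalign. It is not fstalign code: the `[w]` placeholder is not something the tool understands, the helper names `words_with_suffix` and `expand_wildcard_rule` are made up for illustration, and the `left | right` output format is assumed from the rule lines above.

```python
def words_with_suffix(transcript, suffix="'s"):
    """Collect, in order of first appearance, the stems of tokens ending in `suffix`."""
    stems = []
    for token in transcript.lower().split():
        if token.endswith(suffix) and len(token) > len(suffix):
            stem = token[: -len(suffix)]
            if stem not in stems:
                stems.append(stem)
    return stems


def expand_wildcard_rule(rule, words):
    """Expand one hypothetical '[w]' rule into concrete 'left | right' lines."""
    lhs, rhs = (side.strip() for side in rule.split("|"))
    return [f"{lhs.replace('[w]', w)} | {rhs.replace('[w]', w)}" for w in words]


transcript = "Sally's here and Acme's growing"
stems = words_with_suffix(transcript)
rules = expand_wildcard_rule("[w]'s    |   [w] is", stems)
for line in rules:
    print(line)
```

Restricting the expansion to words that actually occur in the transcripts keeps the generated file small, but every expanded rule still applies wherever its left-hand side appears.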

huangruizhe commented 2 years ago

Another possible rule is:

[w1] [w2]    |   [w1][w2]

where the space between [w1] and [w2] is optional. For example, "touch tone" vs. "touchtone". What do you think?
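A rough sketch of how such compound rules could be generated per transcript pair, again outside of fstalign: the function name `compound_rules` is hypothetical, the `left | right` line format is assumed from the examples above, and it only proposes a rule when the joined form actually occurs in the other transcript.

```python
def compound_rules(transcript_a, transcript_b):
    """Emit 'w1 w2 | w1w2' lines for adjacent word pairs in A whose
    concatenation appears as a single token in B."""
    tokens_a = transcript_a.lower().split()
    vocab_b = set(transcript_b.lower().split())
    rules = []
    for w1, w2 in zip(tokens_a, tokens_a[1:]):
        joined = w1 + w2
        line = f"{w1} {w2} | {joined}"
        if joined in vocab_b and line not in rules:
            rules.append(line)
    return rules


ref = "press one on your touch tone phone"
hyp = "press one on your touchtone phone"
pairs = compound_rules(ref, hyp)
print(pairs)
```

This only covers the case where the spaced form is in the first transcript; calling it again with the arguments swapped would cover the other direction.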

nishchalb commented 2 years ago

Hi, the synonym feature is intentionally conservative so that it does not capture equivalences that don't make sense. With your example rule, for instance, "sally's code" would be treated as equivalent to "sally is code".

But, if you would like to proceed with those examples, you could either:

huangruizhe commented 2 years ago

Thanks for the suggestion! Yeah, that should also work.