k2-fsa / k2

FSA/FST algorithms, differentiable, with PyTorch compatibility.
https://k2-fsa.github.io/k2
Apache License 2.0
1.08k stars, 211 forks

Use k2 to represent regex in decoding #1295

Open ZuoyunZheng opened 1 week ago

ZuoyunZheng commented 1 week ago

Hi all,

I'm trying to use k2 for OCR decoding.

I have a CTC OCR model. The strings I'm trying to recognize are so structured that I can describe them with regular expressions. The application requires a relatively small model and, more importantly, I have no in-domain training data, so I want to use regular expressions to make recognition more robust. So I figured my decoding graph would just be k2.compose(k2.ctc_topo(num_characters), regex_fsa), where I have to build an acceptor from the regex. I tried a minimal example with the regex \d{4} and it works, but I'm not sure about regexes such as (\d)+, where there are epsilon transitions.
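To make the \d{4} case concrete, here is a minimal sketch that writes the acceptor in k2's text format: one "src dst label score" arc per line, label -1 on the arc entering the final state, and the final state id on the last line. The helper name and the mapping of digits to token ids 1 to 10 are assumptions made for this illustration; id 0 is left free for the CTC blank/epsilon.

```python
def fixed_digits_fsa_text(n_digits, digit_ids):
    """Build a k2-style text description of an acceptor for exactly
    n_digits symbols, each drawn from digit_ids (hypothetical token ids)."""
    lines = []
    for state in range(n_digits):
        # One arc per allowed digit from each state to the next one.
        for token in digit_ids:
            lines.append(f"{state} {state + 1} {token} 0.0")
    final_state = n_digits + 1
    # k2 expects arcs into the final state to carry label -1.
    lines.append(f"{n_digits} {final_state} -1 0.0")
    # The last line names the final state.
    lines.append(str(final_state))
    return "\n".join(lines)


# Assumption: digits 0-9 are tokens 1..10 in the model's vocabulary.
text = fixed_digits_fsa_text(4, range(1, 11))

# With k2 installed, this text can be loaded with
# k2.Fsa.from_str(text, acceptor=True) and used as regex_fsa.
```

The resulting FSA has no epsilon arcs, which is why this case composes without extra cleanup.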

My questions are:

  1. Would this be the best way to incorporate a regex in decoding?
  2. Since I'm building the regex FSA from Python's re module myself, the FSA I generate is not minimized at all (due to epsilon loops for regex quantifiers such as *, + and ?). Is there a way for k2 to minimize the FSA for me?
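To illustrate the epsilon-loop issue in question 2, below is a pure-Python sketch (not k2 code) of epsilon removal applied to a Thompson-style automaton for (\d)+. The arc representation, the function name and the single label "d" standing in for the whole digit class are assumptions made for this example. k2's Python API includes k2.remove_epsilon, k2.determinize and k2.connect, which may cover most of this cleanup; whether the result is minimal is not verified here.

```python
from collections import defaultdict

EPS = None  # label used for epsilon arcs in this sketch


def remove_epsilons(arcs, start, finals):
    """Return (arcs, finals) of an equivalent epsilon-free acceptor.

    arcs is a list of (src, label, dst) tuples; label EPS marks epsilon.
    """
    eps_next = defaultdict(set)   # state -> states one epsilon arc away
    sym_arcs = defaultdict(set)   # state -> {(label, dst)} for real labels
    states = {start} | set(finals)
    for src, label, dst in arcs:
        states.update((src, dst))
        if label is EPS:
            eps_next[src].add(dst)
        else:
            sym_arcs[src].add((label, dst))

    def closure(state):
        # All states reachable from `state` using only epsilon arcs.
        seen, stack = {state}, [state]
        while stack:
            for nxt in eps_next[stack.pop()]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    new_arcs, new_finals = set(), set()
    for s in states:
        for q in closure(s):
            if q in finals:
                new_finals.add(s)          # s can reach a final state for free
            for label, dst in sym_arcs[q]:
                new_arcs.add((s, label, dst))  # pull q's arcs back to s

    # Drop states that are no longer reachable from the start state.
    reachable, stack = {start}, [start]
    while stack:
        cur = stack.pop()
        for src, _, dst in new_arcs:
            if src == cur and dst not in reachable:
                reachable.add(dst)
                stack.append(dst)

    kept = sorted(a for a in new_arcs if a[0] in reachable)
    return kept, new_finals & reachable


# (\d)+ : read one digit, then either loop back or move to the final state.
nfa_arcs = [(0, "d", 1), (1, EPS, 0), (1, EPS, 2)]
clean_arcs, clean_finals = remove_epsilons(nfa_arcs, start=0, finals={2})
```

For this input the three-state automaton with two epsilon arcs collapses to two states: a digit arc from 0 to 1, a digit self-loop on 1, and state 1 as the accepting state.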

I'm not that well versed in WFSTs, so I appreciate any input and suggestions.

Best Regards, Zuoyun Zheng.