Hi all,
I'm trying to use k2 for OCR decoding.
I have a CTC OCR model. The characters I'm trying to recognize are so highly structured that I can describe them with regular expressions. The application requires the model to be relatively small and, more importantly, I have no in-domain training data, so I want to use regular expressions to make recognition more robust. My plan is to build the decoding graph as `k2.compose(k2.ctc_topo(num_characters), regex_fsa)`, where `regex_fsa` is an acceptor that I build from the regex. I tried a minimal example with the regex `\d{4}` and it works, but I'm not sure about regexes such as `(\d)+`, whose FSAs contain epsilon transitions.
My questions are:
Would this be the best way to incorporate a regex into decoding?
Since I'm building the regex FSA myself from Python's `re`, the FSA I generate is not minimized at all (because of the epsilon loops for regex quantifiers such as `*`, `+`, and `?`). Is there a way for k2 to minimize the FSA for me?
I'm not that well versed in WFSTs, so I'd appreciate any input and suggestions.
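
To make the epsilon concern concrete, here is a minimal pure-Python sketch of one way the epsilon loops could be removed before the acceptor ever reaches k2: a textbook subset construction (epsilon-closure plus determinization) applied to a hand-written NFA for `(\d)+`. Everything here is illustrative. The function names, the NFA layout, and the mapping of digit `d` to label `d + 1` are my own assumptions, and the NFA is written by hand rather than derived from `re`. The sketch does not import k2; the final string is only meant to follow the text format that `k2.Fsa.from_str` reads as far as I understand it (one `src dst label score` line per arc, rows ordered by source state, a single final state with the largest id, entered only via label `-1`). The result is deterministic and epsilon-free, but the construction does not guarantee that it is minimal.

```python
from collections import deque

EPS = 0  # assumption: label 0 is epsilon (and the CTC blank), so real symbols start at 1


def eps_closure(states, arcs):
    """Return every state reachable from `states` through epsilon arcs only."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for label, dst in arcs.get(s, []):
            if label == EPS and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return frozenset(seen)


def determinize(arcs, start, finals):
    """Subset construction. Returns (dfa_arcs, dfa_finals, num_states); DFA start is 0."""
    start_set = eps_closure({start}, arcs)
    ids = {start_set: 0}
    queue = deque([start_set])
    dfa_arcs, dfa_finals = [], set()
    while queue:
        cur = queue.popleft()
        if cur & finals:
            dfa_finals.add(ids[cur])
        by_label = {}
        for s in cur:
            for label, dst in arcs.get(s, []):
                if label != EPS:
                    by_label.setdefault(label, set()).add(dst)
        for label in sorted(by_label):
            nxt = eps_closure(by_label[label], arcs)
            if nxt not in ids:
                ids[nxt] = len(ids)
                queue.append(nxt)
            dfa_arcs.append((ids[cur], ids[nxt], label))
    return dfa_arcs, dfa_finals, len(ids)


def to_k2_text(dfa_arcs, dfa_finals, num_states):
    """Emit 'src dst label score' rows plus one super-final state entered via label -1."""
    final = num_states
    rows = sorted(list(dfa_arcs) + [(s, final, -1) for s in dfa_finals])
    lines = [f"{s} {d} {l} 0.0" for s, d, l in rows]
    lines.append(str(final))
    return "\n".join(lines)


def accepts(dfa_arcs, dfa_finals, labels):
    """Run the DFA on a label sequence (pure-Python sanity check, no k2 involved)."""
    table = {(s, l): d for s, d, l in dfa_arcs}
    state = 0
    for l in labels:
        if (state, l) not in table:
            return False
        state = table[(state, l)]
    return state in dfa_finals


def digit_labels(text):
    """Hypothetical symbol table: digit d -> label d + 1, keeping 0 free for epsilon."""
    return [int(c) + 1 for c in text]


# Hand-built Thompson-style NFA for (\d)+ :
#   0 --digit--> 1,  1 --eps--> 0 (repeat),  1 --eps--> 2 (accept)
nfa_arcs = {
    0: [(label, 1) for label in range(1, 11)],
    1: [(EPS, 0), (EPS, 2)],
}
dfa_arcs, dfa_finals, num_states = determinize(nfa_arcs, start=0, finals={2})
text = to_k2_text(dfa_arcs, dfa_finals, num_states)
print(text)
```

If the format assumption holds, the printed string could presumably be loaded with `k2.Fsa.from_str` and arc-sorted before composing it with the CTC topology, but I have not verified that step here.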
Best Regards, Zuoyun Zheng.