Closed TinoDidriksen closed 3 years ago
@TinoDidriksen dot-hash-dot stands for word boundary symbol https://github.com/mhulden/foma/blob/master/foma/docs/simpleintro.md#word-boundaries-in-rewrite-rules
IIUC This code adds word boundary symbol to sigma if it doesn't contain it yet. Therefore I'd suggest to rename it to skip_merge_dothashdot
to skip_word_boundary_marker
The word-boundary marker has no special meaning in other regular expression constructs; however, any .#.-symbols found in a context specification will always be interpreted as a word boundary, and the symbol is removed from the alphabet at the end of replacement rule or context restriction compilation.
See also:
Word boundary marker is being inserted to alphabet in HFST as well when rules are applied
Renamed.
As about segfault related to sigmasize overflow, it looks like 30 magic number is arbitrary.
3 is used because there are exactly 3 special symbols on arcs
It may work for large enough sigmas until we hit even bigger sigma. Looking at its usage,
it looks like there's sigma->number
which exceeds 3 and may exceed 30. Therefore we may find maximum value of sigma->number
of both net1
and net2
and use that number instead of 3 or 30.
In other words, let's check if the following change does the trick:
sigmasizes = sigma_max(net1->sigma) + sigma_max(net2->sigma) + 3
If it is possible, it would be best to verify solution on some real world dataset which is known to fail with current Foma.
Also I've created related pull requests which attempt to fix this bug:
@TinoDidriksen Could you please fix merge conflicts?
Done.
@mhulden This pull request looks good to me. What do you think?
Not currently passing build checks. Fixable?
It had come out of sync with the HFST changes - try now.
This PR adds code to fascilitate HFST using Foma directly ( https://github.com/hfst/hfst/issues/485#issuecomment-713512736 & https://github.com/hfst/hfst/tree/foma ).
sigmasizes+3
tosigmasizes+30
is trivial - HFST just needs more memory sometimes, and this must be relevant for other users.But the bigger change is adding API
fsm_set_option()
andfsm_get_option()
because HFST needs to disable a piece of Foma's code. And I figured if I'm adding a way to set one option, I might as well make it extensible for potential future options.The option should maybe be renamed - I don't know what the code semantically does, so I named it dot-hash-dot because that's what it mechanically does.If you can see a better way to accomplish this then by all means do, but this works.