mhulden / foma

Automatically exported from code.google.com/p/foma
115 stars 90 forks source link

Add runtime options interface #116

Closed TinoDidriksen closed 3 years ago

TinoDidriksen commented 3 years ago

This PR adds code to fascilitate HFST using Foma directly ( https://github.com/hfst/hfst/issues/485#issuecomment-713512736 & https://github.com/hfst/hfst/tree/foma ).

sigmasizes+3 to sigmasizes+30 is trivial - HFST just needs more memory sometimes, and this must be relevant for other users.

But the bigger change is adding API fsm_set_option() and fsm_get_option() because HFST needs to disable a piece of Foma's code. And I figured if I'm adding a way to set one option, I might as well make it extensible for potential future options.

The option should maybe be renamed - I don't know what the code semantically does, so I named it dot-hash-dot because that's what it mechanically does.

If you can see a better way to accomplish this then by all means do, but this works.

AmbientLighter commented 3 years ago

@TinoDidriksen dot-hash-dot stands for word boundary symbol https://github.com/mhulden/foma/blob/master/foma/docs/simpleintro.md#word-boundaries-in-rewrite-rules IIUC This code adds word boundary symbol to sigma if it doesn't contain it yet. Therefore I'd suggest to rename it to skip_merge_dothashdot to skip_word_boundary_marker

The word-boundary marker has no special meaning in other regular expression constructs; however, any .#.-symbols found in a context specification will always be interpreted as a word boundary, and the symbol is removed from the alphabet at the end of replacement rule or context restriction compilation.

See also:

Word boundary marker is being inserted to alphabet in HFST as well when rules are applied

TinoDidriksen commented 3 years ago

Renamed.

AmbientLighter commented 3 years ago

As about segfault related to sigmasize overflow, it looks like 30 magic number is arbitrary. 3 is used because there are exactly 3 special symbols on arcs It may work for large enough sigmas until we hit even bigger sigma. Looking at its usage, it looks like there's sigma->number which exceeds 3 and may exceed 30. Therefore we may find maximum value of sigma->number of both net1 and net2 and use that number instead of 3 or 30.

In other words, let's check if the following change does the trick:

sigmasizes = sigma_max(net1->sigma) + sigma_max(net2->sigma) + 3

If it is possible, it would be best to verify solution on some real world dataset which is known to fail with current Foma.

AmbientLighter commented 3 years ago

Also I've created related pull requests which attempt to fix this bug:

AmbientLighter commented 3 years ago

@TinoDidriksen Could you please fix merge conflicts?

TinoDidriksen commented 3 years ago

Done.

AmbientLighter commented 3 years ago

@mhulden This pull request looks good to me. What do you think?

mhulden commented 3 years ago

Not currently passing build checks. Fixable?

TinoDidriksen commented 3 years ago

It had come out of sync with the HFST changes - try now.