mhulden / pyfoma

Python Finite-State Toolkit
Apache License 2.0

Equivalents for Foma’s DIR constraints #45

Open bluebear94 opened 3 weeks ago

bluebear94 commented 3 weeks ago

Is there an equivalent to specifying Foma’s DIR constraints (i.e. ||, \\, //, or \/) in PyFoma? I assume that it’s not implemented in PyFoma, but I want to make sure.

I am trying to use PyFoma to implement expanded deduplication rules for my constructed language Ŋarâþ Crîþ. Currently, these rules affect only a limited number of consonants, but they are set to expand to cover a wider range. If you’re curious, you can check my work-in-progress implementation.

(Incidentally, Foma’s support for parallel rules would also be useful to have for this.)

bluebear94 commented 1 week ago

Here’s a version of rewritten that implements the DIR constraints through a dir flag, spelled as simultaneous (for ||), forward (for //), backward (for \\), or outer (for \/):

import functools

from pyfoma import FST

def rewritten(fst: 'FST', *contexts, **flags) -> 'FST':
    """Returns a modified FST, rewriting fst in contexts in parallel, controlled by flags."""
    order = flags.get('dir', 'simultaneous')
    defs = {'crossproducts': fst}
    defs['br'] = FST.re("'@<@'|'@>@'")
    defs['aux'] = FST.re(". - ($br|#)", defs)
    defs['dotted'] = FST.re(".*-(.* '@<@' '@>@' '@<@' '@>@' .*)")
    defs['base'] = FST.re("$dotted @ # ($aux | '@<@' $crossproducts '@>@')* #", defs)
    if len(contexts) > 0:
        center = FST.re("'@<@' (.-'@>@')* '@>@'")
        if order == 'simultaneous':
            lrpairs = ([l.ignore(defs['br']), r.ignore(defs['br'])] for l,r in contexts)
            defs['rule'] = center.context_restrict(*lrpairs, rewrite=True).compose(defs['base'])
        elif order == 'outer':
            lrpairs = ([l.ignore(defs['br']), r.ignore(defs['br'])] for l,r in contexts)
            defs['rule'] = defs['base'].compose(center.context_restrict(*lrpairs, rewrite=True))
        else:
            contexts = tuple(contexts)
            lpairs = [[l.ignore(defs['br']), FST.re(".*")] for l, _ in contexts]
            rpairs = [[FST.re(".*"), r.ignore(defs['br'])] for _, r in contexts]
            left = center.__copy__().context_restrict(*lpairs, rewrite=True)
            right = center.context_restrict(*rpairs, rewrite=True)
            if order == 'forward':
                defs['rule'] = right.compose(defs['base']).compose(left)
            elif order == 'backward':
                defs['rule'] = left.compose(defs['base']).compose(right)
            else:
                raise TypeError(f"dir must be simultaneous, forward, backward, or outer (got {order})")
    else:
        defs['rule'] = defs['base']
    defs['remrewr'] = FST.re("'@<@':'' (.-'@>@')* '@>@':''") # worsener
    worseners = [FST.re(".* $remrewr (.|$remrewr)*", defs)]
    if flags.get('longest', False) == 'True':
        worseners.append(FST.re(".* '@<@' $aux+ '':('@>@' '@<@'?) $aux ($br:''|'':$br|$aux)* .*", defs))
    if flags.get('leftmost', False) == 'True':
        worseners.append(FST.re(
             ".* '@<@':'' $aux+ ('':'@<@' $aux* '':'@>@' $aux+ '@>@':'' .* | '':'@<@' $aux* '@>@':'' $aux* '':'@>@' .*)", defs))
    if flags.get('shortest', False) == 'True':
        worseners.append(FST.re(".* '@<@' $aux* '@>@':'' $aux+ '':'@>@' .*", defs))
    defs['worsen'] = functools.reduce(lambda x, y: x.union(y), worseners).determinize_unweighted().minimize()
    defs['rewr'] = FST.re("$^output($^input($rule) @ $worsen)", defs)
    final = FST.re("(.* - $rewr) @ $rule", defs)
    newfst = final.map_labels({s:'' for s in ['@<@','@>@','#']}).epsilon_remove().determinize_as_dfa().minimize()
    return newfst

This still doesn’t handle parallel rules, though, so I’m going to keep thinking about the problem.

By the way, do you have a link to a paper explaining the particular approach to rewrite rules used in PyFoma? I’d like to read more about it. I think it might be “A new method for compiling parallel replace rules” by Yli-Jyrä, but the “full technical report” mentioned there is no longer online.

Edit: I think I now understand what $dotted is for: it handles rewrite rules with empty inputs, so that $^rewrite('':a) turns xxx into axaxaxa and not something like aaaxaxaxa. I’m still confused about how the worseners work, though.
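
For reference, here is a rough pure-Python model of both ideas, written without PyFoma. The names insert_at_boundaries, candidates, worsen, and best are hypothetical helpers made up for illustration, and the worsener part is only my reading of the code above, assuming the base worsener removes at least one bracketed rewrite site from a valid candidate and that any candidate reachable this way is discarded:

```python
from itertools import combinations


def insert_at_boundaries(word, inserted):
    """Model of an empty-input rule: insert `inserted` exactly once per boundary.

    This is the behaviour $dotted seems to enforce by forbidding two
    consecutive empty bracket pairs at the same position.
    """
    return inserted.join([""] + list(word) + [""])


def candidates(word, src):
    """All ways to mark (bracket) occurrences of `src` in `word` for rewriting."""
    sites = [i for i, ch in enumerate(word) if ch == src]
    return {frozenset(c)
            for n in range(len(sites) + 1)
            for c in combinations(sites, n)}


def worsen(marking):
    """All markings obtained by dropping at least one marked site (assumed relation)."""
    items = sorted(marking)
    return {frozenset(c)
            for n in range(len(items))
            for c in combinations(items, n)}


def best(cands):
    """Keep candidates that are not a worsened version of another candidate."""
    worse = set().union(*(worsen(c) for c in cands))
    return cands - worse


print(insert_at_boundaries("xxx", "a"))
print(best(candidates("aaa", "a")))
```

Under this reading, a candidate that skips a site it could have rewritten is always a worsened version of some other candidate, which is what would make the rule obligatory. The longest, leftmost, and shortest worseners would then add further ways for a candidate to count as worse.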

Edit 2: I’ve found a bug in my implementation: $^rewrite2(a:b / _ a, dir=backward) with an input of aaa generates both aba and baa, while a -> b \\ _ a in Foma generates only aba.
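
To make the expected behaviour concrete, here is a small brute-force model in plain Python (no PyFoma) of an obligatory one-symbol rule under the four directions. The function apply_rule and the DIRECTIONS table are hypothetical names for illustration. The sketch assumes that each direction only decides whether the left and right contexts are checked on the input or the output string, and that a site must be rewritten exactly when its context holds:

```python
from itertools import product

# Assumed mapping: (side for left context, side for right context).
DIRECTIONS = {
    "simultaneous": ("input", "input"),    # ||
    "forward": ("output", "input"),        # //
    "backward": ("input", "output"),       # \\
    "outer": ("output", "output"),         # \/
}


def apply_rule(word, src, dst, left="", right="", direction="simultaneous"):
    """Enumerate outputs of the obligatory rule src -> dst / left _ right.

    `src` and `dst` must be single symbols so that input and output
    positions line up one-to-one.
    """
    if len(src) != 1 or len(dst) != 1:
        raise ValueError("this sketch only handles one-symbol rules")
    left_side, right_side = DIRECTIONS[direction]
    sites = [i for i, ch in enumerate(word) if ch == src]
    results = set()
    # Try every subset of sites and keep the self-consistent ones.
    for choice in product([False, True], repeat=len(sites)):
        rewritten = {p for p, chosen in zip(sites, choice) if chosen}
        out = "".join(dst if i in rewritten else ch
                      for i, ch in enumerate(word))
        sides = {"input": word, "output": out}
        consistent = all(
            (sides[left_side][:p].endswith(left)
             and sides[right_side][p + 1:].startswith(right))
            == (p in rewritten)
            for p in sites
        )
        if consistent:
            results.add(out)
    return sorted(results)


print(apply_rule("aaa", "a", "b", right="a", direction="simultaneous"))
print(apply_rule("aaa", "a", "b", right="a", direction="backward"))
```

In this model, baa is rejected for dir=backward because the second a is still followed by a on the output side but was left unrewritten, which matches Foma producing only aba.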