vgel / treebender

A HDPSG-inspired symbolic natural language parser written in Rust
MIT License

Morphology support #1

Open vgel opened 3 years ago

vgel commented 3 years ago

Right now terminal tokens have to be separate words. Treebender should be able to support morphological rules:

V[ stem: t ] -> walk
V[ stem: t ] -> talk
// stem: f to block walkedededededededed...
V[ tense: past, stem: f ] -> V[ stem: t ] ++ ed  // syntax TBD

Questions:

Todo:

vgel commented 3 years ago

One way to approach this would be to let grammar files define a token-splitting process that runs before parsing.

Something like:

$splitters = [
    /(.+)ed/ => [\1, -ed]
    /(.+)d/  => [\1, -ed] // for words like "baked"
    /(.+)s/  => [\1, -s]
    /(.+)es/ => [\1, -s]
]

Each word would then be run through every splitter that matches it, plus an implicit "no expansion" splitter, which turns a sentence into a set of possible morphological derivations:

    "The dogs walked to the beach and baked"
    "The dogs walk -ed to the beach and baked"
    "The dogs walke -ed to the beach and baked"
    "The dog -s walked to the beach and baked"
    "The dog -s walk -ed to the beach and baked"
    "The dog -s walke -ed to the beach and baked"
    "The dogs walked to the beach and bak -ed"
    "The dogs walk -ed to the beach and bak -ed"
    "The dogs walke -ed to the beach and bak -ed"
    "The dog -s walked to the beach and bak -ed"
    "The dog -s walk -ed to the beach and bak -ed"
    "The dog -s walke -ed to the beach and bak -ed"
    "The dogs walked to the beach and bake -ed"
    "The dogs walk -ed to the beach and bake -ed"
    "The dogs walke -ed to the beach and bake -ed"
    "The dog -s walked to the beach and bake -ed"
==> "The dog -s walk -ed to the beach and bake -ed"
    "The dog -s walke -ed to the beach and bake -ed"

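To make the expansion step concrete, here is a rough Rust sketch of how the derivations could be enumerated. Everything in it is an assumption for illustration, not existing treebender code: the `Splitter` struct, `split_word`, and `derivations` are hypothetical names, and plain suffix stripping stands in for the regex syntax proposed above so the sketch needs no external crates.

```rust
/// Hypothetical stand-in for a rule like `/(.+)ed/ => [\1, -ed]`:
/// strip `suffix` from the end of a word, then emit the stem plus `morpheme`.
struct Splitter {
    suffix: &'static str,
    morpheme: &'static str,
}

/// The four rules from the proposal, written as suffix rules (an assumption).
const SPLITTERS: &[Splitter] = &[
    Splitter { suffix: "ed", morpheme: "-ed" },
    Splitter { suffix: "d", morpheme: "-ed" },
    Splitter { suffix: "s", morpheme: "-s" },
    Splitter { suffix: "es", morpheme: "-s" },
];

/// Every way to tokenize a single word: the word itself (the implicit
/// "no expansion" splitter), then one alternative per matching rule.
fn split_word(word: &str) -> Vec<Vec<String>> {
    let mut alternatives = vec![vec![word.to_string()]];
    for splitter in SPLITTERS {
        if let Some(stem) = word.strip_suffix(splitter.suffix) {
            // `(.+)` requires at least one character before the suffix.
            if !stem.is_empty() {
                alternatives.push(vec![stem.to_string(), splitter.morpheme.to_string()]);
            }
        }
    }
    alternatives
}

/// All derivations of a sentence: the Cartesian product of the per-word
/// alternatives, each joined back into a space-separated string.
fn derivations(sentence: &str) -> Vec<String> {
    let mut results: Vec<Vec<String>> = vec![Vec::new()];
    for word in sentence.split_whitespace() {
        let alternatives = split_word(word);
        let mut next = Vec::with_capacity(results.len() * alternatives.len());
        for prefix in &results {
            for alternative in &alternatives {
                let mut tokens = prefix.clone();
                tokens.extend(alternative.iter().cloned());
                next.push(tokens);
            }
        }
        results = next;
    }
    results.into_iter().map(|tokens| tokens.join(" ")).collect()
}
```

With these rules, `derivations("The dogs walked")` returns six strings, starting with "The dogs walked" and ending with "The dog -s walke -ed". One thing the sketch surfaces: read literally, the `d` rule also splits "and" into "an -ed", so the full example sentence yields 36 candidates here rather than the 18 listed above.
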
Obviously this has the potential to blow up, since the number of derivations is the product of each word's alternatives, but we could also fail fast if a splitter generates a token that doesn't match any nonterminals in the grammar.
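
A minimal sketch of the fail-fast idea, under the same assumptions as the proposal: suffix rules stand in for the regexes, and `lexicon` is a hypothetical set of tokens the grammar can consume (how treebender would actually expose that set is not decided here). `viable_splits` and `derivation_count` are made-up names. Filtering each word's alternatives before combining them keeps the product small.

```rust
use std::collections::HashSet;

/// (suffix, emitted morpheme) pairs mirroring the proposed splitters.
const SPLITTERS: &[(&str, &str)] = &[("ed", "-ed"), ("d", "-ed"), ("s", "-s"), ("es", "-s")];

/// Alternatives for one word, keeping only those whose tokens are all in
/// `lexicon`. Unknown stems such as "walke" or "bak" are dropped right away.
fn viable_splits(word: &str, lexicon: &HashSet<&str>) -> Vec<Vec<String>> {
    // Start with the unsplit word (the implicit "no expansion" splitter).
    let mut candidates = vec![vec![word.to_string()]];
    for &(suffix, morpheme) in SPLITTERS {
        match word.strip_suffix(suffix) {
            Some(stem) if !stem.is_empty() => {
                candidates.push(vec![stem.to_string(), morpheme.to_string()]);
            }
            _ => {}
        }
    }
    // Fail fast: discard any alternative containing a token the grammar lacks.
    candidates.retain(|tokens| tokens.iter().all(|t| lexicon.contains(t.as_str())));
    candidates
}

/// Number of derivations left after pruning. Zero means some word has no
/// viable tokenization, so the sentence can be rejected before parsing.
fn derivation_count(sentence: &str, lexicon: &HashSet<&str>) -> usize {
    sentence
        .split_whitespace()
        .map(|word| viable_splits(word, lexicon).len())
        .product()
}
```

Given a lexicon containing only "The", "the", "dog", "walk", "bake", "to", "beach", "and", "-ed", and "-s", the example sentence is left with a single derivation, the one marked `==>` above.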