Open vgel opened 4 years ago
One way to approach this would actually be to just allow grammar files to define a token-splitting process that runs before parsing.
Something like:
$splitters = [
/^(.+)ed$/ => [\1, -ed],
/^(.+)d$/ => [\1, -ed], // for words like "baked"
/^(.+)s$/ => [\1, -s],
/^(.+)es$/ => [\1, -s],
]
Every splitter that matches a word would then apply, alongside an implicit "no expansion" splitter, so a sentence expands into every combination of its words' possible morphological derivations:
    "The dogs walked to the beach and baked"
    "The dogs walk -ed to the beach and baked"
    "The dogs walke -ed to the beach and baked"
    "The dog -s walked to the beach and baked"
    "The dog -s walk -ed to the beach and baked"
    "The dog -s walke -ed to the beach and baked"
    "The dogs walked to the beach and bak -ed"
    "The dogs walk -ed to the beach and bak -ed"
    "The dogs walke -ed to the beach and bak -ed"
    "The dog -s walked to the beach and bak -ed"
    "The dog -s walk -ed to the beach and bak -ed"
    "The dog -s walke -ed to the beach and bak -ed"
    "The dogs walked to the beach and bake -ed"
    "The dogs walk -ed to the beach and bake -ed"
    "The dogs walke -ed to the beach and bake -ed"
    "The dog -s walked to the beach and bake -ed"
==> "The dog -s walk -ed to the beach and bake -ed"
    "The dog -s walke -ed to the beach and bake -ed"
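To make the expansion concrete, here is a rough Python sketch of the idea, not Treebender code. `SPLITTERS`, `split_word`, and `derivations` are hypothetical names, and it assumes each pattern has to match the whole word. One thing it surfaces: taken literally, the `(.+)d` pattern also splits "and" into "an -ed", so this sketch produces 36 derivations for the example sentence rather than the 18 listed above.

```python
import re
from itertools import product

# Hypothetical stand-in for the $splitters table: (pattern, suffix token).
SPLITTERS = [
    (re.compile(r"(.+)ed"), "-ed"),
    (re.compile(r"(.+)d"), "-ed"),  # for words like "baked"
    (re.compile(r"(.+)s"), "-s"),
    (re.compile(r"(.+)es"), "-s"),
]


def split_word(word):
    """Return every split of `word`, starting with the implicit no-expansion option."""
    options = [(word,)]
    for pattern, suffix in SPLITTERS:
        match = pattern.fullmatch(word)  # assumption: the pattern must cover the whole word
        if match:
            option = (match.group(1), suffix)
            if option not in options:
                options.append(option)
    return options


def derivations(sentence):
    """Yield every combination of per-word splits as a space-joined token string."""
    per_word = [split_word(word) for word in sentence.split()]
    for combo in product(*per_word):
        yield " ".join(token for option in combo for token in option)


if __name__ == "__main__":
    for line in derivations("The dogs walked to the beach and baked"):
        print(line)
```

The cross product is what makes the count multiply: every extra word with k possible splits multiplies the number of derivations by k.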
Obviously this has the potential to blow up, since the number of derivations is the product of the per-word options, but we could fail fast by discarding any split that produces a token that doesn't match any nonterminals in the grammar.
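The fail-fast check could look something like the following Python sketch. Everything here is an assumption for illustration: `TERMINALS` is a made-up lexicon standing in for whatever tokens the grammar can actually match, `viable_splits` is a hypothetical helper, and lowercasing before lookup is just one possible way to handle the capitalized "The".

```python
import re
from itertools import product

# Hypothetical stand-in for the $splitters table: (pattern, suffix token).
SPLITTERS = [
    (re.compile(r"(.+)ed"), "-ed"),
    (re.compile(r"(.+)d"), "-ed"),
    (re.compile(r"(.+)s"), "-s"),
    (re.compile(r"(.+)es"), "-s"),
]

# Hypothetical lexicon: tokens the grammar is assumed to be able to match.
TERMINALS = {"the", "dog", "walk", "bake", "to", "beach", "and", "-ed", "-s"}


def viable_splits(word, terminals):
    """Return only the splits of `word` whose tokens are all known to the grammar."""
    candidates = [(word,)]  # implicit no-expansion option
    for pattern, suffix in SPLITTERS:
        match = pattern.fullmatch(word)
        if match:
            candidates.append((match.group(1), suffix))
    # Fail fast: drop any candidate containing a token the grammar cannot match.
    return [c for c in candidates if all(tok.lower() in terminals for tok in c)]


def derivations(sentence, terminals):
    """Combine the surviving per-word splits; empty if any word has no viable split."""
    per_word = [viable_splits(word, terminals) for word in sentence.split()]
    return [
        " ".join(tok for option in combo for tok in option)
        for combo in product(*per_word)
    ]


if __name__ == "__main__":
    print(derivations("The dogs walked to the beach and baked", TERMINALS))
```

With this lexicon, pruning happens per word before the cross product is taken, so splits like "walke -ed", "bak -ed", and "an -ed" never multiply into full sentences, and a word with no viable split rejects the whole sentence immediately.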
Right now terminal tokens have to be separate words. Treebender should be able to support morphological rules:
Questions:
Todo: