stickeritis / sticker

Succeeded by SyntaxDot: https://github.com/tensordot/syntaxdot

Add EditTreeEncoder #176

Closed danieldk closed 4 years ago

danieldk commented 4 years ago

This encoder encodes lemmas as edit trees that can be applied to a form (using Tobias Pütz' edit_tree crate).

Decoding consists of applying (or attempting to apply) the edit tree to a form.
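To make the encode/decode idea concrete, here is a minimal sketch of a toy edit tree. It is not the `edit_tree` crate's implementation or API: the type `ToyEditTree` and the functions `build`, `apply`, `build_from_strs` and `lemmatize` are hypothetical names for illustration. The sketch assumes the usual construction: find the longest common substring of form and lemma, keep it, and recurse on what is left on either side.

```rust
/// Toy edit tree over characters (illustrative; not the `edit_tree` crate's types).
#[derive(Debug, Clone, PartialEq, Eq)]
enum ToyEditTree {
    /// Keep the middle of the form; recurse on its first `pre` and last `suf` chars.
    Match {
        pre: usize,
        suf: usize,
        left: Box<ToyEditTree>,
        right: Box<ToyEditTree>,
    },
    /// Replace exactly `from` by `to`.
    Replace { from: Vec<char>, to: Vec<char> },
}

/// Longest common substring as (start in `a`, start in `b`, length).
fn longest_common_substring(a: &[char], b: &[char]) -> (usize, usize, usize) {
    let mut best = (0, 0, 0);
    let mut table = vec![vec![0usize; b.len() + 1]; a.len() + 1];
    for i in 0..a.len() {
        for j in 0..b.len() {
            if a[i] == b[j] {
                let len = table[i][j] + 1;
                table[i + 1][j + 1] = len;
                if len > best.2 {
                    best = (i + 1 - len, j + 1 - len, len);
                }
            }
        }
    }
    best
}

/// Encoding: derive a tree that rewrites `form` into `lemma`.
fn build(form: &[char], lemma: &[char]) -> ToyEditTree {
    let (fs, ls, len) = longest_common_substring(form, lemma);
    if len == 0 {
        return ToyEditTree::Replace { from: form.to_vec(), to: lemma.to_vec() };
    }
    ToyEditTree::Match {
        pre: fs,
        suf: form.len() - (fs + len),
        left: Box::new(build(&form[..fs], &lemma[..ls])),
        right: Box::new(build(&form[fs + len..], &lemma[ls + len..])),
    }
}

/// Decoding: returns `None` when the tree does not fit the form.
fn apply(tree: &ToyEditTree, form: &[char]) -> Option<Vec<char>> {
    match tree {
        ToyEditTree::Replace { from, to } => {
            if form == from.as_slice() { Some(to.clone()) } else { None }
        }
        ToyEditTree::Match { pre, suf, left, right } => {
            if pre + suf > form.len() {
                return None;
            }
            let mid_end = form.len() - suf;
            let mut out = apply(left, &form[..*pre])?;
            out.extend_from_slice(&form[*pre..mid_end]);
            out.extend(apply(right, &form[mid_end..])?);
            Some(out)
        }
    }
}

/// String convenience wrappers around `build` and `apply`.
fn build_from_strs(form: &str, lemma: &str) -> ToyEditTree {
    let (f, l): (Vec<char>, Vec<char>) = (form.chars().collect(), lemma.chars().collect());
    build(&f, &l)
}

fn lemmatize(tree: &ToyEditTree, form: &str) -> Option<String> {
    let chars: Vec<char> = form.chars().collect();
    apply(tree, &chars).map(|lemma| lemma.into_iter().collect())
}
```

In this sketch, a tree built from the pair "gelopen"/"lopen" strips a leading "ge", so it also maps "gewerkt" to "werkt". Applied to "liep" it returns `None`, which is the "attempting to" case: a predicted tree may simply not fit the form.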


This turned out to be easy to do. Random comments:

twuebi commented 4 years ago

Before merging, it would be nice if there was a release of edit_tree on crates.io, so that we do not have to rely on a git version. In edit_tree, the name TreeNode is very generic. Maybe it would be clearer to rename this type to EditTree? (Each subtree is also an edit tree of some sort.)

Sure.

Decoding consists of (attempting to) apply the edit tree to a form.

My first impression was that falling back to the form works better than falling back to other edit trees, though I didn't run exhaustive experiments.

When words are too long (longer than 40 characters by default), any character beyond the maximum length is cut off. For lemmatization, it would probably be nicer to do something more sophisticated (in a separate PR).

In my Rust seq2seq lemmatizer, I used the form as the lemma for words above a certain length. Most of these words (IIRC mostly in Lassy) were links or other non-inflected things. In other, less extreme cases, cutting out the middle of the word may be an option.
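The three ways of handling over-long words mentioned here can be sketched as below. This is only an illustration under assumptions: the function names `truncate_end`, `cut_middle` and `lemma_for_long_form` are hypothetical and do not come from sticker, and keeping the head and tail in roughly equal halves is just one possible way to "cut the middle".

```rust
/// Current behaviour as described: drop every character beyond `max_len`.
fn truncate_end(form: &str, max_len: usize) -> String {
    form.chars().take(max_len).collect()
}

/// Alternative: keep the start and the end of the word and drop the middle,
/// so that both prefixes and suffixes stay visible to the model.
fn cut_middle(form: &str, max_len: usize) -> String {
    let chars: Vec<char> = form.chars().collect();
    if chars.len() <= max_len {
        return form.to_owned();
    }
    let head = (max_len + 1) / 2; // the head gets the extra character when max_len is odd
    let tail = max_len - head;
    chars[..head]
        .iter()
        .chain(chars[chars.len() - tail..].iter())
        .collect()
}

/// Alternative: for very long forms (often links), skip prediction entirely
/// and use the form itself as the lemma. `None` means "decode as usual".
fn lemma_for_long_form(form: &str, max_len: usize) -> Option<String> {
    if form.chars().count() > max_len {
        Some(form.to_owned())
    } else {
        None
    }
}
```

With a maximum length of 4, `truncate_end("abcdefghij", 4)` keeps "abcd", while `cut_middle("abcdefghij", 4)` keeps "abij". Lengths are counted in characters rather than bytes so that non-ASCII words are not split inside a code point.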

danieldk commented 4 years ago

My first impression was that falling back to the form works better than falling back to other edit trees, though I didn't run exhaustive experiments.

That sounds reasonable! Right now nothing is done when the edit tree cannot be applied, which is probably the worst strategy (no lemma). I'll update the PR and make it configurable to do one of:

  1. Nothing
  2. Use the form
  3. Try the next label of the top-k labels (use the form when none works?)
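A minimal sketch of how these three options could be made configurable follows. The names `BackoffStrategy`, `decode_with_backoff` and `apply_suffix_rule` are hypothetical and not taken from the PR. The sketch also assumes an answer to the open question in option 3: when no top-k label applies, it uses the form. Labels are kept generic, and the suffix rule only stands in for a real edit tree.

```rust
/// What to do when the best-scoring label cannot be applied to the form.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum BackoffStrategy {
    /// Option 1: give up, no lemma.
    Nothing,
    /// Option 2: use the form as the lemma.
    Form,
    /// Option 3: try the remaining top-k labels, then (assumption) the form.
    NextLabel,
}

/// Decode a lemma from `top_k` labels, ordered best first. `apply` returns
/// `None` when a label does not fit the form.
fn decode_with_backoff<T>(
    form: &str,
    top_k: &[T],
    apply: impl Fn(&T, &str) -> Option<String>,
    strategy: BackoffStrategy,
) -> Option<String> {
    let best = top_k.first().and_then(|label| apply(label, form));
    match (best, strategy) {
        (Some(lemma), _) => Some(lemma),
        (None, BackoffStrategy::Nothing) => None,
        (None, BackoffStrategy::Form) => Some(form.to_owned()),
        (None, BackoffStrategy::NextLabel) => Some(
            top_k
                .iter()
                .skip(1)
                .find_map(|label| apply(label, form))
                .unwrap_or_else(|| form.to_owned()),
        ),
    }
}

/// Stand-in label for the example: replace the suffix `rule.0` by `rule.1`.
fn apply_suffix_rule(rule: &(&str, &str), form: &str) -> Option<String> {
    form.strip_suffix(rule.0)
        .map(|stem| format!("{}{}", stem, rule.1))
}
```

For the form "walked" with top-k rules `[("ing", ""), ("ed", "")]`, the best rule does not apply. `Nothing` then yields no lemma, `Form` yields "walked", and `NextLabel` yields "walk" from the second rule.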

In my Rust seq2seq lemmatizer, I used the form as the lemma for words above a certain length. Most of these words (IIRC mostly in Lassy) were links or other non-inflected things. In other, less extreme cases, cutting out the middle of the word may be an option.

Both seem reasonable!

twuebi commented 4 years ago

Before merging, it would be nice if there was a release of edit_tree on crates.io, so that we do not have to rely on a git version.

https://crates.io/crates/edit_tree