stelmath opened 1 year ago
Hello,

Your README states:

> You need to provide tokenized sentences to the model because otherwise the model will treat "awesome!" and "牛!" as single words and you will get "awesome!-牛!" as an aligned word pair. This follows the previous fast-align work (https://github.com/clab/fast_align).

Is this the case only for the training set and the optional evaluation set? During inference/prediction, do we also need to pass the source/target pair pretokenized? Your demo uses the example:

where the `!` is pretokenized, as there is a space between it and the previous word ("awesome" in this case). Also, does this requirement stem from the original mBERT, or is it a requirement of your implementation? Thank you!
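To illustrate the behavior the README warns about, here is a minimal sketch (with hypothetical example sentences): fast_align-style aligners define "words" by splitting on whitespace, so punctuation stays glued to the preceding token unless the input is pretokenized.

```python
# Hypothetical source sentence, untokenized vs. pretokenized.
raw = "this aligner is awesome!"
pretok = "this aligner is awesome !"

# Whitespace splitting keeps the "!" attached in the raw version,
# so "awesome!" would be aligned as a single word.
print(raw.split())     # ['this', 'aligner', 'is', 'awesome!']
print(pretok.split())  # ['this', 'aligner', 'is', 'awesome', '!']
```

Under whitespace splitting, only the pretokenized form yields "awesome" and "!" as separate alignable units.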