neulab / awesome-align

A neural word aligner based on multilingual BERT
https://arxiv.org/abs/2101.08231
BSD 3-Clause "New" or "Revised" License

Inputs should be tokenized only for training/evaluation sets? #54

Open stelmath opened 1 year ago

stelmath commented 1 year ago

Hello,

Your README states:

Inputs should be tokenized and each line is a source language sentence and its target language translation, separated by (|||). You can see some examples in the examples folder.

Is this the case only for the training set and the optional evaluation set? During inference/prediction, do we also need to pass the source/target pair pre-tokenized? Your demo uses the example:

src = 'awesome-align is awesome !'
tgt = '牛对齐 是 牛 !'

where the ! is pre-tokenized, as there is a space between it and the preceding word ("awesome" in this case). Also, does this requirement stem from the original mBERT, or is it specific to your implementation? Thank you!

zdou0830 commented 1 year ago

You need to provide tokenized sentences to the model because otherwise the model will treat "awesome!" and "牛!" as single words and you will get "awesome!-牛!" as an aligned word pair. This follows the previous fast-align work (https://github.com/clab/fast_align).
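
In practice this means running your own tokenizer over both sides before building the "source ||| target" lines described in the README. Below is a minimal sketch (not part of awesome-align itself); sacremoses for English and jieba for Chinese are just illustrative tokenizer choices, and any tokenizer that splits punctuation off words should work.

```python
# Minimal sketch: pre-tokenize a sentence pair and emit it in the
# "source ||| target" format that awesome-align expects.
# The tokenizers used here are illustrative assumptions, not requirements.
from sacremoses import MosesTokenizer  # pip install sacremoses
import jieba                           # pip install jieba

en_tok = MosesTokenizer(lang="en")

src = "awesome-align is awesome!"
tgt = "牛对齐是牛！"

# Join the token lists back into space-separated strings.
src_tok = " ".join(en_tok.tokenize(src, escape=False))
tgt_tok = " ".join(jieba.cut(tgt))

print(f"{src_tok} ||| {tgt_tok}")
# Expected shape (exact segmentation depends on the tokenizers):
# awesome-align is awesome ! ||| 牛对齐 是 牛 ！
```

Lines produced this way can then be written to a file and passed to the aligner exactly as in the README examples, for both training/evaluation data and inference input.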