neulab / awesome-align

A neural word aligner based on multilingual BERT
https://arxiv.org/abs/2101.08231
BSD 3-Clause "New" or "Revised" License

Inputs should be tokenized only for training/evaluation sets? #54

Open stelmath opened 1 year ago

stelmath commented 1 year ago

Hello,

Your README states:

Inputs should be tokenized and each line is a source language sentence and its target language translation, separated by (|||). You can see some examples in the examples folder.

Is this the case only for the training set and the optional evaluation set? During inference/prediction, do we also need to pass the source/target pair pre-tokenized? Your demo uses the example:

src = 'awesome-align is awesome !'
tgt = '牛对齐 是 牛 !'

where the ! is pre-tokenized, as there is a space between it and the preceding word ("awesome" in this case). Also, does this requirement stem from the original mBERT, or is it specific to your implementation? Thank you!

zdou0830 commented 1 year ago

You need to provide tokenized sentences to the model because otherwise the model will treat "awesome!" and "牛!" as single words and you will get "awesome!-牛!" as an aligned word pair. This follows the previous fast-align work (https://github.com/clab/fast_align).
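
In practice this means running your own tokenizer over both sides before building the "source ||| target" lines described in the README. Below is a minimal sketch (not part of awesome-align itself); sacremoses for English and jieba for Chinese are just illustrative tokenizer choices, and any tokenizer that splits punctuation off words should work.

```python
# Minimal sketch: pre-tokenize a sentence pair and emit it in the
# "source ||| target" format that awesome-align expects.
# The tokenizers used here are illustrative assumptions, not requirements.
from sacremoses import MosesTokenizer  # pip install sacremoses
import jieba                           # pip install jieba

en_tok = MosesTokenizer(lang="en")

src = "awesome-align is awesome!"
tgt = "牛对齐是牛！"

# Join the token lists back into space-separated strings.
src_tok = " ".join(en_tok.tokenize(src, escape=False))
tgt_tok = " ".join(jieba.cut(tgt))

print(f"{src_tok} ||| {tgt_tok}")
# Expected shape (exact segmentation depends on the tokenizers):
# awesome-align is awesome ! ||| 牛对齐 是 牛 ！
```

Lines produced this way can then be written to a file and passed to the aligner exactly as in the README examples, for both training/evaluation data and inference input.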