In QueryTokenizer.tensorize() and DocTokenizer.tensorize(), prepending '. ' to each string assumes that the period will always be tokenized into exactly one ID. This does not hold for every tokenizer; some emit extra IDs for this prefix, so the marker overwrite below corrupts the input. Specifically (in QueryTokenizer.tensorize()):
batch_text = ['. ' + x for x in batch_text]
and then
ids[:, 1] = self.Q_marker_token_id
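A minimal sketch of the failure mode, using a toy vocabulary rather than a real tokenizer (all IDs and the two tokenize functions are hypothetical): when the '. ' prefix maps to one ID, overwriting position 1 with the marker replaces the placeholder as intended; when it maps to two IDs (as with some SentencePiece-style tokenizers), the overwrite replaces only the first piece and a stray period ID survives in the sequence.

```python
# Hypothetical toy vocabulary; IDs are illustrative, not from any real model.
VOCAB = {"[CLS]": 101, ".": 119, "▁": 5, "what": 200, "is": 201, "colbert": 202}

def tokenize_one_id_period(text):
    # BERT-style assumption: the '. ' prefix becomes exactly one ID.
    return [VOCAB["[CLS]"]] + [VOCAB[t] for t in text.split()]

def tokenize_two_id_period(text):
    # SentencePiece-style sketch: the '. ' prefix becomes TWO pieces.
    ids = [VOCAB["[CLS]"]]
    for t in text.split():
        ids += [VOCAB["▁"], VOCAB["."]] if t == "." else [VOCAB[t]]
    return ids

Q_MARKER = 1
query = ". what is colbert"   # batch_text = ['. ' + x for x in batch_text]

safe = tokenize_one_id_period(query)
safe[1] = Q_MARKER    # ids[:, 1] = Q_marker_token_id: replaces the '.' as intended
broken = tokenize_two_id_period(query)
broken[1] = Q_MARKER  # replaces only the first of TWO prefix IDs;
                      # a stray '.' ID (119) survives at position 2
print(safe)           # [101, 1, 200, 201, 202]
print(broken)         # [101, 1, 119, 200, 201, 202]
```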