Closed ralphbrooks closed 5 years ago
@ralphbrooks this is incorrect – it splits an array of sentences into individual `InputExample`s. If you inspect an individual `InputExample`'s `text_a` parameter you will find the whole sentence, and you can then verify that the tokenizer (e.g., `tokenizer.tokenize(train_examples[0].text_a)`) correctly tokenizes the sentence.
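To illustrate the point, here is a minimal sketch using a stand-in for BERT's `InputExample` class (the real one lives in the BERT repo's `run_classifier.py`; the sentences here are made up). Each sentence maps to one example, and `text_a` holds the whole sentence, not characters:

```python
from collections import namedtuple

# Stand-in for BERT's InputExample (hypothetical; mirrors its fields).
InputExample = namedtuple("InputExample", ["guid", "text_a", "text_b", "label"])

sentences = ["The cat sat on the mat.", "BERT uses WordPiece tokens."]

# Splitting an array of sentences into individual InputExamples:
train_examples = [
    InputExample(guid=str(i), text_a=s, text_b=None, label="0")
    for i, s in enumerate(sentences)
]

print(train_examples[0].text_a)  # the full sentence, ready for the tokenizer
```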
In keras-bert.ipynb, I see the following:
When `text` is a string rather than a list of tokens, `" ".join(text)` splits it into individual characters separated by spaces. This in turn causes BERT to tokenize character by character instead of on whole or partial words.
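This is easy to reproduce in plain Python: `str.join` iterates over its argument, and iterating over a string yields one character at a time (the example string here is made up):

```python
text = "hello world"

# Joining a *string* iterates character by character:
as_chars = " ".join(text)
print(as_chars)  # 'h e l l o   w o r l d'

# Joining a *list of words* is what was presumably intended:
as_words = " ".join(text.split())
print(as_words)  # 'hello world'
```

So if the notebook passes a raw string where a token list is expected, the tokenizer sees space-separated characters and BERT's WordPiece step operates on them individually.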