" ".join(text) erroneously splits everything into characters

strongio / keras-bert

A simple technique to integrate BERT from tf hub to keras

" ".join(text) erroneously splits everything into characters #10

Closed: ralphbrooks closed this issue 5 years ago

ralphbrooks commented 5 years ago

In keras-bert.ipynb, I see the following:


def convert_text_to_examples(texts, labels):
    """Create InputExamples"""
    InputExamples = []
    for text, label in zip(texts, labels):
        InputExamples.append(
            InputExample(guid=None, text_a=" ".join(text), text_b=None, label=label)
        )
    return InputExamples

I believe that " ".join(text) actually splits the words into characters. This in turn causes BERT to tokenize character by character rather than by whole or partial word.

jacobzweig commented 5 years ago

@ralphbrooks this is incorrect: it splits an array of sentences into individual InputExamples. If you inspect an individual InputExample's text_a parameter you will find the whole sentence, and you can then see that running the tokenizer on it (e.g., tokenizer.tokenize(train_examples[0].text_a)) correctly tokenizes the sentence.
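
For anyone landing here with the same question, the disagreement seems to come down to what type text has when it reaches " ".join(text). The snippet below is a minimal sketch in plain Python, not code from the notebook: the sample sentence and variable names are made up for illustration, and it assumes each element of texts is either a bare string or a one-element sequence holding the sentence.

```python
# Hypothetical sample input, not taken from keras-bert.ipynb.
sentence = "the cat sat"

# Case 1: text is a bare string. str.join iterates over the string's
# characters, so a space is inserted between every character.
joined_from_string = " ".join(sentence)

# Case 2: text is a one-element sequence holding the whole sentence
# (assumed here to match what the reply above describes).
# join has nothing to separate, so the sentence comes back unchanged.
joined_from_row = " ".join([sentence])

# Case 3: text is a list of words. join rebuilds the sentence.
joined_from_words = " ".join(sentence.split())

print(repr(joined_from_string))
print(repr(joined_from_row))
print(repr(joined_from_words))
```

If text_a looks like the first case when you inspect your own examples, the input was probably passed as a flat list of strings rather than as rows wrapping one sentence each; checking type(texts[0]) before calling convert_text_to_examples should confirm which case applies.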