google-research / bert

TensorFlow code and pre-trained models for BERT
https://arxiv.org/abs/1810.04805
Apache License 2.0

Sentence Splitting Approach in BERT Preprocessing #1394

Open · AliHaiderAhmad001 opened 1 year ago

AliHaiderAhmad001 commented 1 year ago

Hi,

I am very impressed with your work on BERT.

Currently, I am reproducing the BERT model from scratch for educational purposes. I have finished building the model, but I have a question about preprocessing the data. Note that I am not using the same dataset; instead, I am using the IMDB dataset. I try to emulate your approach as closely as possible.

The case

I treat each review as a document and break each document down into sentences. Since the way the sentences are divided seems crucial, I've decided to take the following approach:

  1. In 10 percent of cases, the maximum possible number of words is taken (256 words).
  2. In 80 percent of cases, the text is split on `.`, `!`, `;`, or `?`.
  3. In 10 percent of cases, the text is split at random points.

    import random

    def split_sentences(text, delimiters=".!?;", max_words=250):
        # Draw a single random number so the three branches are mutually
        # exclusive and the split really is 10% / 80% / 10% (independent
        # random.random() checks would skew the ratios and could fall
        # through all branches and return None).
        r = random.random()
        if r < 0.1:
            # 10% of cases: split by maximum word count
            return split_text_by_maximum_word_count(text, max_words)
        elif r < 0.9:
            # 80% of cases: split on common punctuation marks
            return split_text_by_punctuation_marks(text, delimiters, max_words)
        else:
            # 10% of cases: split at random points
            return random_splitting(text, max_words)
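
For context, here is a minimal, runnable sketch of the three helpers the snippet above calls. The function names come from my snippet, but the bodies below are only one plausible way to implement them (each is assumed to return a list of sentence strings):

    import random
    import re

    def split_text_by_maximum_word_count(text, max_words):
        # Chunk the text into consecutive runs of at most max_words words.
        words = text.split()
        return [" ".join(words[i:i + max_words])
                for i in range(0, len(words), max_words)]

    def split_text_by_punctuation_marks(text, delimiters, max_words):
        # Split on sentence-ending punctuation, keeping each delimiter
        # attached to the sentence it terminates.
        parts = re.split("([" + re.escape(delimiters) + "])", text)
        sentences = ["".join(p).strip() for p in zip(parts[0::2], parts[1::2])]
        if parts[-1].strip():
            sentences.append(parts[-1].strip())  # trailing text with no delimiter
        # Fall back to a hard word-count split for over-long sentences.
        result = []
        for s in sentences:
            if len(s.split()) > max_words:
                result.extend(split_text_by_maximum_word_count(s, max_words))
            elif s:
                result.append(s)
        return result

    def random_splitting(text, max_words):
        # Cut the text at random word boundaries, each piece at most
        # max_words words long.
        words, pieces, start = text.split(), [], 0
        while start < len(words):
            step = random.randint(1, max_words)
            pieces.append(" ".join(words[start:start + step]))
            start += step
        return pieces

For example, `split_sentences("Great film. A bit long, though! Would I rewatch it?")` returns one of the three segmentations, depending on the draw.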

The question:

I would like to know whether my approach is wrong. How did you separate sentences in your approach?

Thanks