Hi,

I am very impressed with your work on BERT.

Currently I am reproducing the BERT model from scratch for educational purposes. I have finished building the model, but I have a question about preprocessing the data. Note that I am not using the same dataset; instead, I am using the IMDB dataset. I am trying to emulate your approach as closely as possible.
The case:
I treat each review as a document and break each document down into sentences. Since the way the sentences are divided seems crucial, I've decided to take the following approach:
In 10 percent of cases, the maximum possible number of words is taken (256 words).
In 80 percent of cases, the text is split on ., !, ; or ?.
In 10 percent of cases, the text is split at random positions.
import random

def split_sentences(text, delimiters=".!?;", max_words=256):
    # Draw a single random number so the three branches are mutually
    # exclusive and really cover 10% / 80% / 10% of cases.
    r = random.random()
    # 10% of cases: split by the maximum word count
    if r < 0.1:
        return split_text_by_maximum_word_count(text, max_words)
    # 80% of cases: split on common punctuation marks
    if r < 0.9:
        return split_text_by_punctuation_marks(text, delimiters, max_words)
    # Remaining 10% of cases: split at random positions
    return random_splitting(text, max_words)
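
For completeness, here is a minimal, simplified sketch of the three helpers referenced above, just to show the splitting logic I have in mind for each case (my actual implementations handle the same cases, but the bodies below are only illustrative):

import random
import re

def split_text_by_maximum_word_count(text, max_words):
    # Cut the text into consecutive chunks of at most max_words words.
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

def split_text_by_punctuation_marks(text, delimiters, max_words):
    # Split after any of the delimiter characters, then fall back to the
    # word-count split for any piece still longer than max_words words.
    pattern = r"(?<=[{}])\s+".format(re.escape(delimiters))
    pieces = [p.strip() for p in re.split(pattern, text) if p.strip()]
    sentences = []
    for piece in pieces:
        if len(piece.split()) > max_words:
            sentences.extend(split_text_by_maximum_word_count(piece, max_words))
        else:
            sentences.append(piece)
    return sentences

def random_splitting(text, max_words):
    # Break the text at random word boundaries, each chunk at most max_words words.
    words = text.split()
    sentences, start = [], 0
    while start < len(words):
        length = random.randint(1, max_words)
        sentences.append(" ".join(words[start:start + length]))
        start += length
    return sentences

All three return a plain list of sentence strings, so split_sentences behaves the same way regardless of which branch fires.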
The question:
I would like to know whether my approach is reasonable. How did you split documents into sentences in your preprocessing?
Thanks