google-research / bert

TensorFlow code and pre-trained models for BERT
https://arxiv.org/abs/1810.04805
Apache License 2.0

Sentence Splitting Approach in BERT Preprocessing #1394

Open · AliHaiderAhmad001 opened 1 year ago

AliHaiderAhmad001 commented 1 year ago

Hi,

I am very impressed with your work on BERT.

Currently, I am reproducing the BERT model from scratch for educational purposes. I have finished building the model, but I have a question about preprocessing the data. Note that I am not using the same dataset; instead, I am using the IMDB dataset. I try to emulate your approach as closely as possible.

The case

I treat each review as a document and break each document down into sentences. Since the way the sentences are divided seems crucial, I've decided to take the following approach:

  1. In 10 percent of cases, the maximum possible number of words is taken (256 words).
  2. In 80 percent of cases, the text is split on `.`, `!`, `;`, or `?`.
  3. In 10 percent of cases, the text is split at random points.

    import random

    def split_sentences(text, delimiters=".!?;", max_words=250):
        # Draw a single random number so the three branches are mutually
        # exclusive and the split really is 10% / 80% / 10% (independent
        # random.random() checks would skew the ratios and could fall
        # through all branches and return None).
        r = random.random()
        if r < 0.1:
            # 10% of cases: split by maximum word count
            return split_text_by_maximum_word_count(text, max_words)
        elif r < 0.9:
            # 80% of cases: split on common punctuation marks
            return split_text_by_punctuation_marks(text, delimiters, max_words)
        else:
            # 10% of cases: split at random points
            return random_splitting(text, max_words)
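
For context, here is a minimal, runnable sketch of the three helpers the snippet above calls. The function names come from my snippet, but the bodies below are only one plausible way to implement them (each is assumed to return a list of sentence strings):

    import random
    import re

    def split_text_by_maximum_word_count(text, max_words):
        # Chunk the text into consecutive runs of at most max_words words.
        words = text.split()
        return [" ".join(words[i:i + max_words])
                for i in range(0, len(words), max_words)]

    def split_text_by_punctuation_marks(text, delimiters, max_words):
        # Split on sentence-ending punctuation, keeping each delimiter
        # attached to the sentence it terminates.
        parts = re.split("([" + re.escape(delimiters) + "])", text)
        sentences = ["".join(p).strip() for p in zip(parts[0::2], parts[1::2])]
        if parts[-1].strip():
            sentences.append(parts[-1].strip())  # trailing text with no delimiter
        # Fall back to a hard word-count split for over-long sentences.
        result = []
        for s in sentences:
            if len(s.split()) > max_words:
                result.extend(split_text_by_maximum_word_count(s, max_words))
            elif s:
                result.append(s)
        return result

    def random_splitting(text, max_words):
        # Cut the text at random word boundaries, each piece at most
        # max_words words long.
        words, pieces, start = text.split(), [], 0
        while start < len(words):
            step = random.randint(1, max_words)
            pieces.append(" ".join(words[start:start + step]))
            start += step
        return pieces

For example, `split_sentences("Great film. A bit long, though! Would I rewatch it?")` returns one of the three segmentations, depending on the draw.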

The question:

I would like to know whether my approach is wrong. How did you separate sentences in your approach?

Thanks