utterworks / fast-bert

Super easy library for BERT based NLP models
Apache License 2.0

BertLMDataBunch.from_raw_corpus : `ValueError: num_samples should be a positive integer value, but got num_samples=0` #181

Open Q-lds opened 4 years ago

Q-lds commented 4 years ago

I am trying to fine-tune a model, but I am encountering a ValueError when creating the DataBunch from the raw corpus.

With the following synthetic data:

text_list = ['Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.',
             'Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in',
             'reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.'
             ]

databunch_lm = BertLMDataBunch.from_raw_corpus(
    data_dir=DATA_PATH,
    text_list=text_list,
    tokenizer='bert-base-uncased',
    batch_size_per_gpu=16,
    max_seq_length=128,
    multi_gpu=True, 
    model_type='bert',
    logger=logger)

I get the following ValueError:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<timed exec> in <module>

~/envs/my_env/lib/python3.7/site-packages/fast_bert/data_lm.py in from_raw_corpus(data_dir, text_list, tokenizer, batch_size_per_gpu, max_seq_length, multi_gpu, test_size, model_type, logger, clear_cache, no_cache)
    198             logger=logger,
    199             clear_cache=clear_cache,
--> 200             no_cache=no_cache,
    201         )
    202 

~/envs/my_env/lib/python3.7/site-packages/fast_bert/data_lm.py in __init__(self, data_dir, tokenizer, train_file, val_file, batch_size_per_gpu, max_seq_length, multi_gpu, model_type, logger, clear_cache, no_cache)
    275             self.train_batch_size = self.batch_size_per_gpu * max(1, self.n_gpu)
    276 
--> 277             train_sampler = RandomSampler(train_dataset)
    278             self.train_dl = DataLoader(
    279                 train_dataset, sampler=train_sampler, batch_size=self.train_batch_size

~/envs/my_env/lib/python3.7/site-packages/torch/utils/data/sampler.py in __init__(self, data_source, replacement, num_samples)
     92         if not isinstance(self.num_samples, int) or self.num_samples <= 0:
     93             raise ValueError("num_samples should be a positive integer "
---> 94                              "value, but got num_samples={}".format(self.num_samples))
     95 
     96     @property

ValueError: num_samples should be a positive integer value, but got num_samples=0

The intermediate files lm_train.txt and lm_val.txt are created, so I suspect something is going wrong at the level of the tokenizer.

My environment has Python 3.7.6 and contains:

pytorch-lamb              1.0.0                    pypi_0    pypi
torch                     1.4.0                    pypi_0    pypi
torchvision               0.5.0                    pypi_0    pypi
fast-bert                 1.6.2                    pypi_0    pypi
tokenizers                0.5.2                    pypi_0    pypi
transformers              2.5.1                    pypi_0    pypi

Anyway, let me know if you need any further information from my side!

ddofer commented 4 years ago

I have the same error. I am trying to run the LM example (language model training from scratch) on a new dataset and hit the same ValueError.

Q-lds commented 4 years ago

So, in my case the issue was that the text was simply too small.

Basically, the loop at line 137 in fast_bert/data_lm.py

            while len(tokenized_text) >= block_size:  # Truncate in block of block_size

                self.examples.append(
                    tokenizer.build_inputs_with_special_tokens(
                        tokenized_text[:block_size]
                    )
                )
                tokenized_text = tokenized_text[block_size:]

never executes if len(tokenized_text) is smaller than the given block_size, so self.examples stays empty and the downstream RandomSampler sees a dataset with zero samples, which raises exactly this ValueError.

Bear in mind the process may also take a really long time, since it runs on a single core. In my case it ended up being 14 hours :D
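
If you are unsure whether your corpus clears that threshold, a quick token count can tell you up front. This is only a sketch: it assumes the bert-base-uncased tokenizer and that, as in the loop above, the corpus is tokenized as one stream and cut into chunks of block_size; block_size=128 is an illustrative value matching the max_seq_length from the original post, since the library sets the real value internally.

from transformers import BertTokenizer

block_size = 128  # illustrative value; the library sets block_size internally
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Tokenize the whole corpus as one stream, the way the loop above consumes it.
total_tokens = sum(len(tokenizer.tokenize(text)) for text in text_list)

if total_tokens < block_size:
    print(f'Corpus too small: {total_tokens} tokens < block_size={block_size}, '
          f'so no training examples will be produced.')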

walterwsmf commented 4 years ago

I used a large raw corpus and got the same error. I tested the same corpus with run_language_modeling.py from the transformers library and hit the same error there. My solution was to set the block size equal to my maximum sentence length; in this case, I was using 128:

!python run_language_modeling.py \
    --train_data_file=/home/ubuntu/data/sedi1_full.txt \
    --output_dir=./tmp/ \
    --model_type=bert \
    --model_name_or_path=/home/ubuntu/tmp/bertcase_torch/ \
    --mlm \
    --block_size=128 \
    --do_train \
    --eval_all_checkpoints \
    --save_steps=100000

I didn't find how to set the block_size on the fast-bert language DataBunch. For now, I am going to use the transformers solution.

ninasujit2016 commented 4 years ago

I had the same problem, but I just replaced f.write(text) with f.write(text + '\n') in data_lm.py, and then it worked.
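
For context, the reported change amounts to writing each document on its own line when the intermediate lm_train.txt / lm_val.txt files are dumped. A rough sketch with illustrative names, not the exact upstream code:

from pathlib import Path

data_dir = Path('data')  # illustrative location for the intermediate files
train_texts = ['first document ...', 'second document ...']  # illustrative content

with open(data_dir / 'lm_train.txt', 'w') as f:
    for text in train_texts:
        # The reported fix: append '\n' so each document lands on its own line.
        f.write(text + '\n')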

krannnn commented 4 years ago

> I had the same problem, but I just replaced f.write(text) with f.write(text + '\n') in data_lm.py, and then it worked.

@ninasujit2016 what if I change this and still get the same error?

joshcx commented 4 years ago

Facing the same error. Any fixes yet?

godefv commented 3 years ago

Clear your cache! This function silently uses the cache if it is available, totally ignoring the data you pass as input. In my case, creating the whole dataset was too slow, so I tried passing just a few lines of text, which created an empty dataset in my cache (because a few lines of text is too small). After that I got this error no matter what data I used, until I cleared the cache.

I strongly recommend activating 'INFO' logging, as follows, so that you can see whether the function uses the cache or not.

import logging

# 'logger' is the same logger instance that is passed to from_raw_corpus
logger.setLevel('INFO')
consoleHandler = logging.StreamHandler()
consoleHandler.setLevel(logging.INFO)
logger.addHandler(consoleHandler)

By the way, I consider this a bug: calling BertLMDataBunch.from_raw_corpus should never read from the cache.
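
For what it's worth, the from_raw_corpus signature shown in the traceback above includes clear_cache and no_cache parameters, so forcing a rebuild of the cached features should also work around a stale cache. A sketch reusing the original call, with clear_cache=True as the only change:

databunch_lm = BertLMDataBunch.from_raw_corpus(
    data_dir=DATA_PATH,
    text_list=text_list,
    tokenizer='bert-base-uncased',
    batch_size_per_gpu=16,
    max_seq_length=128,
    multi_gpu=True,
    model_type='bert',
    logger=logger,
    clear_cache=True,  # rebuild the cached features instead of silently reusing them
)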