Open Q-lds opened 4 years ago
I have the same error, trying to run the LM example (language model training from scratch) on a new dataset. In my case, the issue was that the text was too small.
Basically, line 137 in fast_bert/data_lm.py:

```python
while len(tokenized_text) >= block_size:  # Truncate in blocks of block_size
    self.examples.append(
        tokenizer.build_inputs_with_special_tokens(
            tokenized_text[:block_size]
        )
    )
    tokenized_text = tokenized_text[block_size:]
```

never executes if len(tokenized_text) is smaller than the given block_size, so self.examples stays empty.
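A self-contained toy reproduction of that loop (the tokenizer call is replaced by the identity, so the sketch runs without fast_bert) shows the failure mode:

```python
# Toy version of the truncation loop in fast_bert/data_lm.py.
# "tokenized_text" stands in for the real token-id list; the call to
# tokenizer.build_inputs_with_special_tokens is dropped for simplicity.
def build_examples(tokenized_text, block_size):
    examples = []
    while len(tokenized_text) >= block_size:  # truncate in blocks of block_size
        examples.append(tokenized_text[:block_size])
        tokenized_text = tokenized_text[block_size:]
    return examples

print(len(build_examples(list(range(1000)), 128)))  # 7 full blocks
print(len(build_examples(list(range(50)), 128)))    # 0 -> empty dataset, ValueError later
```

With fewer tokens than one block, the loop body is never entered and the dataset ends up empty, which is what triggers the ValueError downstream.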
Bear in mind the process may also take a really long time, since it runs on a single core. In my case it ended up being 14 hours :D
I used a large raw corpus and got the same error. I tested the same raw corpus with run_language_modeling.py from the transformers library and got the same error there. My solution was to set the block_size equal to my maximum sentence length; in this case, 128:

```shell
!python run_language_modeling.py \
    --train_data_file=/home/ubuntu/data/sedi1_full.txt \
    --output_dir=./tmp/ \
    --model_type=bert \
    --model_name_or_path=/home/ubuntu/tmp/bertcase_torch/ \
    --mlm \
    --block_size=128 \
    --do_train \
    --eval_all_checkpoints \
    --save_steps=100000
```
I didn't find how to set the block_size on the fast_bert language databunch. For now, I am going to use the transformers solution.
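If you want to pick a block_size that actually fits your corpus, here is a rough stdlib-only sketch. It uses whitespace splitting as a stand-in for the real subword tokenizer, so the counts are approximate (subword tokenization usually produces more tokens, so treat the result as a lower bound); the function name and the 512 cap are my own choices, not part of any library:

```python
# Rough estimate of a usable block_size: length of the longest line in the
# corpus, measured in whitespace tokens (an approximation of subword counts),
# capped at a typical BERT maximum of 512.
def estimate_block_size(lines, cap=512):
    longest = max((len(line.split()) for line in lines), default=0)
    return min(longest, cap)

corpus = ["a short sentence", "a somewhat longer training sentence here"]
print(estimate_block_size(corpus))  # 6
```

Picking a block_size no larger than this estimate makes it much less likely that the truncation loop produces an empty dataset.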
I had the same problem, but I just replaced f.write(text) with f.write(text+'\n') in data_lm.py, and then it was OK.
@ninasujit2016 what if I change this and still get the same error?
Facing the same error. Any fixes yet?
Clear your cache! This function silently uses the cache if available, totally ignoring the data you pass as input. In my case, creating the whole dataset was too slow, so I tried to pass just a few lines of text, which created an empty dataset in my cache (because only a few lines of text is too small). Then I got this error whatever data I used, until I cleared the cache.
I strongly recommend activating 'info' logging, as follows, so that you can see whether the function uses the cache or not:
```python
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)
consoleHandler = logging.StreamHandler()
consoleHandler.setLevel(logging.INFO)
logger.addHandler(consoleHandler)
```
By the way, I consider this a bug: calling BertLMDataBunch.from_raw_corpus should never read from the cache.
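For anyone who wants to clear the cache programmatically, a minimal sketch; the `cached_*` file-name pattern is an assumption based on how the transformers example scripts name their cached datasets, so list your data directory to confirm the actual names before relying on it:

```python
import glob
import os

# Remove cached dataset files so the databunch is rebuilt from the raw text.
# The "cached_*" glob pattern is an assumption; check your data_dir first.
def clear_lm_cache(data_dir):
    removed = []
    for path in glob.glob(os.path.join(data_dir, "cached_*")):
        os.remove(path)
        removed.append(path)
    return removed
```

Run it on the directory where the databunch stores its intermediate files, then recreate the databunch; with 'info' logging enabled you should see it rebuilding instead of loading from cache.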
I am trying to fine tune a model, but I am encountering a ValueError when creating the dataBunch from the raw corpus.
With the following synthetic data:
I get the following ValueError:
The intermediate files lm_train.txt and lm_val.txt are created, so I suspect something is going wrong at the level of the tokenizer.
My env has Python 3.7.6 and contains
Anyway, let me know if you need any further information from my side!