Closed aminatadjer closed 3 years ago
Hi, I can't reproduce the error. This error message indicates that a None value appeared during tokenizer processing. Could you please check that you are running the newest version of the code with pytorch >= 1.4.0 and transformers >= 2.5.0 and <= 4.0.0?
I have the same error
I have solved this bug. After updating transformers from 2.5.0 to 4.0.0, it works.
I reproduced this bug in the text-to-code task. In my case, updating transformers does not help.
[Quick Solution]
1. Check the folder save/concode
2. Remove all files whose names begin with "train_blocksize_"
3. Rerun the training script
[Explanation]
I dug into the details of the procedure and found that the problem lies in the preprocessing step, specifically in reusing the preprocessing cache files in "save/concode".
If an error occurs while the preprocessed files are first being generated, for example if the preprocessing step is interrupted by accident, you end up with 2 (with the default parameters) corrupt files in "save/concode". If you do not remove them, you will keep getting errors when "data = [self.dataset[idx] for idx in possibly_batched_index]" is called from __getitem__ at line 116 of "/home/at5262/CodeXGLUE/Text-Code/text-to-code/code/dataset.py".
If you remove these corrupt files and let the preprocessing step rerun to completion, the correct preprocessed files will be generated and the whole script runs normally.
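The removal step above can be scripted. Here is a minimal sketch; the cache directory and the "train_blocksize_" prefix are taken from the comment above, but the helper function itself is hypothetical, not part of the CodeXGLUE code:

```python
import glob
import os

def clear_stale_cache(cache_dir="../save/concode", prefix="train_blocksize_"):
    """Delete partially written preprocessing caches so training rebuilds them.

    Adjust cache_dir if your --output_dir differs from ../save/concode.
    Returns the list of removed paths.
    """
    removed = []
    for path in glob.glob(os.path.join(cache_dir, prefix + "*")):
        os.remove(path)
        removed.append(path)
    return removed
```

After clearing the cache, rerun the training script so the preprocessing step regenerates the files from scratch.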
Hi, I am trying to run CodeGPT (for the text-to-code task). I followed exactly the same steps, but I am getting this error: Traceback (most recent call last): File "run.py", line 653, in <module>
main()
File "run.py", line 640, in main
global_step, tr_loss = train(args, train_dataset, model, tokenizer, fh, pool)
File "run.py", line 165, in train
for step, (batch, token_labels) in enumerate(train_dataloader):
File "/home/at5262/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
data = self._next_data()
File "/home/at5262/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 385, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "/home/at5262/anaconda3/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/at5262/anaconda3/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/at5262/CodeXGLUE/Text-Code/text-to-code/code/dataset.py", line 116, in __getitem__
return torch.tensor(self.inputs[item]), torch.tensor(self.token_labels[item])
RuntimeError: Could not infer dtype of NoneType
Traceback (most recent call last):
File "/home/at5262/anaconda3/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/at5262/anaconda3/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/at5262/anaconda3/lib/python3.7/site-packages/torch/distributed/launch.py", line 263, in <module>
main()
File "/home/at5262/anaconda3/lib/python3.7/site-packages/torch/distributed/launch.py", line 259, in main
cmd=cmd)
subprocess.CalledProcessError: Command '['/home/at5262/anaconda3/bin/python', '-u', 'run.py', '--local_rank=0', '--data_dir=../dataset/concode', '--langs=java', '--output_dir=../save/concode', '--pretrain_dir=microsoft/CodeGPT-small-java-adaptedGPT2', '--log_file=text2code_concode.log', '--model_type=gpt2', '--block_size=512', '--do_train', '--node_index', '0', '--gpu_per_node', '1', '--learning_rate=5e-5', '--weight_decay=0.01', '--evaluate_during_training', '--per_gpu_train_batch_size=6', '--per_gpu_eval_batch_size=12', '--gradient_accumulation_steps=2', '--num_train_epochs=30', '--logging_steps=100', '--save_steps=5000', '--overwrite_output_dir', '--seed=42']' returned non-zero exit status 1.
I tried to run it on Colab as well, and I am getting the same error.
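For what it's worth, the `RuntimeError: Could not infer dtype of NoneType` in the traceback means `torch.tensor()` was handed a None, i.e. some cached example was never filled in. A sanity check like the one below (a hypothetical helper, not part of run.py) can be run on the loaded cache before training to fail fast with a clearer message:

```python
def check_cached_features(inputs, token_labels):
    """Raise a descriptive error if any cached example is None.

    torch.tensor(None) fails with "Could not infer dtype of NoneType",
    so catching bad entries here points directly at the corrupt cache.
    """
    for i, (inp, lab) in enumerate(zip(inputs, token_labels)):
        if inp is None or lab is None:
            raise ValueError(
                f"Cached example {i} is None; the preprocessing cache is "
                "likely corrupt. Delete the train_blocksize_* files under "
                "your output dir and rerun preprocessing."
            )
```

If this check fires, the fix is the cache-clearing procedure described earlier in the thread.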