Open countback opened 3 years ago
Hi @countback, could you double-check that you are using the latest code here (and rebuild)?
https://github.com/luyug/Dense/blob/319ef30e3efa6658fdbffce4f2844998d86b75ac/src/dense/data.py#L30
We have changed `data_files` to `data_dir`. The traceback still seems to be using `data_files`.
Yes, I'm using the latest code, and the same error happens.
Run `tree marco` and paste the output here. Maybe there's a problem with the data directory's structure.
I am running the `msmarco-passage-ranking` example and see the same error. I have run `pip uninstall dense && pip install --editable .` and still see the error.
My env info
torch==1.9.0
datasets==1.11.0
faiss-cpu==1.7.1.post2
transformers==4.10.0
My marco directory looks like
marco
├── bert
│   ├── corpus
│   │   ├── split00.json
│   │   ├── split01.json
│   │   ├── split02.json
│   │   ├── split03.json
│   │   ├── split04.json
│   │   ├── split05.json
│   │   ├── split06.json
│   │   ├── split07.json
│   │   ├── split08.json
│   │   └── split09.json
│   ├── query
│   │   └── dev.query.json
│   └── train
│       ├── split00.json
│       ├── split01.json
│       ├── split02.json
│       ├── split03.json
│       ├── split04.json
│       ├── split05.json
│       ├── split06.json
│       ├── split07.json
│       └── split08.json
├── corpus.tsv
├── dev.query.txt
├── para.title.txt
├── para.txt
├── qidpidtriples.train.full.2.tsv
├── qrels.dev.tsv
├── qrels.train.addition.tsv
├── qrels.train.tsv
├── train.negatives.tsv
└── train.query.txt
4 directories, 30 files
Hope you can help.
I was able to solve this error by updating the `datasets` package to 1.12.0 and making the following change:
self.train_data = datasets.load_dataset(
    'json',
    data_files=os.path.join(path_to_data, "*.json"),
    ignore_verifications=False,
)['train']
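As a side note on the workaround above: a minimal sketch of resolving the shard files explicitly before handing them to `load_dataset`, so that an empty or mistyped directory fails with a clear message instead of `data_files=None`. The helper name `resolve_json_shards` is hypothetical and not part of Dense; the commented usage assumes `datasets.load_dataset` accepts a list of paths for `data_files`.

```python
import glob
import os


def resolve_json_shards(train_dir):
    """Return the sorted *.json files in train_dir, or raise if none match."""
    pattern = os.path.join(train_dir, "*.json")
    files = sorted(glob.glob(pattern))
    if not files:
        raise FileNotFoundError(f"No JSON shards matched {pattern!r}")
    return files


# Hypothetical usage (requires the Hugging Face `datasets` package):
#   import datasets
#   train_data = datasets.load_dataset(
#       "json", data_files=resolve_json_shards("marco/bert/train")
#   )["train"]
```

Passing an explicit list sidesteps any difference in how glob patterns are expanded across `datasets` versions, at the cost of one extra helper.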
By the way, after solving the data loading error, I ran into the following error.
Traceback (most recent call last):
File "/home/ubuntu/miniconda3/envs/dense/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/ubuntu/miniconda3/envs/dense/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/ubuntu/Dense/src/dense/driver/train.py", line 106, in <module>
main()
File "/home/ubuntu/Dense/src/dense/driver/train.py", line 99, in main
model_path=model_args.model_name_or_path if os.path.isdir(model_args.model_name_or_path) else None
File "/home/ubuntu/miniconda3/envs/dense/lib/python3.7/site-packages/transformers/trainer.py", line 1284, in train
tr_loss += self.training_step(model, inputs)
File "/home/ubuntu/Dense/src/dense/trainer.py", line 65, in training_step
return super(DenseTrainer, self).training_step(*args) / self._dist_loss_scale_factor
File "/home/ubuntu/miniconda3/envs/dense/lib/python3.7/site-packages/transformers/trainer.py", line 1787, in training_step
loss = self.compute_loss(model, inputs)
File "/home/ubuntu/Dense/src/dense/trainer.py", line 62, in compute_loss
return model(query=query, passage=passage).loss
File "/home/ubuntu/miniconda3/envs/dense/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/miniconda3/envs/dense/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/home/ubuntu/miniconda3/envs/dense/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home/ubuntu/miniconda3/envs/dense/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
output.reraise()
File "/home/ubuntu/miniconda3/envs/dense/lib/python3.7/site-packages/torch/_utils.py", line 425, in reraise
raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
File "/home/ubuntu/miniconda3/envs/dense/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
output = module(*input, **kwargs)
File "/home/ubuntu/miniconda3/envs/dense/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/Dense/src/dense/modeling.py", line 107, in forward
q_hidden, q_reps = self.encode_query(query)
File "/home/ubuntu/Dense/src/dense/modeling.py", line 173, in encode_query
qry_out = self.lm_q(**qry, return_dict=True)
File "/home/ubuntu/miniconda3/envs/dense/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/miniconda3/envs/dense/lib/python3.7/site-packages/transformers/models/bert/modeling_bert.py", line 988, in forward
past_key_values_length=past_key_values_length,
File "/home/ubuntu/miniconda3/envs/dense/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/miniconda3/envs/dense/lib/python3.7/site-packages/transformers/models/bert/modeling_bert.py", line 215, in forward
inputs_embeds = self.word_embeddings(input_ids)
File "/home/ubuntu/miniconda3/envs/dense/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/miniconda3/envs/dense/lib/python3.7/site-packages/torch/nn/modules/sparse.py", line 160, in forward
self.norm_type, self.scale_grad_by_freq, self.sparse)
File "/home/ubuntu/miniconda3/envs/dense/lib/python3.7/site-packages/torch/nn/functional.py", line 2043, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking arugment for argument index in method wrapper_index_select)
It turns out to be a bug related to the data type of the batch. `transformers.trainer._prepare_input` only checks for `dict`, `tuple`, `list`, and `torch.Tensor`; anything else is returned without being moved to `args.device`. The encoded batch has type `BatchEncoding`, so the prepared data still resides on the CPU.
I changed the line https://github.com/luyug/Dense/blob/main/src/dense/trainer.py#L43 to `prepared.append(super()._prepare_inputs(x.data))`, which resolves the above error. It may not be the best fix, so please comment if you have a better one.
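The type-check problem described above can be illustrated without torch or transformers. The sketch below is only a stand-in: `FakeTensor`, `FakeBatch`, `prepare_strict`, and `prepare_mapping` are hypothetical names, and it assumes `BatchEncoding` is a `collections.UserDict` subclass, which is why an `isinstance(data, dict)` check would skip it while its `.data` attribute (a plain `dict`) passes.

```python
from collections import UserDict
from collections.abc import Mapping


class FakeTensor:
    """Torch-free stand-in for a tensor that remembers which device it is on."""

    def __init__(self, device="cpu"):
        self.device = device

    def to(self, device):
        return FakeTensor(device)


class FakeBatch(UserDict):
    """Stand-in for a UserDict-based batch (assumed to behave like BatchEncoding)."""


def prepare_strict(data, device):
    # Mirrors the behaviour described above: only dict, tuple/list and tensors are handled.
    if isinstance(data, dict):
        return {k: prepare_strict(v, device) for k, v in data.items()}
    if isinstance(data, (tuple, list)):
        return type(data)(prepare_strict(v, device) for v in data)
    if isinstance(data, FakeTensor):
        return data.to(device)
    return data  # any other type is returned untouched, still on the CPU


def prepare_mapping(data, device):
    # Alternative check: accept any Mapping, which also covers UserDict subclasses.
    if isinstance(data, Mapping):
        return {k: prepare_mapping(v, device) for k, v in data.items()}
    if isinstance(data, (tuple, list)):
        return type(data)(prepare_mapping(v, device) for v in data)
    if isinstance(data, FakeTensor):
        return data.to(device)
    return data


batch = FakeBatch({"input_ids": FakeTensor()})
print(prepare_strict(batch, "cuda:0")["input_ids"].device)       # cpu
print(prepare_strict(batch.data, "cuda:0")["input_ids"].device)  # cuda:0
print(prepare_mapping(batch, "cuda:0")["input_ids"].device)      # cuda:0
```

The second call corresponds to the `x.data` workaround; the third shows that checking against `Mapping` instead of `dict` would be another way to cover such containers.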
Hi @countback @nickyongzhang, Thanks for catching the issue. We fixed the issue in https://github.com/luyug/Dense/pull/7.
@nickyongzhang we reproduced your second issue with torch 1.9 and transformers 4.10.0; however, there is no such error for transformers<=4.9.0. We currently support up to 4.9.0. Thank you for letting us know about a potential way to support the latest transformers; we will take a look.
When I tried to load the training dataset according to the Data Format instructions in README.md, I encountered the following error. How should I modify it?
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/content/Dense/src/dense/driver/train.py", line 104, in <module>
main()
File "/content/Dense/src/dense/driver/train.py", line 79, in main
train_dataset = TrainDataset(
File "/content/Dense/src/dense/data.py", line 26, in __init__
self.train_data = datasets.load_dataset(
File "/usr/local/lib/python3.10/dist-packages/datasets/load.py", line 607, in load_dataset
builder_instance.download_and_prepare(
File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 514, in download_and_prepare
self._download_and_prepare(
File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 592, in _download_and_prepare
self._prepare_split(split_generator, **prepare_split_kwargs)
File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 942, in _prepare_split
for key, table in utils.tqdm(generator, unit=" tables", leave=False, disable=not_verbose):
File "/usr/local/lib/python3.10/dist-packages/tqdm/std.py", line 1133, in __iter__
for obj in iterable:
File "/root/.cache/huggingface/modules/datasets_modules/datasets/json/70d89ed4db1394f028c651589fcab6d6b28dddcabbe39d3b21b4d41f9a708514/json.py", line 87, in _generate_tables
raise ValueError(
ValueError: Not able to read records in the JSON file at ../data/train/43.json. You should probably indicate the field of the JSON file containing your records. This JSON file contain the following fields: ['query', 'positives', 'negatives']. Select the correct one and provide it as `field='XXX'` to the `load_dataset` method.
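One possible cause, offered as a guess rather than a confirmed diagnosis: the message is consistent with each file (such as `43.json`) holding a single JSON object with the keys `query`, `positives`, and `negatives`, possibly spread over several lines, whereas the JSON loader reads one record per line. Under that assumption, a minimal sketch of merging such files into a one-object-per-line file could look like this. The function name and the paths in the usage comment are hypothetical.

```python
import json
from pathlib import Path


def merge_to_json_lines(src_dir, out_file):
    """Write each JSON object found in src_dir/*.json as one compact line in out_file.

    Assumes every input file contains exactly one JSON object.
    Keep out_file outside src_dir so it is not picked up as an input.
    """
    count = 0
    with open(out_file, "w", encoding="utf-8") as out:
        for path in sorted(Path(src_dir).glob("*.json")):
            with open(path, encoding="utf-8") as f:
                record = json.load(f)
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count


# Hypothetical usage:
#   merge_to_json_lines("../data/train", "../data/train_lines/train.json")
```

If the files are already one object per line, this conversion is not the issue and the cause lies elsewhere.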
Following the introduction and requirements, I am fine-tuning a retrieval model based on "Luyu/co-condenser-marco", but loading the training data fails with an error:
python -m dense.driver.train --output_dir ./retriever_model_s1 --model_name_or_path Luyu/co-condenser-marco --do_train --save_steps 20000 --train_dir marco/bert/train/ --fp16 --per_device_train_batch_size 8 --learning_rate 5e-6 --num_train_epochs 3 --dataloader_num_workers 8
09/19/2021 23:01:50 - WARNING - __main__ - Process rank: -1, device: cuda:0, n_gpu: 3, distributed training: False, 16-bits training: True
09/19/2021 23:01:50 - INFO - __main__ - Training/evaluation parameters DenseTrainingArguments(output_dir='./retriever_model_s1', overwrite_output_dir=False, do_train=True, do_eval=False, do_predict=False, evaluation_strategy=<EvaluationStrategy.NO: 'no'>, prediction_loss_only=False, per_device_train_batch_size=8, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=5e-06, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=3.0, max_steps=-1, lr_scheduler_type=<SchedulerType.LINEAR: 'linear'>, warmup_steps=0, logging_dir='runs/Sep19_23-01-50_bach-gpu-8v-011028229099.na620', logging_first_step=False, logging_steps=500, save_steps=20000, save_total_limit=None, no_cuda=False, seed=42, fp16=True, fp16_opt_level='O1', fp16_backend='auto', local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=500, dataloader_num_workers=8, past_index=-1, run_name='./retriever_model_s1', disable_tqdm=False, remove_unused_columns=True, label_names=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None, ignore_data_skip=False, sharded_ddp=False, deepspeed=None, label_smoothing_factor=0.0, adafactor=False, warmup_ratio=0.1, negatives_x_device=False, do_encode=False, grad_cache=False, gc_q_chunk_size=4, gc_p_chunk_size=32)
09/19/2021 23:01:50 - INFO - __main__ - MODEL parameters ModelArguments(model_name_or_path='Luyu/co-condenser-marco', target_model_path=None, config_name=None, tokenizer_name=None, cache_dir=None, untie_encoder=False, add_pooler=False, projection_in_dim=768, projection_out_dim=768)
Some weights of BertModel were not initialized from the model checkpoint at Luyu/co-condenser-marco and are newly initialized: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Using custom data configuration default
download_config: None
download_mode: None
Downloading and preparing dataset json/default (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /home/countback/.cache/huggingface/datasets/json/default/0.0.0/70d89ed4db1394f028c651589fcab6d6b28dddcabbe39d3b21b4d41f9a708514...
Traceback (most recent call last):
File "/home/countback/anaconda3/envs/dense/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/countback/anaconda3/envs/dense/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/countback/anaconda3/envs/dense/lib/python3.7/site-packages/dense/driver/train.py", line 104, in <module>
main()
File "/home/countback/anaconda3/envs/dense/lib/python3.7/site-packages/dense/driver/train.py", line 80, in main
data_args, data_args.train_dir, tokenizer,
File "/home/countback/anaconda3/envs/dense/lib/python3.7/site-packages/dense/data.py", line 32, in __init__
ignore_verifications=False,
File "/home/countback/anaconda3/envs/dense/lib/python3.7/site-packages/datasets/load.py", line 610, in load_dataset
ignore_verifications=ignore_verifications,
File "/home/countback/anaconda3/envs/dense/lib/python3.7/site-packages/datasets/builder.py", line 517, in download_and_prepare
dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
File "/home/countback/anaconda3/envs/dense/lib/python3.7/site-packages/datasets/builder.py", line 572, in _download_and_prepare
split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
File "/home/countback/.cache/huggingface/modules/datasets_modules/datasets/json/70d89ed4db1394f028c651589fcab6d6b28dddcabbe39d3b21b4d41f9a708514/json.py", line 45, in _split_generators
raise ValueError(f"At least one data file must be specified, but got data_files={self.config.data_files}")
ValueError: At least one data file must be specified, but got data_files=None