luyug / Dense

A toolkit for building dense retrievers with deep language models.
Apache License 2.0

Cannot load training dataset #6

Open countback opened 3 years ago

countback commented 3 years ago

Following the introduction and requirements, I am fine-tuning a retrieval model based on "Luyu/co-condenser-marco", but loading the training data fails with an error:

python -m dense.driver.train --output_dir ./retriever_model_s1 --model_name_or_path Luyu/co-condenser-marco --do_train --save_steps 20000 --train_dir marco/bert/train/ --fp16 --per_device_train_batch_size 8 --learning_rate 5e-6 --num_train_epochs 3 --dataloader_num_workers 8

09/19/2021 23:01:50 - WARNING - __main__ - Process rank: -1, device: cuda:0, n_gpu: 3, distributed training: False, 16-bits training: True
09/19/2021 23:01:50 - INFO - __main__ - Training/evaluation parameters DenseTrainingArguments(output_dir='./retriever_model_s1', overwrite_output_dir=False, do_train=True, do_eval=False, do_predict=False, evaluation_strategy=<EvaluationStrategy.NO: 'no'>, prediction_loss_only=False, per_device_train_batch_size=8, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=5e-06, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=3.0, max_steps=-1, lr_scheduler_type=<SchedulerType.LINEAR: 'linear'>, warmup_steps=0, logging_dir='runs/Sep19_23-01-50_bach-gpu-8v-011028229099.na620', logging_first_step=False, logging_steps=500, save_steps=20000, save_total_limit=None, no_cuda=False, seed=42, fp16=True, fp16_opt_level='O1', fp16_backend='auto', local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=500, dataloader_num_workers=8, past_index=-1, run_name='./retriever_model_s1', disable_tqdm=False, remove_unused_columns=True, label_names=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None, ignore_data_skip=False, sharded_ddp=False, deepspeed=None, label_smoothing_factor=0.0, adafactor=False, warmup_ratio=0.1, negatives_x_device=False, do_encode=False, grad_cache=False, gc_q_chunk_size=4, gc_p_chunk_size=32)
09/19/2021 23:01:50 - INFO - __main__ - MODEL parameters ModelArguments(model_name_or_path='Luyu/co-condenser-marco', target_model_path=None, config_name=None, tokenizer_name=None, cache_dir=None, untie_encoder=False, add_pooler=False, projection_in_dim=768, projection_out_dim=768)
Some weights of BertModel were not initialized from the model checkpoint at Luyu/co-condenser-marco and are newly initialized: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Using custom data configuration default
download_config: None
download_mode: None
Downloading and preparing dataset json/default (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /home/countback/.cache/huggingface/datasets/json/default/0.0.0/70d89ed4db1394f028c651589fcab6d6b28dddcabbe39d3b21b4d41f9a708514...

Traceback (most recent call last):
  File "/home/countback/anaconda3/envs/dense/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/countback/anaconda3/envs/dense/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/countback/anaconda3/envs/dense/lib/python3.7/site-packages/dense/driver/train.py", line 104, in <module>
    main()
  File "/home/countback/anaconda3/envs/dense/lib/python3.7/site-packages/dense/driver/train.py", line 80, in main
    data_args, data_args.train_dir, tokenizer,
  File "/home/countback/anaconda3/envs/dense/lib/python3.7/site-packages/dense/data.py", line 32, in __init__
    ignore_verifications=False,
  File "/home/countback/anaconda3/envs/dense/lib/python3.7/site-packages/datasets/load.py", line 610, in load_dataset
    ignore_verifications=ignore_verifications,
  File "/home/countback/anaconda3/envs/dense/lib/python3.7/site-packages/datasets/builder.py", line 517, in download_and_prepare
    dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
  File "/home/countback/anaconda3/envs/dense/lib/python3.7/site-packages/datasets/builder.py", line 572, in _download_and_prepare
    split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
  File "/home/countback/.cache/huggingface/modules/datasets_modules/datasets/json/70d89ed4db1394f028c651589fcab6d6b28dddcabbe39d3b21b4d41f9a708514/json.py", line 45, in _split_generators
    raise ValueError(f"At least one data file must be specified, but got data_files={self.config.data_files}")
ValueError: At least one data file must be specified, but got data_files=None

MXueguang commented 3 years ago

Hi @countback, could you double-check that you are using the latest code here (and rebuild)? https://github.com/luyug/Dense/blob/319ef30e3efa6658fdbffce4f2844998d86b75ac/src/dense/data.py#L30 We changed data_files to data_dir; the traceback suggests you are still using data_files.
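
For reference, a minimal sketch of what the linked line now does (the variable name path_to_data is an assumption, taken from later comments in this thread):

import datasets

# data.py (sketch): training shards are now discovered via data_dir,
# which points at the directory of JSON files rather than at the files
self.train_data = datasets.load_dataset(
    'json',
    data_dir=path_to_data,        # e.g. marco/bert/train/
    ignore_verifications=False,
)['train']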

countback commented 3 years ago

> Hi @countback, could you double-check that you are using the latest code here (and rebuild)? https://github.com/luyug/Dense/blob/319ef30e3efa6658fdbffce4f2844998d86b75ac/src/dense/data.py#L30
>
> We changed data_files to data_dir; the traceback suggests you are still using data_files.

Yes, I'm using the latest code; the same error happens.

luyug commented 3 years ago

Run tree marco and paste the output here. Maybe there's a problem with the data directory's structure.

nickyongzhang commented 2 years ago

I am running the msmarco-passage-ranking example and see the same error. I have run pip uninstall dense && pip install --editable . and still see the error.

My env info:

torch==1.9.0
datasets==1.11.0
faiss-cpu==1.7.1.post2
transformers==4.10.0

My marco directory looks like this:

marco
├── bert
│   ├── corpus
│   │   ├── split00.json
│   │   ├── split01.json
│   │   ├── split02.json
│   │   ├── split03.json
│   │   ├── split04.json
│   │   ├── split05.json
│   │   ├── split06.json
│   │   ├── split07.json
│   │   ├── split08.json
│   │   └── split09.json
│   ├── query
│   │   └── dev.query.json
│   └── train
│       ├── split00.json
│       ├── split01.json
│       ├── split02.json
│       ├── split03.json
│       ├── split04.json
│       ├── split05.json
│       ├── split06.json
│       ├── split07.json
│       └── split08.json
├── corpus.tsv
├── dev.query.txt
├── para.title.txt
├── para.txt
├── qidpidtriples.train.full.2.tsv
├── qrels.dev.tsv
├── qrels.train.addition.tsv
├── qrels.train.tsv
├── train.negatives.tsv
└── train.query.txt

4 directories, 30 files

Hope you can help.

nickyongzhang commented 2 years ago

I was able to solve this error by updating the datasets package to 1.12.0 and making the following change:

self.train_data = datasets.load_dataset(
    'json',
    # glob over every JSON shard in the training directory
    data_files=os.path.join(path_to_data, "*.json"),
    ignore_verifications=False,
)['train']
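
As a quick sanity check (a sketch, assuming the directory layout shown above), the glob should resolve to the nine training shards:

import glob, os

# hypothetical check that the pattern matches the split files
path_to_data = 'marco/bert/train/'
print(sorted(glob.glob(os.path.join(path_to_data, '*.json'))))
# expect: ['marco/bert/train/split00.json', ..., 'marco/bert/train/split08.json']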

By the way, after solving the data loading error, I came across the following error.

Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/envs/dense/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/ubuntu/miniconda3/envs/dense/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/ubuntu/Dense/src/dense/driver/train.py", line 106, in <module>
    main()
  File "/home/ubuntu/Dense/src/dense/driver/train.py", line 99, in main
    model_path=model_args.model_name_or_path if os.path.isdir(model_args.model_name_or_path) else None
  File "/home/ubuntu/miniconda3/envs/dense/lib/python3.7/site-packages/transformers/trainer.py", line 1284, in train
    tr_loss += self.training_step(model, inputs)
  File "/home/ubuntu/Dense/src/dense/trainer.py", line 65, in training_step
    return super(DenseTrainer, self).training_step(*args) / self._dist_loss_scale_factor
  File "/home/ubuntu/miniconda3/envs/dense/lib/python3.7/site-packages/transformers/trainer.py", line 1787, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/ubuntu/Dense/src/dense/trainer.py", line 62, in compute_loss
    return model(query=query, passage=passage).loss
  File "/home/ubuntu/miniconda3/envs/dense/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/miniconda3/envs/dense/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/ubuntu/miniconda3/envs/dense/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/ubuntu/miniconda3/envs/dense/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/home/ubuntu/miniconda3/envs/dense/lib/python3.7/site-packages/torch/_utils.py", line 425, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/envs/dense/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/home/ubuntu/miniconda3/envs/dense/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/Dense/src/dense/modeling.py", line 107, in forward
    q_hidden, q_reps = self.encode_query(query)
  File "/home/ubuntu/Dense/src/dense/modeling.py", line 173, in encode_query
    qry_out = self.lm_q(**qry, return_dict=True)
  File "/home/ubuntu/miniconda3/envs/dense/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/miniconda3/envs/dense/lib/python3.7/site-packages/transformers/models/bert/modeling_bert.py", line 988, in forward
    past_key_values_length=past_key_values_length,
  File "/home/ubuntu/miniconda3/envs/dense/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/miniconda3/envs/dense/lib/python3.7/site-packages/transformers/models/bert/modeling_bert.py", line 215, in forward
    inputs_embeds = self.word_embeddings(input_ids)
  File "/home/ubuntu/miniconda3/envs/dense/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/miniconda3/envs/dense/lib/python3.7/site-packages/torch/nn/modules/sparse.py", line 160, in forward
    self.norm_type, self.scale_grad_by_freq, self.sparse)
  File "/home/ubuntu/miniconda3/envs/dense/lib/python3.7/site-packages/torch/nn/functional.py", line 2043, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking arugment for argument index in method wrapper_index_select)

It turns out to be a bug related to the datatype of the batch data. transformers.trainer._prepare_input only checks for dict, tuple, list, and torch.Tensor, and otherwise returns the data without casting it to args.device; but the encoded batch has type BatchEncoding, so the prepared data still resides on the CPU.

I changed the line https://github.com/luyug/Dense/blob/main/src/dense/trainer.py#L43 to prepared.append(super()._prepare_inputs(x.data)) and solved the above error. It may not be the best fix; please leave a comment if you have a better one.
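
For context, a minimal sketch of the patched method; only the x.data line comes from my change, the surrounding loop is assumed from the linked trainer.py:

from transformers import BatchEncoding, Trainer

class DenseTrainer(Trainer):
    def _prepare_inputs(self, inputs):
        prepared = []
        for x in inputs:
            if isinstance(x, BatchEncoding):
                # BatchEncoding.data is a plain dict, which the parent
                # Trainer knows how to move to self.args.device
                prepared.append(super()._prepare_inputs(x.data))
            else:
                prepared.append(super()._prepare_inputs(x))
        return prepared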

MXueguang commented 2 years ago

Hi @countback @nickyongzhang, thanks for catching the issue. We fixed it in https://github.com/luyug/Dense/pull/7.

@nickyongzhang, we reproduced your second issue with torch 1.9 and transformers 4.10.0; however, there is no such error with transformers<=4.9.0, which is the latest version we currently support. Thank you for pointing out a potential way to support newer transformers versions; we will take a look.

dhh-chyc commented 2 months ago

When I tried to load the training dataset according to the Data Format instructions in README.md, I encountered the following error. How should I modify it?

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/content/Dense/src/dense/driver/train.py", line 104, in <module>
    main()
  File "/content/Dense/src/dense/driver/train.py", line 79, in main
    train_dataset = TrainDataset(
  File "/content/Dense/src/dense/data.py", line 26, in __init__
    self.train_data = datasets.load_dataset(
  File "/usr/local/lib/python3.10/dist-packages/datasets/load.py", line 607, in load_dataset
    builder_instance.download_and_prepare(
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 514, in download_and_prepare
    self._download_and_prepare(
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 592, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 942, in _prepare_split
    for key, table in utils.tqdm(generator, unit=" tables", leave=False, disable=not_verbose):
  File "/usr/local/lib/python3.10/dist-packages/tqdm/std.py", line 1133, in __iter__
    for obj in iterable:
  File "/root/.cache/huggingface/modules/datasets_modules/datasets/json/70d89ed4db1394f028c651589fcab6d6b28dddcabbe39d3b21b4d41f9a708514/json.py", line 87, in _generate_tables
    raise ValueError(
ValueError: Not able to read records in the JSON file at ../data/train/43.json. You should probably indicate the field of the JSON file containing your records. This JSON file contain the following fields: ['query', 'positives', 'negatives']. Select the correct one and provide it as `field='XXX'` to the `load_dataset` method.
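
As the error message itself indicates, the json builder expects JSON Lines input, i.e. one self-contained record per line; this ValueError fires when a file instead holds a single top-level JSON object. A quick check along those lines (a sketch; the path comes from the traceback above, and the expected keys from the README's Data Format):

import json

# if the file is JSON Lines, its first line parses as one record;
# a pretty-printed top-level object makes this fail immediately
with open('../data/train/43.json') as f:
    record = json.loads(f.readline())

print(sorted(record.keys()))   # expect ['negatives', 'positives', 'query']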