hiyouga / LLaMA-Factory

Unified Efficient Fine-Tuning of 100+ LLMs (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

torch.distributed.elastic.multiprocessing.errors.ChildFailedError #323

Closed Lan105753035 closed 1 year ago

Lan105753035 commented 1 year ago

I hit this error while training with a custom dataset.

Traceback (most recent call last):
  File "/workspace/LLaMA-Efficient-Tuning/src/train_bash.py", line 23, in <module>
    main()
  File "/workspace/LLaMA-Efficient-Tuning/src/train_bash.py", line 10, in main
    run_sft(model_args, data_args, training_args, finetuning_args)
  File "/workspace/LLaMA-Efficient-Tuning/src/llmtuner/tuner/sft/workflow.py", line 27, in run_sft
    dataset = get_dataset(model_args, data_args)
  File "/workspace/LLaMA-Efficient-Tuning/src/llmtuner/dsets/loader.py", line 78, in get_dataset
    dataset = load_dataset(
  File "/opt/conda/envs/llama/lib/python3.10/site-packages/datasets/load.py", line 2133, in load_dataset
    builder_instance.download_and_prepare(
  File "/opt/conda/envs/llama/lib/python3.10/site-packages/datasets/builder.py", line 954, in download_and_prepare
    self._download_and_prepare(
  File "/opt/conda/envs/llama/lib/python3.10/site-packages/datasets/builder.py", line 1049, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/opt/conda/envs/llama/lib/python3.10/site-packages/datasets/builder.py", line 1813, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File "/opt/conda/envs/llama/lib/python3.10/site-packages/datasets/builder.py", line 1958, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.builder.DatasetGenerationError: An error occurred while generating the dataset
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 30105) of binary: /opt/conda/envs/llama/bin/python
Traceback (most recent call last):
  File "/opt/conda/envs/llama/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/envs/llama/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/opt/conda/envs/llama/lib/python3.10/site-packages/accelerate/commands/launch.py", line 970, in launch_command
    multi_gpu_launcher(args)
  File "/opt/conda/envs/llama/lib/python3.10/site-packages/accelerate/commands/launch.py", line 646, in multi_gpu_launcher
    distrib_run.run(args)
  File "/opt/conda/envs/llama/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/opt/conda/envs/llama/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/envs/llama/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

Lan105753035 commented 1 year ago

I have checked the dataset_info entry and the custom dataset. A sample record and the dataset_info entry:

instruction:"Complete the following paragraph: "
input:"This invention relates to novel compounds suitable for labelling or already labelled by18F, methods of preparing such a compound, compositions comprising such compounds, kits comprising such compounds or compositions and uses of such compounds, compositions or kits for diagnostic imaging by positron emission "
output:"This invention relates to novel compounds suitable for labelling or already labelled by18F, methods of preparing such a compound, compositions comprising such compounds, kits comprising such compounds or compositions and uses of such compounds, compositions or kits for diagnostic imaging by positron emission tomography (PET).</s>"

file_name:"test.json"
prompt:"instruction"
query:"input"
response:"output"
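
One way to narrow this down, offered as a minimal sketch and not as something from this thread: the traceback above shows only the wrapping `DatasetGenerationError`, which was raised `from e`, so the underlying cause is not visible here. Parsing the file with the standard-library `json` module outside the training run can surface it. The helper names `check_records` and `check_file` are hypothetical. The sketch assumes `test.json` is a single JSON array of objects and that all three mapped columns should be strings in every record; both are assumptions about the intended layout, not confirmed requirements.

```python
import json
from pathlib import Path

# Column names taken from the dataset_info entry above.
REQUIRED_KEYS = ("instruction", "input", "output")


def check_records(records, required_keys=REQUIRED_KEYS):
    """Return a list of human-readable problems found in the parsed records."""
    # Assumption: the file holds one JSON array of objects.
    if not isinstance(records, list):
        return [f"top level is {type(records).__name__}, expected a list of objects"]
    problems = []
    for i, rec in enumerate(records):
        if not isinstance(rec, dict):
            problems.append(f"record {i}: expected an object, got {type(rec).__name__}")
            continue
        for key in required_keys:
            if key not in rec:
                problems.append(f"record {i}: missing key {key!r}")
            elif not isinstance(rec[key], str):
                # Assumption: every mapped column should be a string.
                problems.append(
                    f"record {i}: {key!r} is {type(rec[key]).__name__}, expected str"
                )
    return problems


def check_file(path):
    """Parse the file with the stdlib json module, then validate its records."""
    text = Path(path).read_text(encoding="utf-8")
    try:
        records = json.loads(text)
    except json.JSONDecodeError as err:
        # Report where parsing stopped so the offending line can be inspected.
        return [f"invalid JSON at line {err.lineno}, column {err.colno}: {err.msg}"]
    return check_records(records)
```

Calling `check_file("test.json")` from the data directory returns an empty list when nothing is flagged. An empty result does not prove the file is loadable by `load_dataset`; it only rules out the specific problems this sketch checks for.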