paulocoutinhox opened 1 year ago
Also, after trying to run the code in this repo I get this error:
torchrun --nproc_per_node=4 --master_port=12345 train.py \
--model_name_or_path /Users/paulo/Developer/workspaces/cpp/llama.cpp/models/7B \
--data_path ./alpaca_data.json \
--bf16 True \
--output_dir ./out-new \
--num_train_epochs 3 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 8 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 2000 \
--save_total_limit 1 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--fsdp "full_shard auto_wrap" \
--fsdp_transformer_layer_cls_to_wrap 'LLaMADecoderLayer' \
--tf32 True
Error:
NOTE: Redirects are currently not supported in Windows or MacOs.
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
All four worker processes printed the same traceback:
Traceback (most recent call last):
  File "/Users/paulo/Downloads/stanford_alpaca-main/train.py", line 232, in <module>
    train()
  File "/Users/paulo/Downloads/stanford_alpaca-main/train.py", line 194, in train
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()
  File "/Users/paulo/Library/Python/3.9/lib/python/site-packages/transformers/hf_argparser.py", line 332, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 110, in __init__
  File "/Users/paulo/Library/Python/3.9/lib/python/site-packages/transformers/training_args.py", line 1172, in __post_init__
    and (self.device.type != "cuda")
  File "/Users/paulo/Library/Python/3.9/lib/python/site-packages/transformers/training_args.py", line 1556, in device
    return self._setup_devices
  File "/Users/paulo/Library/Python/3.9/lib/python/site-packages/transformers/utils/generic.py", line 57, in __get__
    cached = self.fget(obj)
  File "/Users/paulo/Library/Python/3.9/lib/python/site-packages/transformers/training_args.py", line 1541, in _setup_devices
    torch.distributed.init_process_group(backend="nccl", timeout=self.ddp_timeout_delta)
  File "/Users/paulo/Library/Python/3.9/lib/python/site-packages/torch/distributed/distributed_c10d.py", line 761, in init_process_group
    default_pg = _new_process_group_helper(
  File "/Users/paulo/Library/Python/3.9/lib/python/site-packages/torch/distributed/distributed_c10d.py", line 886, in _new_process_group_helper
    raise RuntimeError("Distributed package doesn't have NCCL " "built in")
RuntimeError: Distributed package doesn't have NCCL built in
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 16486) of binary: /Applications/Xcode.app/Contents/Developer/usr/bin/python3
Traceback (most recent call last):
  File "/Users/paulo/Library/Python/3.9/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/Users/paulo/Library/Python/3.9/lib/python/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/Users/paulo/Library/Python/3.9/lib/python/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/Users/paulo/Library/Python/3.9/lib/python/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/Users/paulo/Library/Python/3.9/lib/python/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/Users/paulo/Library/Python/3.9/lib/python/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2023-03-15_16:28:33
host : 1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.ip6.arpa
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 16487)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2023-03-15_16:28:33
host : 1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.ip6.arpa
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 16488)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
time : 2023-03-15_16:28:33
host : 1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.ip6.arpa
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 16489)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-03-15_16:28:33
host : 1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.ip6.arpa
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 16486)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
It sounds like you don't have a build of PyTorch that supports NCCL (macOS builds don't include it). What hardware are you running on? Here's the relevant documentation with more info: https://pytorch.org/docs/stable/distributed.html. You may have to switch to a different communication backend.
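For reference, here's a minimal sketch of checking which backends your PyTorch build supports and initializing the process group with Gloo instead of NCCL. Your traceback shows transformers calling init_process_group(backend="nccl") inside training_args.py, so you would have to patch that call (or move to a transformers version that lets you choose the backend):

import torch.distributed as dist

# NCCL ships only with CUDA builds of PyTorch, so macOS builds report False here.
print(dist.is_nccl_available())
print(dist.is_gloo_available())  # Gloo is the CPU-capable fallback backend

# Sketch: initialize with Gloo instead of NCCL. torchrun already sets
# RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT in the environment, so no
# explicit init_method is needed here.
dist.init_process_group(backend="gloo")

Also note that --bf16 and --tf32 in your command assume Ampere-class NVIDIA GPUs, so they would need to be dropped on a Mac regardless of the backend.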
@paulocoutinhox what loss do you use here? Secondly, is a single 40GB GPU sufficient to train with a batch size > 1?
Hi,
What are the steps to train it with this specific Bible content?
Example: https://raw.githubusercontent.com/tushortz/variety-bible-text/master/bibles/kjv.txt
Can you show me the steps to train it?
And the other question: is the final model compatible with LLaMA?
Thanks.
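For what it's worth, here is a hypothetical sketch of one way to turn a plain-text file like kjv.txt into the Alpaca-style JSON that train.py loads via --data_path (a list of records with "instruction", "input", and "output" fields). The one-verse-per-line assumption and the instruction wording are illustrative only:

import json

# Hypothetical sketch: wrap each non-empty line of the corpus
# in an Alpaca-style instruction/input/output record.
records = []
with open("kjv.txt", encoding="utf-8") as f:
    for line in f:
        verse = line.strip()
        if verse:
            records.append({
                "instruction": "Write a verse in the style of the King James Bible.",
                "input": "",
                "output": verse,
            })

# Write the dataset in the same shape as alpaca_data.json.
with open("bible_data.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)

You would then pass --data_path ./bible_data.json to the same torchrun command.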