Closed eshnil2000 closed 6 months ago
I wanted to update: I am able to manually set up an NVMe SSD as a swap device and then force DeepSpeed to swap to the NVMe SSD by limiting the available CPU RAM, but I still can't get the optimized, out-of-the-box NVMe offload to function.
Can you post the rest of the ds_config? Based on the stack trace you gave, the error is coming from accelerate and involves the batch size setting. Is that the only error?
The above is the entire ds_config file; there is nothing else. This is the only error. The batch size is set in the main Python script:
```python
accelerator = Accelerator()
model_name_or_path = "facebook/bart-large"
dataset_name = "twitter_complaints"
peft_config = LoraConfig(task_type=TaskType.CAUSAL_LM, inference_mode=False, r=8, lora_alpha=32, lora_dropout=0.1)
text_column = "Tweet text"
label_column = "text_label"
lr = 3e-3
num_epochs = 2
batch_size = 8
seed = 42
max_length = 64
do_test = False
set_seed(seed)

dataset = load_dataset("ought/raft", dataset_name)
classes = [k.replace("_", " ") for k in dataset["train"].features["Label"].names]
dataset = dataset.map(
    lambda x: {"text_label": [classes[label] for label in x["Label"]]},
    batched=True,
    num_proc=1,
)
```
I tried changing the batch size to 1, but got the same error.
Since the error is coming from accelerate and there is no DeepSpeed code in the stack trace, I believe it is a parsing error in their code. After reviewing their code, I suggest: 1) switch to the latest version; 2) set train_micro_batch_size_per_gpu in your ds_config to ensure it is being parsed correctly.
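For reference, a minimal ZeRO-3 NVMe offload ds_config with train_micro_batch_size_per_gpu set explicitly might look like the sketch below. The batch size, gradient accumulation value, and the nvme_path are placeholders, not values from the original report:

```json
{
  "train_micro_batch_size_per_gpu": 8,
  "gradient_accumulation_steps": 1,
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "nvme",
      "nvme_path": "/local_nvme"
    },
    "offload_param": {
      "device": "nvme",
      "nvme_path": "/local_nvme"
    }
  }
}
```

With this key present, accelerate can read a concrete integer for the per-GPU batch size instead of falling back to None.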
Describe the bug I'm able to run peft pre-trained models from https://github.com/huggingface/peft and can successfully offload to CPU memory using DeepSpeed. The model I used is "facebook/bart-large".
But when I try to run the exact same model with offload to NVMe, I get this error: TypeError: unsupported operand type(s) for *: 'NoneType' and 'int'
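A minimal sketch of how this TypeError can arise, assuming (as the stack trace suggests) that accelerate computes the total train_batch_size by multiplying a per-device batch size it read from the DeepSpeed config, which ends up as None when the key is missing or unparsed. The names below are simplified illustrations, not accelerate's actual code:

```python
# Hypothetical reconstruction of the failure mode; names are placeholders.
ds_config = {}  # train_micro_batch_size_per_gpu never set / not parsed

batch_size_per_device = ds_config.get("train_micro_batch_size_per_gpu")  # -> None
gradient_accumulation_steps = 1
num_processes = 1

try:
    # Mirrors a computation like: batch_size_per_device * grad_accum * world_size
    train_batch_size = batch_size_per_device * gradient_accumulation_steps * num_processes
except TypeError as e:
    print(e)  # unsupported operand type(s) for *: 'NoneType' and 'int'
```

This is consistent with the reported message: multiplying None by an int raises exactly this TypeError.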
My accelerate config file "ds_zero3_nvme.yaml"
My deepspeed config file "deepspeed_config.json":
I built deepspeed from source with option DS_BUILD_AIO=1:
```sh
git clone https://github.com/huggingface/peft && cd peft
accelerate launch --config_file ds_zero3_nvme.yaml examples/causal_language_modeling/peft_lora_clm_accelerate_ds_zero3_offload.py
```
Error:
```
Traceback (most recent call last):
  File "/home/ubuntu/peft/examples/causal_language_modeling/peft_lora_clm_accelerate_ds_zero3_offload.py", line 367, in <module>
    main()
  File "/home/ubuntu/peft/examples/causal_language_modeling/peft_lora_clm_accelerate_ds_zero3_offload.py", line 231, in main
    model, train_dataloader, eval_dataloader, test_dataloader, optimizer, lr_scheduler = accelerator.prepare(
  File "/home/ubuntu/.local/lib/python3.10/site-packages/accelerate/accelerator.py", line 1220, in prepare
    result = self._prepare_deepspeed(*args)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/accelerate/accelerator.py", line 1467, in _prepare_deepspeed
    "train_batch_size": batch_size_per_device
TypeError: unsupported operand type(s) for *: 'NoneType' and 'int'
```
Expected behavior The script runs with NVMe offload just as it does with CPU offload.
ds_report output
System info (please complete the following information):
Launcher context: launched using accelerate