Open YooSungHyun opened 9 months ago
Maybe that option only works with deepspeed.initialize(training_data=...)? I am not initializing my data through DeepSpeed; I'm using torch.utils.data.Dataset and torch's DataLoader, not the DeepSpeed wrapper. I pass the batch to the model as keyword arguments, like model(**batch), but DeepSpeed's auto_cast only works on positional arguments (*args).
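If you want to keep calling model(**batch), one possible workaround is to cast the floating-point tensors in the batch yourself before the forward pass. The following is only a minimal sketch under that assumption: cast_floats_to_half is a hypothetical helper name (not a DeepSpeed or PyTorch API), and the batch keys are made up for illustration.

```python
import torch


def cast_floats_to_half(batch):
    """Return a new dict where floating-point tensors are cast to float16.

    Hypothetical helper for illustration; integer tensors (e.g. token ids
    or class labels) and non-tensor values are passed through unchanged.
    """
    out = {}
    for key, value in batch.items():
        if torch.is_tensor(value) and value.is_floating_point():
            out[key] = value.half()
        else:
            out[key] = value
    return out


# Illustrative batch: float32 inputs and int64 labels (key names are assumptions).
batch = {"inputs": torch.ones(2, 4), "labels": torch.tensor([0, 1])}
half_batch = cast_floats_to_half(batch)
print({k: v.dtype for k, v in half_batch.items()})
```

With this in place the call would look like model(**cast_floats_to_half(batch)). Whether fp16 is the right target dtype depends on your config; for bf16 you would cast to torch.bfloat16 instead.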
Replacing it with model(batch["inputs"]) worked for me in forward, but then I got an error in backward(). I'm also using a torch optimizer as the optimizer.
Found dtype Float but expected Half
File "/data/bart/temp_workspace/pytorch-trainer/.venv/lib/python3.10/site-packages/torch/autograd/__init__.py", line 251, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/data/bart/temp_workspace/pytorch-trainer/.venv/lib/python3.10/site-packages/torch/_tensor.py", line 492, in backward
torch.autograd.backward(
File "/data/bart/temp_workspace/pytorch-trainer/.venv/lib/python3.10/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
scaled_loss.backward(retain_graph=retain_graph)
File "/data/bart/temp_workspace/pytorch-trainer/.venv/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 2019, in backward
self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
File "/data/bart/temp_workspace/pytorch-trainer/.venv/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1958, in backward
self.optimizer.backward(loss, retain_graph=retain_graph)
File "/data/bart/temp_workspace/pytorch-trainer/.venv/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/data/bart/temp_workspace/pytorch-trainer/ds_train.py", line 115, in training_step
model.backward(loss)
File "/data/bart/temp_workspace/pytorch-trainer/trainer/deepspeed.py", line 233, in train_loop
loss = self.training_step(model=model, batch=batch, batch_idx=batch_idx)
File "/data/bart/temp_workspace/pytorch-trainer/trainer/deepspeed.py", line 155, in fit
self.train_loop(
File "/data/bart/temp_workspace/pytorch-trainer/ds_train.py", line 583, in main
trainer.fit(
File "/data/bart/temp_workspace/pytorch-trainer/ds_train.py", line 606, in <module>
main(args)
RuntimeError: Found dtype Float but expected Half
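To narrow down where Float and Half tensors meet, it may help to print the dtypes of the model parameters and of the batch right before the forward pass. This is a rough debugging sketch, not a fix: dtype_report is a hypothetical helper, and the nn.Linear model and batch keys below are stand-ins for the real ones.

```python
import torch
from torch import nn


def dtype_report(model, batch):
    """Collect the dtypes seen in the model parameters and in the batch tensors.

    Hypothetical debugging helper for illustration only.
    """
    param_dtypes = sorted({str(p.dtype) for p in model.parameters()})
    batch_dtypes = {
        key: str(value.dtype)
        for key, value in batch.items()
        if torch.is_tensor(value)
    }
    return {"params": param_dtypes, "batch": batch_dtypes}


# Stand-in model with fp16 weights and a batch that is still fp32.
model = nn.Linear(4, 2).half()
batch = {"inputs": torch.ones(2, 4), "labels": torch.tensor([0, 1])}
report = dtype_report(model, batch)
print(report)
```

If the report shows fp16 parameters together with fp32 inputs or fp32 floating-point labels, that mismatch is one candidate for the error above, though it may not be the only one.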
For auto_cast, I'm using torch.cuda.amp, which I'm sure will work, but will that cause any problems when using offload, etc.?
with autocast(enabled=True, dtype=torch.float16):
    labels = batch.pop("labels")
    output = model(batch["inputs"])
    loss = self.criterion(output, labels)
Same issue here. I don't know whether torch.autocast can be used together with DeepSpeed fp16.
Describe the bug
I'm working on https://github.com/YooSungHyun/pytorch-trainer (ds_train.py). When I run forward with fp16 enabled in the DeepSpeed config, the model weights are fp16 but the input data is fp32. I know autocast should handle this, but it raised an error like this. What did I do wrong?
Expected behavior
The forward pass runs without errors.
Additional context
My ZeRO stage 1 config looks like this...