mlfoundations / open_flamingo

An open-source framework for training large multimodal models.

Major refactor to support new architectures #261

i-gao opened this issue 1 year ago

i-gao commented 1 year ago

New models

FSDP Updates

Training code refactor

Steps before merging

Steps after merging

anas-awadalla commented 1 year ago

Some other todos I want to add to this:

liyongqi67 commented 1 year ago

I have a keen interest in exploring the latest features. To that end, I've integrated the deepspeed-related code into the current main branch of OpenFlamingo, including functions like get_deepspeed_config(). During my testing, the code runs smoothly with deepspeed_stage = 2 and shows a significant efficiency improvement compared to FSDP. However, when I configured deepspeed_stage = 3, an error was raised during the loss backward pass:

    model.backward(divided_loss_laion)
  File "/home/yongqi/miniconda3/envs/openflamingo/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/yongqi/miniconda3/envs/openflamingo/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1923, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/home/yongqi/miniconda3/envs/openflamingo/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/yongqi/miniconda3/envs/openflamingo/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 2080, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/home/yongqi/miniconda3/envs/openflamingo/lib/python3.9/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/home/yongqi/miniconda3/envs/openflamingo/lib/python3.9/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/home/yongqi/miniconda3/envs/openflamingo/lib/python3.9/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: The size of tensor a (0) must match the size of tensor b (8192) at non-singleton dimension 1

Do you have any idea about this? Or have you encountered this problem while developing the new version?
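
For reference, this is roughly the shape of the config I am passing to DeepSpeed. The get_deepspeed_config() in the refactor branch may differ; the fields and defaults below are only an illustrative sketch of the stage 2 vs. stage 3 setup, not the branch's actual implementation.

```python
# Illustrative sketch only: the real get_deepspeed_config() may use different
# fields/defaults. Shown to make the stage 2 vs. stage 3 comparison concrete.
import deepspeed

def get_deepspeed_config(stage: int = 2, micro_batch_size: int = 8):
    return {
        "train_micro_batch_size_per_gpu": micro_batch_size,
        "gradient_accumulation_steps": 1,
        "bf16": {"enabled": True},
        "zero_optimization": {
            "stage": stage,  # 2 = shard optimizer state + gradients; 3 = also shard parameters
            "overlap_comm": True,
            "contiguous_gradients": True,
        },
    }

# model_engine, optimizer, _, _ = deepspeed.initialize(
#     model=model, optimizer=optimizer, config=get_deepspeed_config(stage=3)
# )
# model_engine.backward(divided_loss_laion)  # the call that raises the error above under stage 3
# model_engine.step()
```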

anas-awadalla commented 1 year ago

You said you integrated “deepspeed-related code into the current main branch of Openflamingo”. Have you tried using this branch as is? The integration is basically complete but we are doing more testing to be certain. I will also test out stage 3 again to make sure we haven’t missed anything.

liyongqi67 commented 1 year ago

> You said you integrated “deepspeed-related code into the current main branch of Openflamingo”. Have you tried using this branch as is? The integration is basically complete but we are doing more testing to be certain. I will also test out stage 3 again to make sure we haven’t missed anything.

I did not run this branch directly, as I have developed my project on top of the main branch, so I just copied the deepspeed-related code from this branch into my code. The error is very strange: 1) stage 2 works, but stage 3 reports the error; 2) the error occurs during the loss backward pass, which rarely raises errors; 3) it is unclear which tensor has size 0 as reported. If you have no idea about this, I will have to run my code with DeepSpeed stage 2. Thanks!
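
One guess on my side (an assumption, not something I have confirmed): under ZeRO stage 3 each parameter is partitioned across ranks, and a parameter that is used without being gathered is locally a 0-element placeholder, which would match the size 0 vs. 8192 mismatch. Below is a small sketch for checking which parameters are in that state; the helper name is mine, and the GatheredParameters usage is just the standard DeepSpeed pattern.

```python
# Debugging sketch (assumption: the size-0 tensor is a ZeRO-3 partitioned parameter
# that is used without being gathered, e.g. a frozen or tied weight accessed directly).
import deepspeed

def report_partitioned_params(model):
    """Print parameters that ZeRO-3 has reduced to a 0-element local placeholder."""
    for name, p in model.named_parameters():
        # ds_numel / ds_shape are attached by ZeRO-3; p itself may hold no local data
        if hasattr(p, "ds_numel") and p.numel() == 0:
            print(f"{name}: local numel=0, full shape={tuple(p.ds_shape)}")

# If a partitioned weight must be read outside the normal forward/backward hooks,
# it has to be gathered explicitly first, e.g.:
# with deepspeed.zero.GatheredParameters(some_module.weight):
#     ...  # the weight is temporarily materialized at its full size here
```

Calling report_partitioned_params(model) right before the failing backward would at least narrow down whether one of the frozen towers or a tied embedding is involved.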

liyongqi67 commented 1 year ago

> You said you integrated “deepspeed-related code into the current main branch of Openflamingo”. Have you tried using this branch as is? The integration is basically complete but we are doing more testing to be certain. I will also test out stage 3 again to make sure we haven’t missed anything.

I tried this branch, and the training part works well. I also tested the evaluation part of the "Merge wilds mllm" branch. Unfortunately, there are some bugs; I reported two of them in https://github.com/mlfoundations/open_flamingo/pull/266.