Open i-gao opened 1 year ago
Some other todos I want to add to this:
I have a keen interest in exploring the latest features, so I integrated the deepspeed-related code (including functions like get_deepspeed_config()) into the current main branch of OpenFlamingo. During my testing, the code runs smoothly with deepspeed_stage = 2 and shows a significant efficiency improvement over FSDP. However, when I configured deepspeed_stage = 3, the following error occurred during loss backward propagation:
```
  model.backward(divided_loss_laion)
  File "/home/yongqi/miniconda3/envs/openflamingo/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/yongqi/miniconda3/envs/openflamingo/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1923, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/home/yongqi/miniconda3/envs/openflamingo/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/yongqi/miniconda3/envs/openflamingo/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 2080, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/home/yongqi/miniconda3/envs/openflamingo/lib/python3.9/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/home/yongqi/miniconda3/envs/openflamingo/lib/python3.9/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/home/yongqi/miniconda3/envs/openflamingo/lib/python3.9/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: The size of tensor a (0) must match the size of tensor b (8192) at non-singleton dimension 1
```
Do you have any idea about this? Have you encountered this problem while developing the new version?
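For reference, the ZeRO stage is ultimately just a field in the DeepSpeed config dict. The sketch below is illustrative only; build_zero_config and its defaults are assumptions, not the branch's actual get_deepspeed_config().

```python
# Minimal sketch (assumed helper, not the branch's get_deepspeed_config()):
# builds a ZeRO config dict of the shape DeepSpeed accepts for stages 2 and 3.
def build_zero_config(stage: int = 2, micro_batch_size: int = 8) -> dict:
    assert stage in (2, 3), "only ZeRO stages 2 and 3 are considered here"
    return {
        "train_micro_batch_size_per_gpu": micro_batch_size,
        "gradient_accumulation_steps": 1,
        "bf16": {"enabled": True},
        "zero_optimization": {
            "stage": stage,  # stage 3 additionally partitions the parameters themselves
            "overlap_comm": True,
            "contiguous_gradients": True,
        },
    }

# Passed to the engine at setup time, e.g.:
# engine, optimizer, _, _ = deepspeed.initialize(
#     model=model, model_parameters=model.parameters(), config=build_zero_config(3)
# )
```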
You said you integrated “deepspeed-related code into the current main branch of Openflamingo”. Have you tried using this branch as is? The integration is basically complete but we are doing more testing to be certain. I will also test out stage 3 again to make sure we haven’t missed anything.
I did not run this branch directly, since I have developed my project on top of the main branch; I just copied the deepspeed-related code from this branch into my code. The error is very strange: 1) stage 2 works, but stage 3 reports the error; 2) the error occurs during the loss backward pass, which rarely raises errors; 3) it is unclear which tensor has the reported size of 0. If you have no idea about this, I will have to run my code with deepspeed stage 2. Thanks!
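One ZeRO-3-specific detail that may explain a size-0 tensor: under stage 3, DeepSpeed partitions the parameters themselves, so each parameter's local storage is empty outside of a gather, while the full shape is kept on the ds_shape attribute. A hedged diagnostic sketch (assuming the module is managed by a ZeRO-3 engine):

```python
# Diagnostic sketch: print local vs. full parameter shapes under ZeRO-3.
# A parameter with local numel 0 but a non-empty full shape is partitioned;
# touching it outside DeepSpeed's forward/backward hooks yields size-0 tensors.
def report_partitioned_params(module):
    for name, p in module.named_parameters():
        full_shape = getattr(p, "ds_shape", tuple(p.shape))
        print(f"{name}: local numel={p.numel()}, full shape={full_shape}")

# To temporarily materialize the full parameters for inspection (e.g. to find
# one whose trailing dimension is 8192, as in the error message):
# with deepspeed.zero.GatheredParameters(list(module.parameters())):
#     report_partitioned_params(module)
```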
I tried this branch, and the training part works well. I also tested the evaluation part of the "Merge wilds mllm" branch. Unfortunately, there are some bugs; I reported two of them in https://github.com/mlfoundations/open_flamingo/pull/266.
New models
- VLM class. See documentation in src/vlm.py.
- VLMWithCrossAttention (dense xattn to fuse vision + language, Flamingo-style) vs. VLMWithLanguageStream (insert vision tokens into the language stream, Kosmos-style).
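To make the distinction concrete, here is a conceptual sketch of the two fusion styles; it is not the actual VLMWithCrossAttention / VLMWithLanguageStream implementation, and all class and parameter names below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Flamingo-style: language hidden states attend to vision features
    through a cross-attention block with a residual connection."""
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.xattn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, lang_hidden, vision_feats):
        # queries come from the language stream; keys/values from vision
        fused, _ = self.xattn(lang_hidden, vision_feats, vision_feats)
        return lang_hidden + fused

class LanguageStreamFusion(nn.Module):
    """Kosmos-style: project vision features to token embeddings and splice
    them directly into the language token sequence."""
    def __init__(self, d_vision: int, d_model: int):
        super().__init__()
        self.proj = nn.Linear(d_vision, d_model)

    def forward(self, lang_embeds, vision_feats, insert_pos: int):
        vision_tokens = self.proj(vision_feats)
        return torch.cat(
            [lang_embeds[:, :insert_pos], vision_tokens, lang_embeds[:, insert_pos:]],
            dim=1,
        )
```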
FSDP Updates

Training code refactor
- train_one_epoch now accepts a list of datasets and executes the same loss function on all of them. This permits us to decide the datasets to train on at runtime (e.g. just LAION) and makes adding in datasets more flexible. To train on a dataset, set the --{dataset_name}_shards arg (e.g. --laion_shards). If this is None, then we will not train on that dataset (i.e., skip LAION).
- train_one_epoch also now accepts a loss function decided at runtime. Losses are found in train/losses.py. Currently, only next token prediction is implemented, but this allows us to work on adding contrastive-generative losses.
- Changes to train/distributed.py in an attempt to streamline train/train.py.
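As an illustration of this design (not the actual open_flamingo code; the signature, loop body, and helper names are assumptions), a train_one_epoch that takes a runtime-selected list of datasets and a runtime-selected loss might look like:

```python
# Sketch only: the same loss is applied to every dataset passed in at runtime.
def train_one_epoch(model, datasets, loss_fn, optimizer):
    for dataset in datasets:               # e.g. [laion_loader, mmc4_loader]
        for batch in dataset:
            loss = loss_fn(model, batch)   # e.g. next-token prediction from train/losses.py
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

# Datasets are included only when their shards arg was set, e.g.:
# datasets = []
# if args.laion_shards is not None:        # --laion_shards
#     datasets.append(make_laion_loader(args))   # hypothetical helper
```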
Steps before merging
- lang_model instead of lang_encoder (this will not play well with the released weights; we need to decide what to do about the embeddings).
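One possible way to keep the released weights loadable after the rename, sketched as an assumption (the prefix names and checkpoint layout are guesses, and this does not settle the embeddings question):

```python
import torch

def remap_released_checkpoint(path: str) -> dict:
    """Rename old lang_encoder.* keys to lang_model.* before loading."""
    state_dict = torch.load(path, map_location="cpu")
    return {k.replace("lang_encoder.", "lang_model.", 1): v for k, v in state_dict.items()}

# model.load_state_dict(remap_released_checkpoint("checkpoint.pt"), strict=False)
```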
Steps after merging