HenryHZY opened this issue:
By the way, I noticed that the README provides a Training section, but it lacks a lot of details :)
Thank you for your interest @HenryHZY. Can you please let me know what details (in addition to loss) would be most helpful, and I will be sure to add them. Currently, training from scratch is not possible because MMC4 is not public yet (but it will be very soon).
@anas-awadalla Thanks for your quick reply.
Taking your running command as an example, how can I change the following command to train only on LAION-2B, starting from a pre-trained OPT-1.3B?
torchrun --nnodes=1 --nproc_per_node=4 train.py \
--run_name flamingo3B \
--lm_path facebook/opt-1.3b \
--tokenizer_path facebook/opt-1.3b \
--dataset_resampled \
--laion_shards "/path/to/shards/shard-{0000..0999}.tar" \
--mmc4_shards "/path/to/shards/shard-{0000..0999}.tar" \
--batch_size_mmc4 4 \
--batch_size_laion 8 \
--train_num_samples_mmc4 125000 \
--train_num_samples_laion 250000 \
--loss_multiplier_laion 0.2 \
--workers=6 \
--num_epochs 250 \
--lr_scheduler constant \
--warmup_steps 5000 \
--use_media_placement_augmentation \
--mmc4_textsim_threshold 30
By the way, I would like to ask about the contribution of MMC4 to training. Have you conducted an ablation study comparing MMC4 + LAION-2B against LAION-2B only? Thank you very much for your time and consideration!
Got it. This is currently not an option, but it definitely should be! I will open an issue (feel free to contribute; if not, I can do this next week).
As for your second point, we have not done these experiments, but I agree that they would be very useful data points.
Thank you for the wonderful code release. I have a question about training flamingo9B with LAION-2B only, as shown below. I am stuck with a GPU out-of-memory error even at a batch size of 1 on 8x 80GB A100 GPUs. Did you use any specific option or training method for flamingo9B that differs from flamingo3B? Thank you in advance.
torchrun --nnodes=1 --nproc_per_node=4 train.py \
--run_name flamingo9B \
--lm_path {llama7B_path} \
--tokenizer_path {llama7B_path} \
--dataset_resampled \
--laion_shards {laion2b path} \
--batch_size_laion 1 \
--train_num_samples_laion 25000 \
--loss_multiplier_laion 1.0 \
--workers=6 \
--num_epochs 250 \
--lr_scheduler constant \
--warmup_steps 5000 \
--use_media_placement_augmentation
Yes, it is using #137, and I successfully trained flamingo3B (not 9B) with this code.
@Soonhwan-Kwon The issue here is that you are adding a cross-attention layer after every layer in LLaMA 7B. I am not sure what the total number of parameters is with this setup, but it is far larger than 9B. You should set --cross_attn_every_n_layers to 4 in the training args to get the right number of parameters. With this setup I am able to fit a batch size of 8 per GPU.
Thank you for the quick reply! You saved my day. Thank you!
@Soonhwan-Kwon @anas-awadalla Thanks for your great reply!! I will try it later:)
Hi @anas-awadalla, thanks for the great repo. I'm trying to reproduce the OpenFlamingo results using mpt-1b-redpajama-200b with a single 40GB A100 node. Even though the results on the VQA tasks are similar to what is reported, the COCO CIDEr numbers are much worse. The recently released paper mentions that 8 A100 nodes were used for training, so I'm wondering: have you done any experiments to check how long I would have to train to get the same performance as with 8 A100 nodes? Do I have to train long enough to see 5M MMC4 and 10M LAION samples? Have you seen any influence of the effective batch size on the final metrics when using multiple nodes versus, say, a single GPU?
Hello @itzsid! For all the models we released, we trained on 120M samples from LAION and 60M from MMC4. How many samples have you trained your version on, and what COCO performance are you getting? For our version of OpenFlamingo-3B we used effective batch sizes of 1152 and 2304 for MMC4 and LAION respectively, with 1875 warmup steps. However, you can use much lower batch sizes and still get similar performance, as long as you scale the warmup steps accordingly.
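To make "scale the warmup steps accordingly" concrete, here is a rough back-of-the-envelope calculation; interpreting "accordingly" as keeping the total number of warmup samples roughly constant is my assumption, and the smaller batch size is just a placeholder:

# Rough warmup-step scaling when shrinking the effective batch size
# (assumes the goal is to keep warmup *samples*, not steps, roughly constant).
ref_batch_laion = 2304        # effective LAION batch size quoted above
ref_warmup_steps = 1875       # warmup steps quoted above
warmup_samples = ref_batch_laion * ref_warmup_steps   # ~4.3M samples

my_batch_laion = 256          # placeholder for a smaller setup
my_warmup_steps = warmup_samples // my_batch_laion
print(my_warmup_steps)        # -> 16875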
@anas-awadalla I trained on approximately 10M samples. Zero-shot COCO CIDEr is 36.55 for me vs. 75.9 with the released model. I think one of the issues is that the loss curve I get for LAION does not exactly match the Figure 5 results in the paper. My LAION loss curve looks like this:
The MMC4 loss is in a similar range to what is shown in the paper:
We apply smoothing to the loss curves in the paper, so these loss plots look fine to me! Is that 10M samples of LAION and 5M samples of MMC4, then? If so, it seems like your training run is on track.
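For anyone comparing raw logs against the paper's figures: the exact smoothing used in the paper is not specified in this thread, but a simple exponential moving average, like the default smoothing in most experiment dashboards, is one reasonable way to reproduce the effect. A minimal sketch:

def ema_smooth(values, weight=0.9):
    """Exponential-moving-average smoothing for a noisy loss curve."""
    smoothed, last = [], values[0]
    for v in values:
        last = weight * last + (1 - weight) * v
        smoothed.append(last)
    return smoothed

# e.g. plot ema_smooth(laion_losses) before comparing against Figure 5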
Here is how 0-shot COCO improves during the training of our mpt-1b-redpajama-200b-dolly model:
Data scale | CIDEr score*
5M mmc4 + 10M laion | 36.29
10M mmc4 + 20M laion | 55.89
20M mmc4 + 40M laion | 66.04
30M mmc4 + 60M laion | 66.30
40M mmc4 + 80M laion | 72.16
50M mmc4 + 100M laion | 69.95
60M mmc4 + 120M laion | 72.34
*Note that these are validation scores so the numbers will look a little different than what we report in the paper.
Thanks @anas-awadalla. This is super helpful. I'll train the models longer and check the performance after 10M mmc4 + 20M laion.
@anas-awadalla I get similar values to the ones above after going through 150M samples. Thanks for the help! Next, I'm trying to train a larger model with MPT-7B (anas-awadalla/mpt-7b). I'm wondering how much you reduced the batch size to fit in memory? I'm using 40GB A100s. Also, I use amp_bf16 as suggested in the paper. These are the current args for the 7B model:
open_flamingo.train.train \
--lm_path anas-awadalla/mpt-7b \
--tokenizer_path anas-awadalla/mpt-7b \
--cross_attn_every_n_layers 4 \
--dataset_resampled \
--batch_size_mmc4 2 \
--batch_size_laion 4 \
--train_num_samples_mmc4 125000 \
--train_num_samples_laion 250000 \
--loss_multiplier_laion 0.2 \
--workers=4 \
--num_epochs 480 \
--warmup_steps 1875 \
--mmc4_textsim_threshold 0.24 \
--gradient_checkpointing \
--gradient_accumulation_steps 2 \
--precision amp_bf16
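As a sanity check on how these flags combine into the effective batch sizes discussed earlier, here is the arithmetic; the single node with 8 GPUs is an assumption for illustration only:

# Effective batch per optimizer step = per-GPU batch * grad-accum steps * num GPUs.
num_gpus = 8                 # assumed: one 8-GPU node
grad_accum = 2               # --gradient_accumulation_steps
bs_mmc4, bs_laion = 2, 4     # --batch_size_mmc4 / --batch_size_laion

print(bs_mmc4 * grad_accum * num_gpus)    # 32 mmc4 samples per step
print(bs_laion * grad_accum * num_gpus)   # 64 LAION samples per step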
Great! We used DDP with 80GB A100s for the 9B model. You should be able to train with higher batch sizes on the 40GB ones using our FSDP implementation. You can add the flags --fsdp, --fsdp_use_orig_params, and --fsdp_sharding_strategy "hybrid" to the train script to do so.
@anas-awadalla Using the FSDP args mentioned above with MPT-7B, I get this error:
File "/root/.cache/huggingface/modules/transformers_modules/anas-awadalla/mpt-7b/b772e556c8e8a17d087db6935e7cd019e5eefb0f/modeling_mpt.py", line 184, in forward
(attn_bias, attention_mask) = self._attn_bias(device=x.device, dtype=x.dtype, attention_mask=attention_mask, prefix_mask=prefix_mask, sequence_id=sequence_id)
File "/usr/local/lib/python3.8/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/anas-awadalla/mpt-7b/b772e556c8e8a17d087db6935e7cd019e5eefb0f/modeling_mpt.py", line 109, in _attn_bias
attn_bias = attn_bias.masked_fill(~attention_mask.view(-1, 1, 1, s_k), min_val)
RuntimeError: expected self and mask to be on the same device, but got mask on cpu and self on cuda:6
Any ideas?
So sorry for the late reply @itzsid! I noticed that there was a typo in the mmc4 forward pass. I fixed it in #250, and I anticipate this is what was leading to this error. Let me know how it goes.
Thanks @anas-awadalla. Similar to the LAION forward pass, I added these lines, which made it work:
input_ids = input_ids.to(device_id, dtype=cast_dtype, non_blocking=True)
attention_mask = attention_mask.to(device_id, dtype=cast_dtype, non_blocking=True)
However, this issue only shows up when fsdp is enabled.
Additionally, I had to comment out (https://github.com/mlfoundations/open_flamingo/blob/main/open_flamingo/src/flamingo.py#L271-L276):
self.lang_encoder.set_input_embeddings(
wrap(wrap(self.lang_encoder.get_input_embeddings()))
)
self.lang_encoder.set_output_embeddings(
wrap(wrap(self.lang_encoder.get_output_embeddings()))
)
otherwise I get the error:
File "/root/.cache/huggingface/modules/transformers_modules/anas-awadalla/mpt-7b/b772e556c8e8a17d087db6935e7cd019e5eefb0f/modeling_mpt.py", line 255, in forward
logits = F.linear(outputs.last_hidden_state.to(self.transformer.wte.weight.device), self.transformer.wte.weight)
RuntimeError: size mismatch, got 1024, 1024x4096,25743360
I also had to comment out (https://github.com/mlfoundations/open_flamingo/blob/main/open_flamingo/src/flamingo.py#L299):
self.lang_encoder.get_input_embeddings().clip_grad_norm_(max_norm)
Otherwise, it threw an error:
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1614, in __getattr__
raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'Embedding' object has no attribute 'clip_grad_norm_'
Did you have these issues on your end too? I'm using anas-awadalla/mpt-7b as the LM.
Hmm, no, we don't run into these. Just to confirm, are you using torch 2.0.1?
Yes, my torch version is 2.0.1+cu117. Do you have a Docker container with all the dependencies as well? I can try running it inside the container.
@anas-awadalla I started a training run with MPT-7B on 80GB nodes. However, I see VQA numbers going down as the number of samples seen increases. Did you see something similar?
Here is a plot for OK-VQA numbers after 100M LAION+MMC4 samples.
I used these args for training the 9B model:
--lm_path anas-awadalla/mpt-7b \
--tokenizer_path anas-awadalla/mpt-7b \
--cross_attn_every_n_layers 4 \
--dataset_resampled \
--batch_size_mmc4 16 \
--batch_size_laion 32 \
--train_num_samples_mmc4 125000 \
--train_num_samples_laion 250000 \
--loss_multiplier_laion 0.2 \
--workers=4 \
--num_epochs 480 \
--warmup_steps 1875 \
This is how downstream validation performance changes for COCO and VQAv2 for the 9B model. In our experience, VQA performance stays relatively constant apart from an initial increase during training. We do see the behavior you are reporting when the MMC4 image-text similarity threshold is too high (we use 0.24). What value are you using for that? Also, just checking that you are using the full MMC4, so that data repetition is not an issue?
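For readers new to this flag, here is a rough illustration of what the image-text similarity threshold does during MMC4 preprocessing; this is a simplified sketch based on my reading of the flag, not the repo's actual preprocessing code, and the "matched_sim" key is the MMC4 image-info field I am assuming it checks:

# Simplified sketch of --mmc4_textsim_threshold (not the repo's exact code).
# Assumes each MMC4 image entry carries a "matched_sim" CLIP similarity score
# between the image and its best-matching sentence in the document.
THRESHOLD = 0.24  # value used for the released models

def filter_document_images(image_infos):
    """Drop images whose best image-sentence similarity is below the threshold."""
    return [info for info in image_infos if info["matched_sim"] >= THRESHOLD]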
Thanks @anas-awadalla for the table. This is quite helpful. In my case, I do use mmc4_textsim_threshold=0.24. However, I use mmc4-ff with 375M images. Do you use the full mmc4 dataset with 571M images for this training?
Ah, OK, that could be the reason, because we do use the full set. Especially since you do hit ~37, and assuming this is zero-shot, that would match what we got. How are the COCO and VQAv2 scores? Are they also degrading?
Thanks, that makes sense. COCO numbers are stable around 60. I didn't measure VQAv2, but VizWiz VQA, TextVQA, and OK-VQA are degrading.
Hi @anas-awadalla,
As described in #124, "Our training took place on 32 80GB A100s. We trained on 5M samples from MMC4 and 10M from LAION 2B."
I am interested in the details of the loss during training, and if possible, I would like to extend this to other research fields. Could you please provide instructions for training OpenFlamingo from scratch? It would be of great help to my research.
Thank you very much for your great project!