HenryHZY opened this issue:
By the way, I noticed that the README provides a Training section, but it lacks a lot of details :)
Thank you for your interest @HenryHZY. Can you please let me know what details (in addition to loss) would be most helpful, and I will be sure to add them. Currently, training from scratch is not possible because MMC4 is not public yet (but it will be very soon).
@anas-awadalla Thanks for your quick reply.
Taking your running command as an example, how can I change the following command to train only on LAION-2B, starting from a pre-trained OPT-1.3B?
torchrun --nnodes=1 --nproc_per_node=4 train.py \
--run_name flamingo3B \
--lm_path facebook/opt-1.3b \
--tokenizer_path facebook/opt-1.3b \
--dataset_resampled \
--laion_shards "/path/to/shards/shard-{0000..0999}.tar" \
--mmc4_shards "/path/to/shards/shard-{0000..0999}.tar" \
--batch_size_mmc4 4 \
--batch_size_laion 8 \
--train_num_samples_mmc4 125000 \
--train_num_samples_laion 250000 \
--loss_multiplier_laion 0.2 \
--workers=6 \
--num_epochs 250 \
--lr_scheduler constant \
--warmup_steps 5000 \
--use_media_placement_augmentation \
--mmc4_textsim_threshold 30
By the way, I would like to ask about the contribution of MMC4 to training. Have you conducted an ablation study comparing MMC4 + LAION-2B against LAION-2B only? Thank you very much for your time and consideration!
Got it. This is currently not an option, but it definitely should be! I will open an issue (feel free to contribute; if not, I can do this next week).
As for your second point, we have not done these experiments, but I agree that they would be very useful data points.
Thank you for the wonderful code release. I have a question about training flamingo9B with LAION-2B only, as shown below. I am stuck with a GPU out-of-memory error even at a batch size of 1 on 8x 80GB A100 GPUs. Did you use any specific option or training method for flamingo9B that differs from flamingo3B? Thank you in advance.
torchrun --nnodes=1 --nproc_per_node=4 train.py \
--run_name flamingo9B \
--lm_path {llama7B_path} \
--tokenizer_path {llama7B_path} \
--dataset_resampled \
--laion_shards {laion2b path} \
--batch_size_laion 1 \
--train_num_samples_laion 25000 \
--loss_multiplier_laion 1.0 \
--workers=6 \
--num_epochs 250 \
--lr_scheduler constant \
--warmup_steps 5000 \
--use_media_placement_augmentation
Yes, it is using #137, and I successfully trained flamingo3B (not 9B) with this code.
@Soonhwan-Kwon The issue here is that you are adding a cross-attention layer after every layer in LLaMA 7B. I am not sure what the total number of parameters is with this setup, but it is far larger than 9B. You should set --cross_attn_every_n_layers to 4 in the training args to get the right number of parameters. With this setup I am able to fit a batch size of 8 per GPU.
Thank you for the quick reply! You saved my day. Thank you!
@Soonhwan-Kwon @anas-awadalla Thanks for your great reply!! I will try it later:)
Hi @anas-awadalla, thanks for the great repo. I'm trying to reproduce the OpenFlamingo results using mpt-1b-redpajama-200b with a single 40GB A100 node. Even though the results on the VQA tasks are similar to what is reported, the COCO CIDEr numbers are much worse. The recently released paper mentions that 8 A100 nodes were used for training, so I'm wondering: have you done any experiments to check how long I would have to train to get the same performance as with 8 A100 nodes? Do I have to train long enough to see 5M MMC4 and 10M LAION samples? Have you seen any influence of the effective batch size on the final metrics when using multiple nodes versus, say, a single GPU?
Hello @itzsid! For all the models we released, we trained on 120M samples from LAION and 60M from MMC4. How many samples have you trained your version on, and what COCO performance are you getting? For our version of OpenFlamingo-3B we used effective batch sizes of 1152 and 2304 for MMC4 and LAION respectively, with 1875 warmup steps. However, you can use much lower batch sizes and still get similar performance, as long as you scale the warmup steps accordingly.
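To make "scale the warmup steps accordingly" concrete, here is a rough back-of-the-envelope calculation; interpreting "accordingly" as keeping the total number of warmup samples roughly constant is my assumption, and the smaller batch size is just a placeholder:

# Rough warmup-step scaling when shrinking the effective batch size
# (assumes the goal is to keep warmup *samples*, not steps, roughly constant).
ref_batch_laion = 2304        # effective LAION batch size quoted above
ref_warmup_steps = 1875       # warmup steps quoted above
warmup_samples = ref_batch_laion * ref_warmup_steps   # ~4.3M samples

my_batch_laion = 256          # placeholder for a smaller setup
my_warmup_steps = warmup_samples // my_batch_laion
print(my_warmup_steps)        # -> 16875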
@anas-awadalla I trained on approximately 10M samples. Zero-shot COCO CIDEr is 36.55 for me vs. 75.9 with the released model. I think one of the issues is that the loss curve I get for LAION does not exactly match the Figure 5 results in the paper. My LAION loss curve looks like this:
The MMC4 loss is in a similar range to what is shown in the paper:
We apply smoothing to the loss curves in the paper, so these loss plots look fine to me! Is that 10M samples of LAION and 5M samples of MMC4, then? If so, it seems like your training run is on track.
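For anyone comparing raw logs against the paper's figures: the exact smoothing used in the paper is not specified in this thread, but a simple exponential moving average, like the default smoothing in most experiment dashboards, is one reasonable way to reproduce the effect. A minimal sketch:

def ema_smooth(values, weight=0.9):
    """Exponential-moving-average smoothing for a noisy loss curve."""
    smoothed, last = [], values[0]
    for v in values:
        last = weight * last + (1 - weight) * v
        smoothed.append(last)
    return smoothed

# e.g. plot ema_smooth(laion_losses) before comparing against Figure 5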
Here is how 0-shot COCO improves during the training of our mpt-1b-redpajama-200b-dolly model:
Data scale | CIDEr score*
5M mmc4 + 10M laion | 36.29
10M mmc4 + 20M laion | 55.89
20M mmc4 + 40M laion | 66.04
30M mmc4 + 60M laion | 66.30
40M mmc4 + 80M laion | 72.16
50M mmc4 + 100M laion | 69.95
60M mmc4 + 120M laion | 72.34
*Note that these are validation scores so the numbers will look a little different than what we report in the paper.
Thanks @anas-awadalla. This is super helpful. I'll train the models longer and check the performance after 10M mmc4 + 20M laion.
@anas-awadalla I get similar values to the ones above after going through 150M samples. Thanks for the help! Next, I'm trying to train a larger model with MPT-7B (anas-awadalla/mpt-7b). I'm wondering how much you reduced the batch size to fit in memory? I'm using 40GB A100s. Also, I use amp_bf16 as suggested in the paper. These are the current args for the 7B model:
open_flamingo.train.train \
--lm_path anas-awadalla/mpt-7b \
--tokenizer_path anas-awadalla/mpt-7b \
--cross_attn_every_n_layers 4 \
--dataset_resampled \
--batch_size_mmc4 2 \
--batch_size_laion 4 \
--train_num_samples_mmc4 125000 \
--train_num_samples_laion 250000 \
--loss_multiplier_laion 0.2 \
--workers=4 \
--num_epochs 480 \
--warmup_steps 1875 \
--mmc4_textsim_threshold 0.24 \
--gradient_checkpointing \
--gradient_accumulation_steps 2 \
--precision amp_bf16
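As a sanity check on how these flags combine into the effective batch sizes discussed earlier, here is the arithmetic; the single node with 8 GPUs is an assumption for illustration only:

# Effective batch per optimizer step = per-GPU batch * grad-accum steps * num GPUs.
num_gpus = 8                 # assumed: one 8-GPU node
grad_accum = 2               # --gradient_accumulation_steps
bs_mmc4, bs_laion = 2, 4     # --batch_size_mmc4 / --batch_size_laion

print(bs_mmc4 * grad_accum * num_gpus)    # 32 mmc4 samples per step
print(bs_laion * grad_accum * num_gpus)   # 64 LAION samples per step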
Great! We used DDP with 80GB A100s for the 9B model. You should be able to train with higher batch sizes on the 40GB ones using our FSDP implementation. You can add the flags --fsdp, --fsdp_use_orig_params, and --fsdp_sharding_strategy "hybrid" to the train script to do so.
@anas-awadalla Using the FSDP args mentioned above with MPT-7B, I get this error:
File "/root/.cache/huggingface/modules/transformers_modules/anas-awadalla/mpt-7b/b772e556c8e8a17d087db6935e7cd019e5eefb0f/modeling_mpt.py", line 184, in forward
(attn_bias, attention_mask) = self._attn_bias(device=x.device, dtype=x.dtype, attention_mask=attention_mask, prefix_mask=prefix_mask, sequence_id=sequence_id)
File "/usr/local/lib/python3.8/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/anas-awadalla/mpt-7b/b772e556c8e8a17d087db6935e7cd019e5eefb0f/modeling_mpt.py", line 109, in _attn_bias
attn_bias = attn_bias.masked_fill(~attention_mask.view(-1, 1, 1, s_k), min_val)
RuntimeError: expected self and mask to be on the same device, but got mask on cpu and self on cuda:6
Any ideas?
So sorry for the late reply @itzsid! I noticed that there was a typo in the mmc4 forward pass. I fixed it in #250, and I anticipate this is what was leading to this error. Let me know how it goes.
Thanks @anas-awadalla. Similar to the LAION forward pass, I added these lines, which made it work:
input_ids = input_ids.to(device_id, dtype=cast_dtype, non_blocking=True)
attention_mask = attention_mask.to(device_id, dtype=cast_dtype, non_blocking=True)
However, this issue only shows up when fsdp is enabled.
Additionally, I had to comment out (https://github.com/mlfoundations/open_flamingo/blob/main/open_flamingo/src/flamingo.py#L271-L276):
self.lang_encoder.set_input_embeddings(
wrap(wrap(self.lang_encoder.get_input_embeddings()))
)
self.lang_encoder.set_output_embeddings(
wrap(wrap(self.lang_encoder.get_output_embeddings()))
)
otherwise I get the error:
File "/root/.cache/huggingface/modules/transformers_modules/anas-awadalla/mpt-7b/b772e556c8e8a17d087db6935e7cd019e5eefb0f/modeling_mpt.py", line 255, in forward
logits = F.linear(outputs.last_hidden_state.to(self.transformer.wte.weight.device), self.transformer.wte.weight)
RuntimeError: size mismatch, got 1024, 1024x4096,25743360
I also had to comment out (https://github.com/mlfoundations/open_flamingo/blob/main/open_flamingo/src/flamingo.py#L299):
self.lang_encoder.get_input_embeddings().clip_grad_norm_(max_norm)
Otherwise, it threw an error:
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1614, in __getattr__
raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'Embedding' object has no attribute 'clip_grad_norm_'
Did you have these issues on your end too? I'm using anas-awadalla/mpt-7b as the LM.
Hmm, no, we don't run into these. Just to confirm, are you using torch 2.0.1?
Yes, my torch version is 2.0.1+cu117. Do you have a Docker container with all the dependencies as well? I can try running it inside the container.
@anas-awadalla I started a training run with MPT-7B on 80GB nodes. However, I see VQA numbers going down as the number of samples seen increases. Did you see something similar?
Here is a plot for OK-VQA numbers after 100M LAION+MMC4 samples.
I used these args for training the 9B model:
--lm_path anas-awadalla/mpt-7b \
--tokenizer_path anas-awadalla/mpt-7b \
--cross_attn_every_n_layers 4 \
--dataset_resampled \
--batch_size_mmc4 16 \
--batch_size_laion 32 \
--train_num_samples_mmc4 125000 \
--train_num_samples_laion 250000 \
--loss_multiplier_laion 0.2 \
--workers=4 \
--num_epochs 480 \
--warmup_steps 1875 \
This is how downstream validation performance changes for COCO and VQAv2 for the 9B model. In our experience, VQA performance stays relatively constant apart from an initial increase during training. We do see the behavior you are reporting when the MMC4 image-text similarity threshold is too high (we use 0.24). What value are you using for that? Also, just checking that you are using the full MMC4, so that data repetition is not an issue?
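For readers new to this flag, here is a rough illustration of what the image-text similarity threshold does during MMC4 preprocessing; this is a simplified sketch based on my reading of the flag, not the repo's actual preprocessing code, and the "matched_sim" key is the MMC4 image-info field I am assuming it checks:

# Simplified sketch of --mmc4_textsim_threshold (not the repo's exact code).
# Assumes each MMC4 image entry carries a "matched_sim" CLIP similarity score
# between the image and its best-matching sentence in the document.
THRESHOLD = 0.24  # value used for the released models

def filter_document_images(image_infos):
    """Drop images whose best image-sentence similarity is below the threshold."""
    return [info for info in image_infos if info["matched_sim"] >= THRESHOLD]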
Thanks @anas-awadalla for the table. This is quite helpful. In my case, I do use mmc4_textsim_threshold=0.24. However, I use mmc4-ff with 375M images. Do you use the full mmc4 dataset with 571M images for this training?
Ah, OK, that could be the reason, because we do use the full set. Especially since you do hit ~37, and assuming this is zero-shot, that would match what we got. How are the COCO and VQAv2 scores? Are they also degrading?
Thanks, that makes sense. COCO numbers are stable around 60. I didn't measure VQAv2, but VizWiz VQA, TextVQA, and OK-VQA are degrading.
Hi @anas-awadalla,
As described in #124, "Our training took place on 32 80GB A100s. We trained on 5M samples from MMC4 and 10M from LAION 2B."
I am interested in the details of the loss during training, and if possible, I would like to extend this to other research fields. Could you please provide instructions for training OpenFlamingo from scratch? It would be of great help to my research.
Thank you very much for your great project!