google-research / pix2seq

Pix2Seq codebase: multi-tasks with generative modeling (autoregressive and diffusion)
Apache License 2.0

Cannot reproduce BLEU-4 score of 34.3 in Table 1 for image captioning task #31

Closed · tj-zhu closed this issue 1 year ago

tj-zhu commented 1 year ago

Hi there, first of all, thank you very much for sharing the code!

I tried the following command from README.md to evaluate model performance on the image captioning task.

config=configs/config_multi_task.py:captioning@coco/2017_captioning,vit-b
model_dir=/tmp/pix2seq_eval_cap
python3 run.py --config=$config --model_dir=$model_dir --mode=eval

The checkpoint used is vit_b_640x640-ckpt-93324, and I used pycocoevalcap to evaluate the results against the COCO Captions 2017 validation set.

The BLEU-4 score I got is only 14.1. In Table 1 of the paper, the score for the Pix2Seq v2 multi-task model (640×640) on the captioning task is 34.3. I am wondering if I did anything wrong. Could you please let me know how to resolve this?
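For reference, here is roughly how I scored the results (a minimal sketch of the standard pycocoevalcap pipeline; score_captions.py and the RESULTS_FILE name are my own, not part of this repo, and the annotation path follows the config below):

# score_captions.py: minimal sketch for scoring caption predictions with
# pycocoevalcap. Predictions are expected in the standard COCO results
# format: a JSON list of {"image_id": int, "caption": str}.
from pycocotools.coco import COCO
from pycocoevalcap.eval import COCOEvalCap

# Ground-truth annotations (path assembled from coco_annotations_dir_for_metrics
# and val_filename_for_metrics in the config below).
ANNOTATION_FILE = "tmp/coco_annotations/captions_val2017_eval_compatible.json"
RESULTS_FILE = "captions_predictions.json"  # hypothetical model-output file

coco = COCO(ANNOTATION_FILE)            # load ground-truth captions
coco_res = coco.loadRes(RESULTS_FILE)   # load predicted captions

coco_eval = COCOEvalCap(coco, coco_res)
# Evaluate only on the images that actually have predictions.
coco_eval.params["image_id"] = coco_res.getImgIds()
coco_eval.evaluate()

for metric, score in coco_eval.eval.items():
    print(f"{metric}: {score:.3f}")     # Bleu_4 is the number quoted in Table 1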

Thank you in advance!

(Two screenshots of the evaluation output attached.)

Here are the task configs:

{
  "dataset": {
    "batch_duplicates": 1,
    "cache_dataset": true,
    "coco_annotations_dir_for_metrics": "tmp/coco_annotations",
    "eval_num_examples": 5000,
    "eval_split": "validation",
    "label_shift": 0,
    "name": "coco/2017_captioning",
    "train_file_pattern": "gs://pix2seq/multi_task/data/coco/tfrecord/train*",
    "train_filename_for_metrics": "captions_train2017_eval_compatible.json",
    "train_num_examples": 118287,
    "train_split": "train",
    "val_file_pattern": "gs://pix2seq/multi_task/data/coco/tfrecord/val*",
    "val_filename_for_metrics": "captions_val2017_eval_compatible.json"
  },
  "datasets": [
    {
      "batch_duplicates": 1,
      "cache_dataset": true,
      "coco_annotations_dir_for_metrics": "tmp/coco_annotations",
      "eval_num_examples": 5000,
      "eval_split": "validation",
      "label_shift": 0,
      "name": "coco/2017_captioning",
      "train_file_pattern": "gs://pix2seq/multi_task/data/coco/tfrecord/train*",
      "train_filename_for_metrics": "captions_train2017_eval_compatible.json",
      "train_num_examples": 118287,
      "train_split": "train",
      "val_file_pattern": "gs://pix2seq/multi_task/data/coco/tfrecord/val*",
      "val_filename_for_metrics": "captions_val2017_eval_compatible.json"
    }
  ],
  "eval": {
    "batch_size": 8,
    "checkpoint_dir": "gs://pix2seq/multi_task/ckpt/vit_b_640x640",
    "steps": 0,
    "tag": "eval"
  },
  "model": {
    "coord_vocab_shift": 1000,
    "dec_proj_mode": "mlp",
    "decoder_output_bias": true,
    "dim_att": 768,
    "dim_att_dec": 512,
    "dim_mlp": 3072,
    "dim_mlp_dec": 2048,
    "drop_att": 0.0,
    "drop_path": 0.1,
    "drop_units": 0.1,
    "image_size": [
      640,
      640
    ],
    "max_seq_len": 512,
    "name": "encoder_ar_decoder",
    "num_decoder_layers": 6,
    "num_encoder_layers": 12,
    "num_heads": 12,
    "num_heads_dec": 16,
    "patch_size": 16,
    "pos_encoding": "sin_cos",
    "pos_encoding_dec": "learned",
    "pretrained_ckpt": "gs://pix2seq/obj365_pretrain/vit_b_640x640_b256_s400k",
    "resnet_variant": "c1",
    "shared_decoder_embedding": true,
    "text_vocab_shift": 3000,
    "use_cls_token": false,
    "vocab_size": 35000
  },
  "model_dir": "tmp/pix2seq_eval_cap",
  "optimization": {
    "beta1": 0.9,
    "beta2": 0.95,
    "end_lr_factor": 0.01,
    "eps": 1e-08,
    "global_clipnorm": -1,
    "learning_rate": 0.0001,
    "learning_rate_scaling": "none",
    "learning_rate_schedule": "linear",
    "optimizer": "adamw",
    "warmup_epochs": 10,
    "warmup_steps": 0,
    "weight_decay": 0.05
  },
  "task": {
    "captions_per_image": 5,
    "color_jitter_strength": 0.5,
    "eos_token_weight": 0.1,
    "image_size": [
      640,
      640
    ],
    "input_seq_drop_rate": 0.5,
    "jitter_scale_max": 1.0,
    "jitter_scale_min": 1.0,
    "max_instances_per_image": 5,
    "max_seq_len": 128,
    "metric": {
      "name": "coco_captioning"
    },
    "name": "captioning",
    "temperature": 1.0,
    "top_k": 0,
    "top_p": 1.0,
    "vocab_id": 13,
    "weight": 1.0
  },
  "tasks": [
    {
      "captions_per_image": 5,
      "color_jitter_strength": 0.5,
      "eos_token_weight": 0.1,
      "image_size": [
        640,
        640
      ],
      "input_seq_drop_rate": 0.5,
      "jitter_scale_max": 1.0,
      "jitter_scale_min": 1.0,
      "max_instances_per_image": 5,
      "max_seq_len": 128,
      "metric": {
        "name": "coco_captioning"
      },
      "name": "captioning",
      "temperature": 1.0,
      "top_k": 0,
      "top_p": 1.0,
      "vocab_id": 13,
      "weight": 1.0
    }
  ],
  "tokenizer": {
    "add_bos": false,
    "add_eos": true,
    "sentencepiece_model": "gs://pix2seq/multi_task/data/c4_en_32k_spm.model"
  },
  "train": {
    "batch_size": 128,
    "checkpoint_epochs": 1,
    "checkpoint_steps": 0,
    "epochs": 100,
    "keep_checkpoint_max": 10,
    "loss_type": "xent",
    "steps": 0
  }
}
chentingpc commented 1 year ago

@saxenasaurabh may have more context on this.

tj-zhu commented 1 year ago

@chentingpc Thank you! @saxenasaurabh Could you please help on this? Thank you very much!

chentingpc commented 1 year ago

Here is an example caption output with 32.8 BLEU-4: https://storage.googleapis.com/pix2seq/others/coco_caption_result_example.json. Could you check whether you get a similar metric on it? If you do, the issue is likely in the model inference; otherwise, the evaluation process may be the cause.
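For example (a sketch; score_captions.py refers to the hypothetical scoring script in the first comment, with RESULTS_FILE pointed at the downloaded file):

curl -LO https://storage.googleapis.com/pix2seq/others/coco_caption_result_example.json
python3 score_captions.py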

tj-zhu commented 1 year ago

@chentingpc Thank you for providing the file! I got a BLEU-4 score of 33.6 for the file you provided, so I think the evaluation process is fine. It might be the model inference process causing the issue.

Could it be that the configs (listed in the first comment of this issue) are off, or that the model checkpoint (vit_b_640x640-ckpt-93324) is not the right one?

chentingpc commented 1 year ago

Can you try running the inference with top_k=1?

Also, to double-check whether the checkpoint is the culprit, you could use it to run another task and see if the results check out.
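For example, assuming run.py parses the config via ml_collections config_flags (so fields can be overridden with dotted flags; the exact override syntax is an assumption):

config=configs/config_multi_task.py:captioning@coco/2017_captioning,vit-b
model_dir=/tmp/pix2seq_eval_cap
# Hypothetical override: top_k=1 makes decoding greedy (argmax), whereas the
# posted config (top_k=0, top_p=1.0, temperature=1.0) samples from the full
# softmax distribution, which typically lowers BLEU.
python3 run.py --config=$config --config.task.top_k=1 --model_dir=$model_dir --mode=eval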

tj-zhu commented 1 year ago

@chentingpc Thank you! I got a BLEU-4 score of 35.3 using top_k=1. I'll close the issue. Again, thank you very much for your help!