Support open_clip with NPU backend #813

MengqingCao opened 5 months ago

MengqingCao commented 5 months ago

open_clip performs well for CLIP model training and inference, but unfortunately it currently seems to support only GPU and CPU. I noticed that there is demand for other backends.

This PR adds Ascend NPU backend support. I tested the NPU support by evaluating the ViT-L-14 model on the ImageNet-1k dataset, and everything went well.
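For readers who have not used an NPU with PyTorch, here is a minimal sketch of how a device string could be chosen at runtime. It is an illustration, not the code in this PR: the helper name `pick_device` is hypothetical, and it assumes the Ascend adapter is installed as the `torch_npu` package, whose import registers the `npu` device type with PyTorch.

```python
import importlib.util


def pick_device(prefer_npu: bool = True) -> str:
    """Return a device string such as "npu:0", "cuda:0", or "cpu"."""
    # Without PyTorch there is nothing to select, so fall back to "cpu".
    if importlib.util.find_spec("torch") is None:
        return "cpu"
    import torch

    # Assumption: the Ascend adapter ships as the `torch_npu` package and
    # importing it makes `torch.npu` available.
    if prefer_npu and importlib.util.find_spec("torch_npu") is not None:
        import torch_npu  # noqa: F401  (imported for its side effect)

        if torch.npu.is_available():
            return "npu:0"
    if torch.cuda.is_available():
        return "cuda:0"
    return "cpu"


print(pick_device())
```

On a machine without an NPU or GPU this simply prints `cpu`, so the same script can run unchanged on all three kinds of host.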

The evaluation on the NPU was run with:

python3 -m training.main \
    --model ViT-L-14 \
    --pretrained "./models/CLIP-ViT-L-14-DataComp.XL-s13B-b90K/open_clip_pytorch_model.bin" \
    --seed 0 \
    --imagenet-val './data/ImageNet-1000/val'

The pretrained weights were downloaded from laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K.

The evaluation results of ViT-L-14 on npu:

detailed training logs:

2024-02-05,08:00:10 | INFO | Running with a single process. Device npu:0.
2024-02-05,08:00:10 | INFO | Loaded ViT-L-14 model config.
2024-02-05,08:00:17 | INFO | Loading pretrained ViT-L-14 weights (./models/CLIP-ViT-L-14-DataComp.XL-s13B-b90K/open_clip_pytorch_model.bin).
2024-02-05,08:00:21 | INFO | Model:
2024-02-05,08:00:21 | INFO | CLIP(
  (visual): VisionTransformer(
    (conv1): Conv2d(3, 1024, kernel_size=(14, 14), stride=(14, 14), bias=False)
    (patch_dropout): Identity()
    (ln_pre): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
    (transformer): Transformer(
      (resblocks): ModuleList(
        (0-23): 24 x ResidualAttentionBlock(
          (ln_1): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=1024, out_features=1024, bias=True)
          (ls_1): Identity()
          (ln_2): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (mlp): Sequential(
            (c_fc): Linear(in_features=1024, out_features=4096, bias=True)
            (gelu): GELU(approximate='none')
            (c_proj): Linear(in_features=4096, out_features=1024, bias=True)
          (ls_2): Identity()
    (ln_post): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
  (transformer): Transformer(
    (resblocks): ModuleList(
      (0-11): 12 x ResidualAttentionBlock(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
        (ls_1): Identity()
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): Sequential(
          (c_fc): Linear(in_features=768, out_features=3072, bias=True)
          (gelu): GELU(approximate='none')
          (c_proj): Linear(in_features=3072, out_features=768, bias=True)
        (ls_2): Identity()
  (token_embedding): Embedding(49408, 768)
  (ln_final): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
2024-02-05,08:00:21 | INFO | Params:
2024-02-05,08:00:21 | INFO |   accum_freq: 1
2024-02-05,08:00:21 | INFO |   aug_cfg: {}
2024-02-05,08:00:21 | INFO |   batch_size: 64
2024-02-05,08:00:21 | INFO |   beta1: 0.9
2024-02-05,08:00:21 | INFO |   beta2: 0.98
2024-02-05,08:00:21 | INFO |   checkpoint_path: ./logs/2024_02_05-08_00_10-model_ViT-L-14-lr_0.0005-b_64-j_4-p_amp/checkpoints
2024-02-05,08:00:21 | INFO |   coca_caption_loss_weight: 2.0
2024-02-05,08:00:21 | INFO |   coca_contrastive_loss_weight: 1.0
2024-02-05,08:00:21 | INFO |   copy_codebase: False
2024-02-05,08:00:21 | INFO |   csv_caption_key: title
2024-02-05,08:00:21 | INFO |   csv_img_key: filepath
2024-02-05,08:00:21 | INFO |   csv_separator:   
2024-02-05,08:00:21 | INFO |   dataset_resampled: False
2024-02-05,08:00:21 | INFO |   dataset_type: auto
2024-02-05,08:00:21 | INFO |   ddp_static_graph: False
2024-02-05,08:00:21 | INFO |   debug: False
2024-02-05,08:00:21 | INFO |   delete_previous_checkpoint: False
2024-02-05,08:00:21 | INFO |   device: npu:0
2024-02-05,08:00:21 | INFO |   dist_backend: nccl
2024-02-05,08:00:21 | INFO |   dist_url: env://
2024-02-05,08:00:21 | INFO |   distill: False
2024-02-05,08:00:21 | INFO |   distill_model: None
2024-02-05,08:00:21 | INFO |   distill_pretrained: None
2024-02-05,08:00:21 | INFO |   distributed: False
2024-02-05,08:00:21 | INFO |   epochs: 32
2024-02-05,08:00:21 | INFO |   epochs_cooldown: None
2024-02-05,08:00:21 | INFO |   eps: 1e-06
2024-02-05,08:00:21 | INFO |   force_custom_text: False
2024-02-05,08:00:21 | INFO |   force_image_size: None
2024-02-05,08:00:21 | INFO |   force_patch_dropout: None
2024-02-05,08:00:21 | INFO |   force_quick_gelu: False
2024-02-05,08:00:21 | INFO |   gather_with_grad: False
2024-02-05,08:00:21 | INFO |   grad_checkpointing: False
2024-02-05,08:00:21 | INFO |   grad_clip_norm: None
2024-02-05,08:00:21 | INFO |   horovod: False
2024-02-05,08:00:21 | INFO |   image_interpolation: None
2024-02-05,08:00:21 | INFO |   image_mean: None
2024-02-05,08:00:21 | INFO |   image_resize_mode: None
2024-02-05,08:00:21 | INFO |   image_std: None
2024-02-05,08:00:21 | INFO |   imagenet_v2: None
2024-02-05,08:00:21 | INFO |   imagenet_val: ./data/ImageNet-1000/val
2024-02-05,08:00:21 | INFO |   local_loss: False
2024-02-05,08:00:21 | INFO |   local_rank: 0
2024-02-05,08:00:21 | INFO |   lock_image: False
2024-02-05,08:00:21 | INFO |   lock_image_freeze_bn_stats: False
2024-02-05,08:00:21 | INFO |   lock_image_unlocked_groups: 0
2024-02-05,08:00:21 | INFO |   lock_text: False
2024-02-05,08:00:21 | INFO |   lock_text_freeze_layer_norm: False
2024-02-05,08:00:21 | INFO |   lock_text_unlocked_layers: 0
2024-02-05,08:00:21 | INFO |   log_every_n_steps: 100
2024-02-05,08:00:21 | INFO |   log_level: 20
2024-02-05,08:00:21 | INFO |   log_local: False
2024-02-05,08:00:21 | INFO |   log_path: ./logs/2024_02_05-08_00_10-model_ViT-L-14-lr_0.0005-b_64-j_4-p_amp/out.log
2024-02-05,08:00:21 | INFO |   logs: ./logs/
2024-02-05,08:00:21 | INFO |   lr: 0.0005
2024-02-05,08:00:21 | INFO |   lr_cooldown_end: 0.0
2024-02-05,08:00:21 | INFO |   lr_cooldown_power: 1.0
2024-02-05,08:00:21 | INFO |   lr_scheduler: cosine
2024-02-05,08:00:21 | INFO |   model: ViT-L-14
2024-02-05,08:00:21 | INFO |   name: 2024_02_05-08_00_10-model_ViT-L-14-lr_0.0005-b_64-j_4-p_amp
2024-02-05,08:00:21 | INFO |   no_set_device_rank: False
2024-02-05,08:00:21 | INFO |   precision: amp
2024-02-05,08:00:21 | INFO |   pretrained: ./models/CLIP-ViT-L-14-DataComp.XL-s13B-b90K/open_clip_pytorch_model.bin
2024-02-05,08:00:21 | INFO |   pretrained_image: False
2024-02-05,08:00:21 | INFO |   rank: 0
2024-02-05,08:00:21 | INFO |   remote_sync: None
2024-02-05,08:00:21 | INFO |   remote_sync_frequency: 300
2024-02-05,08:00:21 | INFO |   remote_sync_protocol: s3
2024-02-05,08:00:21 | INFO |   report_to: 
2024-02-05,08:00:21 | INFO |   resume: None
2024-02-05,08:00:21 | INFO |   save_frequency: 1
2024-02-05,08:00:21 | INFO |   save_most_recent: False
2024-02-05,08:00:21 | INFO |   seed: 0
2024-02-05,08:00:21 | INFO |   siglip: False
2024-02-05,08:00:21 | INFO |   skip_scheduler: False
2024-02-05,08:00:21 | INFO |   tensorboard: False
2024-02-05,08:00:21 | INFO |   tensorboard_path: 
2024-02-05,08:00:21 | INFO |   torchcompile: False
2024-02-05,08:00:21 | INFO |   torchscript: False
2024-02-05,08:00:21 | INFO |   trace: False
2024-02-05,08:00:21 | INFO |   train_data: None
2024-02-05,08:00:21 | INFO |   train_data_upsampling_factors: None
2024-02-05,08:00:21 | INFO |   train_num_samples: None
2024-02-05,08:00:21 | INFO |   use_bn_sync: False
2024-02-05,08:00:21 | INFO |   use_bnb_linear: None
2024-02-05,08:00:21 | INFO |   val_data: None
2024-02-05,08:00:21 | INFO |   val_frequency: 1
2024-02-05,08:00:21 | INFO |   val_num_samples: None
2024-02-05,08:00:21 | INFO |   wandb: False
2024-02-05,08:00:21 | INFO |   wandb_notes: 
2024-02-05,08:00:21 | INFO |   wandb_project_name: open-clip
2024-02-05,08:00:21 | INFO |   warmup: 10000
2024-02-05,08:00:21 | INFO |   wd: 0.2
2024-02-05,08:00:21 | INFO |   workers: 4
2024-02-05,08:00:21 | INFO |   world_size: 1
2024-02-05,08:00:21 | INFO |   zeroshot_frequency: 2
2024-02-05,08:00:21 | INFO | Starting zero-shot imagenet.
2024-02-05,08:00:21 | INFO | Building zero-shot classifier
2024-02-05,08:01:13 | INFO | Using classifier
2024-02-05,08:02:09 | INFO | Finished zero-shot imagenet.
2024-02-05,08:02:09 | INFO | Eval Epoch: 0 imagenet-zeroshot-val-top1: 0.7889   imagenet-zeroshot-val-top5: 0.9546
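
The last log line reports zero-shot top-1 and top-5 accuracy. As a reminder of what these two numbers mean, here is a minimal plain-Python sketch of top-k accuracy on made-up toy scores. The function name `topk_accuracy` and the data are illustrative assumptions; this is not the evaluation code used by open_clip.

```python
from typing import Sequence


def topk_accuracy(scores: Sequence[Sequence[float]], labels: Sequence[int], k: int) -> float:
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        # Indices of the k largest scores for this sample.
        topk = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += label in topk
    return hits / len(labels)


# Toy data: 3 samples, 4 classes (values are made up for illustration).
scores = [
    [0.1, 0.7, 0.1, 0.1],  # true class 1 is ranked first
    [0.4, 0.3, 0.2, 0.1],  # true class 1 is ranked second
    [0.1, 0.2, 0.3, 0.4],  # true class 0 is ranked last
]
labels = [1, 1, 0]
print(topk_accuracy(scores, labels, k=1))
print(topk_accuracy(scores, labels, k=2))
```

With these toy scores, one of three samples is correct at k=1 and two of three at k=2, which is why top-5 is always at least as high as top-1.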

rom1504 commented 5 months ago

Cool! How is the inference and training speed?

MengqingCao commented 5 months ago

> Cool! How is the inference and training speed?

Your reply was amazingly fast! :) As the screenshot shows, inference with ViT-L-14 on the ImageNet-1k validation set takes around 55 s (batch size 64, one NPU device). [image]

So I think it's fast, but I haven't measured the exact FLOPS. Are FLOPS numbers required?
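
As a rough sanity check, the 55 s figure can be turned into a throughput number. The sketch below assumes the run covered the full ImageNet-1k validation set of 50,000 images and that the 55 s includes data loading, so it is only an end-to-end estimate; the helper name `samples_per_second` is hypothetical.

```python
def samples_per_second(num_samples: int, elapsed_s: float, num_devices: int = 1) -> float:
    """Average throughput per device over a whole run."""
    if elapsed_s <= 0 or num_devices <= 0:
        raise ValueError("elapsed_s and num_devices must be positive")
    return num_samples / elapsed_s / num_devices


# Assumption: the full ImageNet-1k validation set (50,000 images) was evaluated.
IMAGENET_VAL_IMAGES = 50_000
print(f"{samples_per_second(IMAGENET_VAL_IMAGES, 55.0):.0f} samples/s")  # prints "909 samples/s"
```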

rom1504 commented 5 months ago

A metric we usually look at is samples/s per accelerator.

Some baselines on one 3080 GPU:

Usually, increasing the batch size to values like 256 helps.

For training on one A100, it looks like:

Usually with batch sizes around 128 per GPU.

I think it would be very interesting to have similar numbers on NPU.

MengqingCao commented 5 months ago

Sorry for the late reply, and thanks for your explanation.

I noticed that this metric is already implemented in the training pipeline (under src/training/), where it is named samples_per_second_per_gpu. I measured samples/s on the NPU with the training pipeline; the results are below.

I'm a bit unsure whether the inference speed you mentioned refers to evaluating the CLIP model, or to using the CLIP model for zero-shot image classification.

B/32: [image]

L/14: [image]
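
For anyone who wants to reproduce this kind of measurement outside the training script, here is a minimal sketch of a per-device samples/s timer. It assumes the usual definition (samples processed per step on one device, divided by the step time); the function name and its parameters are hypothetical and are not taken from open_clip.

```python
import time
from typing import Callable


def measure_samples_per_second_per_device(
    step_fn: Callable[[], None],
    batch_size: int,
    num_steps: int = 20,
    warmup_steps: int = 5,
    accum_freq: int = 1,
) -> float:
    """Time `num_steps` calls of `step_fn` and return samples/s for one device.

    `step_fn` should block until the device has finished its work (for example
    by synchronizing), otherwise asynchronous kernels make the timing too optimistic.
    """
    # Warm-up steps are excluded so one-time setup cost does not skew the result.
    for _ in range(warmup_steps):
        step_fn()
    start = time.perf_counter()
    for _ in range(num_steps):
        step_fn()
    elapsed = time.perf_counter() - start
    return accum_freq * batch_size * num_steps / elapsed


# Stand-in for a real training step: sleep 10 ms per step.
rate = measure_samples_per_second_per_device(
    lambda: time.sleep(0.01), batch_size=64, num_steps=5, warmup_steps=1
)
print(f"{rate:.0f} samples/s per device")
```

Replacing the sleeping stand-in with a real forward/backward step at the batch sizes mentioned above (128 or 256 per device) would give numbers comparable across GPU and NPU.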

MengqingCao commented 4 months ago

@rom1504 Hi, it has been a few weeks. If there are any suggestions or concerns, please let me know and I'll address them as soon as possible.

MengqingCao commented 3 months ago

Could anyone help review this? Thanks 👍 @rom1504 @rwightman @gabrielilharco @bryant1410 @mitchellnw

MengqingCao commented 1 month ago

Sorry to bother you. Could you help review this PR? @rwightman @gabrielilharco