mlfoundations / open_clip

An open source implementation of CLIP.

Support open_clip with NPU backend #813

Open MengqingCao opened 5 months ago

MengqingCao commented 5 months ago

open_clip performs great for CLIP model training and inference, but it currently seems to support only GPU and CPU devices. I have also noticed that there is demand for other backends.

This PR adds Ascend NPU backend support. I tested the NPU support by evaluating the ViT-L-14 model on the ImageNet-1k dataset, and everything works well.
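For context, here is a minimal sketch of what running open_clip on the Ascend NPU looks like from Python. This is illustrative only: it assumes the torch_npu plugin is installed and reuses the model name and checkpoint path from the command below; it is not the exact code changed in this PR.

import torch
import torch_npu  # Ascend plugin; registers the "npu" device type and a torch.npu namespace
import open_clip

# Pick the NPU if one is visible, otherwise fall back to CPU.
device = "npu:0" if torch.npu.is_available() else "cpu"

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-L-14",
    pretrained="./models/CLIP-ViT-L-14-DataComp.XL-s13B-b90K/open_clip_pytorch_model.bin",
)
model = model.to(device).eval()
tokenizer = open_clip.get_tokenizer("ViT-L-14")

text = tokenizer(["a photo of a dog", "a photo of a cat"]).to(device)
with torch.no_grad():
    text_features = model.encode_text(text)
print(text_features.shape)  # text embeddings computed on the NPU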

The evaluation on NPU was run with:

python3 -m training.main \
    --model ViT-L-14 \
    --pretrained "./models/CLIP-ViT-L-14-DataComp.XL-s13B-b90K/open_clip_pytorch_model.bin" \
    --seed 0 \
    --imagenet-val './data/ImageNet-1000/val'

The pretrained weights were downloaded from laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K.

The evaluation results of ViT-L-14 on NPU: top-1 0.7889 and top-5 0.9546 on the ImageNet-1k validation set (see the last line of the log below).

Detailed evaluation logs:

2024-02-05,08:00:10 | INFO | Running with a single process. Device npu:0.
2024-02-05,08:00:10 | INFO | Loaded ViT-L-14 model config.
2024-02-05,08:00:17 | INFO | Loading pretrained ViT-L-14 weights (./models/CLIP-ViT-L-14-DataComp.XL-s13B-b90K/open_clip_pytorch_model.bin).
2024-02-05,08:00:21 | INFO | Model:
2024-02-05,08:00:21 | INFO | CLIP(
  (visual): VisionTransformer(
    (conv1): Conv2d(3, 1024, kernel_size=(14, 14), stride=(14, 14), bias=False)
    (patch_dropout): Identity()
    (ln_pre): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
    (transformer): Transformer(
      (resblocks): ModuleList(
        (0-23): 24 x ResidualAttentionBlock(
          (ln_1): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=1024, out_features=1024, bias=True)
          )
          (ls_1): Identity()
          (ln_2): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (mlp): Sequential(
            (c_fc): Linear(in_features=1024, out_features=4096, bias=True)
            (gelu): GELU(approximate='none')
            (c_proj): Linear(in_features=4096, out_features=1024, bias=True)
          )
          (ls_2): Identity()
        )
      )
    )
    (ln_post): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
  )
  (transformer): Transformer(
    (resblocks): ModuleList(
      (0-11): 12 x ResidualAttentionBlock(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
        )
        (ls_1): Identity()
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): Sequential(
          (c_fc): Linear(in_features=768, out_features=3072, bias=True)
          (gelu): GELU(approximate='none')
          (c_proj): Linear(in_features=3072, out_features=768, bias=True)
        )
        (ls_2): Identity()
      )
    )
  )
  (token_embedding): Embedding(49408, 768)
  (ln_final): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
2024-02-05,08:00:21 | INFO | Params:
2024-02-05,08:00:21 | INFO |   accum_freq: 1
2024-02-05,08:00:21 | INFO |   aug_cfg: {}
2024-02-05,08:00:21 | INFO |   batch_size: 64
2024-02-05,08:00:21 | INFO |   beta1: 0.9
2024-02-05,08:00:21 | INFO |   beta2: 0.98
2024-02-05,08:00:21 | INFO |   checkpoint_path: ./logs/2024_02_05-08_00_10-model_ViT-L-14-lr_0.0005-b_64-j_4-p_amp/checkpoints
2024-02-05,08:00:21 | INFO |   coca_caption_loss_weight: 2.0
2024-02-05,08:00:21 | INFO |   coca_contrastive_loss_weight: 1.0
2024-02-05,08:00:21 | INFO |   copy_codebase: False
2024-02-05,08:00:21 | INFO |   csv_caption_key: title
2024-02-05,08:00:21 | INFO |   csv_img_key: filepath
2024-02-05,08:00:21 | INFO |   csv_separator:   
2024-02-05,08:00:21 | INFO |   dataset_resampled: False
2024-02-05,08:00:21 | INFO |   dataset_type: auto
2024-02-05,08:00:21 | INFO |   ddp_static_graph: False
2024-02-05,08:00:21 | INFO |   debug: False
2024-02-05,08:00:21 | INFO |   delete_previous_checkpoint: False
2024-02-05,08:00:21 | INFO |   device: npu:0
2024-02-05,08:00:21 | INFO |   dist_backend: nccl
2024-02-05,08:00:21 | INFO |   dist_url: env://
2024-02-05,08:00:21 | INFO |   distill: False
2024-02-05,08:00:21 | INFO |   distill_model: None
2024-02-05,08:00:21 | INFO |   distill_pretrained: None
2024-02-05,08:00:21 | INFO |   distributed: False
2024-02-05,08:00:21 | INFO |   epochs: 32
2024-02-05,08:00:21 | INFO |   epochs_cooldown: None
2024-02-05,08:00:21 | INFO |   eps: 1e-06
2024-02-05,08:00:21 | INFO |   force_custom_text: False
2024-02-05,08:00:21 | INFO |   force_image_size: None
2024-02-05,08:00:21 | INFO |   force_patch_dropout: None
2024-02-05,08:00:21 | INFO |   force_quick_gelu: False
2024-02-05,08:00:21 | INFO |   gather_with_grad: False
2024-02-05,08:00:21 | INFO |   grad_checkpointing: False
2024-02-05,08:00:21 | INFO |   grad_clip_norm: None
2024-02-05,08:00:21 | INFO |   horovod: False
2024-02-05,08:00:21 | INFO |   image_interpolation: None
2024-02-05,08:00:21 | INFO |   image_mean: None
2024-02-05,08:00:21 | INFO |   image_resize_mode: None
2024-02-05,08:00:21 | INFO |   image_std: None
2024-02-05,08:00:21 | INFO |   imagenet_v2: None
2024-02-05,08:00:21 | INFO |   imagenet_val: ./data/ImageNet-1000/val
2024-02-05,08:00:21 | INFO |   local_loss: False
2024-02-05,08:00:21 | INFO |   local_rank: 0
2024-02-05,08:00:21 | INFO |   lock_image: False
2024-02-05,08:00:21 | INFO |   lock_image_freeze_bn_stats: False
2024-02-05,08:00:21 | INFO |   lock_image_unlocked_groups: 0
2024-02-05,08:00:21 | INFO |   lock_text: False
2024-02-05,08:00:21 | INFO |   lock_text_freeze_layer_norm: False
2024-02-05,08:00:21 | INFO |   lock_text_unlocked_layers: 0
2024-02-05,08:00:21 | INFO |   log_every_n_steps: 100
2024-02-05,08:00:21 | INFO |   log_level: 20
2024-02-05,08:00:21 | INFO |   log_local: False
2024-02-05,08:00:21 | INFO |   log_path: ./logs/2024_02_05-08_00_10-model_ViT-L-14-lr_0.0005-b_64-j_4-p_amp/out.log
2024-02-05,08:00:21 | INFO |   logs: ./logs/
2024-02-05,08:00:21 | INFO |   lr: 0.0005
2024-02-05,08:00:21 | INFO |   lr_cooldown_end: 0.0
2024-02-05,08:00:21 | INFO |   lr_cooldown_power: 1.0
2024-02-05,08:00:21 | INFO |   lr_scheduler: cosine
2024-02-05,08:00:21 | INFO |   model: ViT-L-14
2024-02-05,08:00:21 | INFO |   name: 2024_02_05-08_00_10-model_ViT-L-14-lr_0.0005-b_64-j_4-p_amp
2024-02-05,08:00:21 | INFO |   no_set_device_rank: False
2024-02-05,08:00:21 | INFO |   precision: amp
2024-02-05,08:00:21 | INFO |   pretrained: ./models/CLIP-ViT-L-14-DataComp.XL-s13B-b90K/open_clip_pytorch_model.bin
2024-02-05,08:00:21 | INFO |   pretrained_image: False
2024-02-05,08:00:21 | INFO |   rank: 0
2024-02-05,08:00:21 | INFO |   remote_sync: None
2024-02-05,08:00:21 | INFO |   remote_sync_frequency: 300
2024-02-05,08:00:21 | INFO |   remote_sync_protocol: s3
2024-02-05,08:00:21 | INFO |   report_to: 
2024-02-05,08:00:21 | INFO |   resume: None
2024-02-05,08:00:21 | INFO |   save_frequency: 1
2024-02-05,08:00:21 | INFO |   save_most_recent: False
2024-02-05,08:00:21 | INFO |   seed: 0
2024-02-05,08:00:21 | INFO |   siglip: False
2024-02-05,08:00:21 | INFO |   skip_scheduler: False
2024-02-05,08:00:21 | INFO |   tensorboard: False
2024-02-05,08:00:21 | INFO |   tensorboard_path: 
2024-02-05,08:00:21 | INFO |   torchcompile: False
2024-02-05,08:00:21 | INFO |   torchscript: False
2024-02-05,08:00:21 | INFO |   trace: False
2024-02-05,08:00:21 | INFO |   train_data: None
2024-02-05,08:00:21 | INFO |   train_data_upsampling_factors: None
2024-02-05,08:00:21 | INFO |   train_num_samples: None
2024-02-05,08:00:21 | INFO |   use_bn_sync: False
2024-02-05,08:00:21 | INFO |   use_bnb_linear: None
2024-02-05,08:00:21 | INFO |   val_data: None
2024-02-05,08:00:21 | INFO |   val_frequency: 1
2024-02-05,08:00:21 | INFO |   val_num_samples: None
2024-02-05,08:00:21 | INFO |   wandb: False
2024-02-05,08:00:21 | INFO |   wandb_notes: 
2024-02-05,08:00:21 | INFO |   wandb_project_name: open-clip
2024-02-05,08:00:21 | INFO |   warmup: 10000
2024-02-05,08:00:21 | INFO |   wd: 0.2
2024-02-05,08:00:21 | INFO |   workers: 4
2024-02-05,08:00:21 | INFO |   world_size: 1
2024-02-05,08:00:21 | INFO |   zeroshot_frequency: 2
2024-02-05,08:00:21 | INFO | Starting zero-shot imagenet.
2024-02-05,08:00:21 | INFO | Building zero-shot classifier
2024-02-05,08:01:13 | INFO | Using classifier
2024-02-05,08:02:09 | INFO | Finished zero-shot imagenet.
2024-02-05,08:02:09 | INFO | Eval Epoch: 0 imagenet-zeroshot-val-top1: 0.7889   imagenet-zeroshot-val-top5: 0.9546
rom1504 commented 5 months ago

Cool! How is the inference and training speed?


MengqingCao commented 5 months ago

> Cool! How is the inference and training speed?

Your speed of reply is amazing! :) As the screenshot below shows, it takes around 55 s to run ViT-L-14 inference on the ImageNet-1k validation set (with batch size 64 on a single NPU device).

[screenshot]

So I think it's fast, but I haven't measured the exact FLOPS. Are FLOPS numbers required?

rom1504 commented 5 months ago

A metric we usually look at is samples/s per accelerator.

Some baselines, on one 3080 GPU:

- B/32 inference speed is about 1300 samples/s
- L/14 is about 300 samples/s

Usually increasing the batch size to values like 256 helps.

For training on one A100 it looks like:

- 250 samples/s for B/32 (can be more when using fewer accelerators, hence less interconnect bottleneck)
- 80 samples/s for L/14

Usually with batch sizes around 128 per GPU.

I think it would be very interesting to have similar numbers on the NPU.
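A minimal way to get a comparable inference number on the NPU would be a timed forward-pass loop along these lines (an illustrative sketch, assuming the torch_npu plugin and the ViT-B-32 laion2b_s34b_b79k weights; not part of this PR):

import time
import torch
import torch_npu  # Ascend plugin; torch.npu mirrors the torch.cuda API
import open_clip

device = "npu:0"
model, _, _ = open_clip.create_model_and_transforms("ViT-B-32", pretrained="laion2b_s34b_b79k")
model = model.to(device).eval()

batch = torch.randn(256, 3, 224, 224, device=device)  # synthetic images, batch size 256
n_iters = 20

with torch.no_grad():
    for _ in range(5):  # warm-up iterations
        model.encode_image(batch)
    torch.npu.synchronize()
    start = time.time()
    for _ in range(n_iters):
        model.encode_image(batch)
    torch.npu.synchronize()
    elapsed = time.time() - start

print(f"{n_iters * batch.shape[0] / elapsed:.1f} samples/s")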


MengqingCao commented 5 months ago


Sorry for the late reply, and thanks for your explanation.

I've noticed that an implementation of this metric already exists in the training pipeline: it is named samples_per_second_per_gpu in src/training/train.py. I have tested the samples/s metric for the NPU in the training pipeline, with the following results (screenshots below; a short sketch of how this metric is computed follows them).

I'm a bit confused about the inference speed you mentioned: does it refer to evaluating the CLIP model, or to running inference with the CLIP model for zero-shot image classification?

Screenshots

B/32: [screenshot]

L/14: [screenshot]
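For reference, the samples_per_second_per_gpu number logged by the training loop is, roughly, the per-device batch size times the gradient-accumulation factor divided by the wall-clock time of one step (a paraphrase of the logic, not a verbatim excerpt of src/training/train.py):

def samples_per_second_per_device(batch_size: int, accum_freq: int, step_time_s: float) -> float:
    """Samples processed by one accelerator per second of wall-clock step time."""
    return accum_freq * batch_size / step_time_s

# e.g. a per-device batch of 64 finished in 0.8 s gives 80 samples/s per device;
# multiplying by world_size gives the global samples/s.
print(samples_per_second_per_device(batch_size=64, accum_freq=1, step_time_s=0.8))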

MengqingCao commented 4 months ago

@rom1504 Hi, a few weeks have gone by. If there are any suggestions or concerns, please let me know and I'll address them as soon as possible.

MengqingCao commented 3 months ago

Could anyone help with reviewing? Thanks 👍 @rom1504 @rwightman @gabrielilharco @bryant1410 @mitchellnw

MengqingCao commented 1 month ago

Sorry to bother you. Could you help review this PR? @rwightman @gabrielilharco