Open · MengqingCao opened this pull request 5 months ago
openclip performs great on CLIP model training and inference, but unfortunately it seems to only support GPU and CPU at the moment. I notice that there is a need for other backends:
- TPU support: #20 (https://github.com/mlfoundations/open_clip/issues/20)
- More backends support: #796 (https://github.com/mlfoundations/open_clip/discussions/796)
This PR adds Ascend NPU backend support. I tested the NPU support by evaluating the ViT-L-14 model on the ImageNet-1k dataset, and everything works well.
Evaluation on NPU was run with:

python3 -m training.main \
    --model ViT-L-14 \
    --pretrained "./models/CLIP-ViT-L-14-DataComp.XL-s13B-b90K/open_clip_pytorch_model.bin" \
    --seed 0 \
    --imagenet-val './data/ImageNet-1000/val'
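Note that the command above does not pass an explicit device flag; as the log below shows ("Running with a single process. Device npu:0."), the device is picked up automatically. As a rough, hedged sketch of how that kind of device selection typically looks with the Ascend plugin (it assumes the torch_npu package, which registers the "npu" device type with PyTorch; it is illustrative, not the exact code in this PR):

```python
import torch

# Sketch only: torch_npu registers the "npu" device type and exposes a
# torch.npu namespace that mirrors the familiar torch.cuda API.
try:
    import torch_npu  # noqa: F401
    HAS_NPU = torch.npu.is_available()
except ImportError:
    HAS_NPU = False


def pick_device(local_rank: int = 0) -> torch.device:
    """Prefer CUDA, then an Ascend NPU, then CPU (illustrative ordering)."""
    if torch.cuda.is_available():
        return torch.device(f"cuda:{local_rank}")
    if HAS_NPU:
        torch.npu.set_device(local_rank)  # analogous to torch.cuda.set_device
        return torch.device(f"npu:{local_rank}")
    return torch.device("cpu")


device = pick_device()
print(device)  # e.g. "npu:0" on an Ascend machine without CUDA
```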
The pretrained weights were downloaded from laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K (https://huggingface.co/laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K).
The evaluation results of ViT-L-14 on npu:
- imagenet-zeroshot-val-top1: 78.89%
- imagenet-zeroshot-val-top5: 95.46%

(screenshot: https://github.com/mlfoundations/open_clip/assets/52243582/3df7fb0c-9928-4944-8a1a-e358240725b3)

The results are close to those reported on GPU (top-1 acc: 79.2%).
Detailed logs from the evaluation run:
2024-02-05,08:00:10 | INFO | Running with a single process. Device npu:0.
2024-02-05,08:00:10 | INFO | Loaded ViT-L-14 model config.
2024-02-05,08:00:17 | INFO | Loading pretrained ViT-L-14 weights (./models/CLIP-ViT-L-14-DataComp.XL-s13B-b90K/open_clip_pytorch_model.bin).
2024-02-05,08:00:21 | INFO | Model:
2024-02-05,08:00:21 | INFO | CLIP(
  (visual): VisionTransformer(
    (conv1): Conv2d(3, 1024, kernel_size=(14, 14), stride=(14, 14), bias=False)
    (patch_dropout): Identity()
    (ln_pre): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
    (transformer): Transformer(
      (resblocks): ModuleList(
        (0-23): 24 x ResidualAttentionBlock(
          (ln_1): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=1024, out_features=1024, bias=True)
          )
          (ls_1): Identity()
          (ln_2): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (mlp): Sequential(
            (c_fc): Linear(in_features=1024, out_features=4096, bias=True)
            (gelu): GELU(approximate='none')
            (c_proj): Linear(in_features=4096, out_features=1024, bias=True)
          )
          (ls_2): Identity()
        )
      )
    )
    (ln_post): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
  )
  (transformer): Transformer(
    (resblocks): ModuleList(
      (0-11): 12 x ResidualAttentionBlock(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
        )
        (ls_1): Identity()
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): Sequential(
          (c_fc): Linear(in_features=768, out_features=3072, bias=True)
          (gelu): GELU(approximate='none')
          (c_proj): Linear(in_features=3072, out_features=768, bias=True)
        )
        (ls_2): Identity()
      )
    )
  )
  (token_embedding): Embedding(49408, 768)
  (ln_final): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
2024-02-05,08:00:21 | INFO | Params:
2024-02-05,08:00:21 | INFO | accum_freq: 1
2024-02-05,08:00:21 | INFO | aug_cfg: {}
2024-02-05,08:00:21 | INFO | batch_size: 64
2024-02-05,08:00:21 | INFO | beta1: 0.9
2024-02-05,08:00:21 | INFO | beta2: 0.98
2024-02-05,08:00:21 | INFO | checkpoint_path: ./logs/2024_02_05-08_00_10-model_ViT-L-14-lr_0.0005-b_64-j_4-p_amp/checkpoints
2024-02-05,08:00:21 | INFO | coca_caption_loss_weight: 2.0
2024-02-05,08:00:21 | INFO | coca_contrastive_loss_weight: 1.0
2024-02-05,08:00:21 | INFO | copy_codebase: False
2024-02-05,08:00:21 | INFO | csv_caption_key: title
2024-02-05,08:00:21 | INFO | csv_img_key: filepath
2024-02-05,08:00:21 | INFO | csv_separator:
2024-02-05,08:00:21 | INFO | dataset_resampled: False
2024-02-05,08:00:21 | INFO | dataset_type: auto
2024-02-05,08:00:21 | INFO | ddp_static_graph: False
2024-02-05,08:00:21 | INFO | debug: False
2024-02-05,08:00:21 | INFO | delete_previous_checkpoint: False
2024-02-05,08:00:21 | INFO | device: npu:0
2024-02-05,08:00:21 | INFO | dist_backend: nccl
2024-02-05,08:00:21 | INFO | dist_url: env://
2024-02-05,08:00:21 | INFO | distill: False
2024-02-05,08:00:21 | INFO | distill_model: None
2024-02-05,08:00:21 | INFO | distill_pretrained: None
2024-02-05,08:00:21 | INFO | distributed: False
2024-02-05,08:00:21 | INFO | epochs: 32
2024-02-05,08:00:21 | INFO | epochs_cooldown: None
2024-02-05,08:00:21 | INFO | eps: 1e-06
2024-02-05,08:00:21 | INFO | force_custom_text: False
2024-02-05,08:00:21 | INFO | force_image_size: None
2024-02-05,08:00:21 | INFO | force_patch_dropout: None
2024-02-05,08:00:21 | INFO | force_quick_gelu: False
2024-02-05,08:00:21 | INFO | gather_with_grad: False
2024-02-05,08:00:21 | INFO | grad_checkpointing: False
2024-02-05,08:00:21 | INFO | grad_clip_norm: None
2024-02-05,08:00:21 | INFO | horovod: False
2024-02-05,08:00:21 | INFO | image_interpolation: None
2024-02-05,08:00:21 | INFO | image_mean: None
2024-02-05,08:00:21 | INFO | image_resize_mode: None
2024-02-05,08:00:21 | INFO | image_std: None
2024-02-05,08:00:21 | INFO | imagenet_v2: None
2024-02-05,08:00:21 | INFO | imagenet_val: ./data/ImageNet-1000/val
2024-02-05,08:00:21 | INFO | local_loss: False
2024-02-05,08:00:21 | INFO | local_rank: 0
2024-02-05,08:00:21 | INFO | lock_image: False
2024-02-05,08:00:21 | INFO | lock_image_freeze_bn_stats: False
2024-02-05,08:00:21 | INFO | lock_image_unlocked_groups: 0
2024-02-05,08:00:21 | INFO | lock_text: False
2024-02-05,08:00:21 | INFO | lock_text_freeze_layer_norm: False
2024-02-05,08:00:21 | INFO | lock_text_unlocked_layers: 0
2024-02-05,08:00:21 | INFO | log_every_n_steps: 100
2024-02-05,08:00:21 | INFO | log_level: 20
2024-02-05,08:00:21 | INFO | log_local: False
2024-02-05,08:00:21 | INFO | log_path: ./logs/2024_02_05-08_00_10-model_ViT-L-14-lr_0.0005-b_64-j_4-p_amp/out.log
2024-02-05,08:00:21 | INFO | logs: ./logs/
2024-02-05,08:00:21 | INFO | lr: 0.0005
2024-02-05,08:00:21 | INFO | lr_cooldown_end: 0.0
2024-02-05,08:00:21 | INFO | lr_cooldown_power: 1.0
2024-02-05,08:00:21 | INFO | lr_scheduler: cosine
2024-02-05,08:00:21 | INFO | model: ViT-L-14
2024-02-05,08:00:21 | INFO | name: 2024_02_05-08_00_10-model_ViT-L-14-lr_0.0005-b_64-j_4-p_amp
2024-02-05,08:00:21 | INFO | no_set_device_rank: False
2024-02-05,08:00:21 | INFO | precision: amp
2024-02-05,08:00:21 | INFO | pretrained: ./models/CLIP-ViT-L-14-DataComp.XL-s13B-b90K/open_clip_pytorch_model.bin
2024-02-05,08:00:21 | INFO | pretrained_image: False
2024-02-05,08:00:21 | INFO | rank: 0
2024-02-05,08:00:21 | INFO | remote_sync: None
2024-02-05,08:00:21 | INFO | remote_sync_frequency: 300
2024-02-05,08:00:21 | INFO | remote_sync_protocol: s3
2024-02-05,08:00:21 | INFO | report_to:
2024-02-05,08:00:21 | INFO | resume: None
2024-02-05,08:00:21 | INFO | save_frequency: 1
2024-02-05,08:00:21 | INFO | save_most_recent: False
2024-02-05,08:00:21 | INFO | seed: 0
2024-02-05,08:00:21 | INFO | siglip: False
2024-02-05,08:00:21 | INFO | skip_scheduler: False
2024-02-05,08:00:21 | INFO | tensorboard: False
2024-02-05,08:00:21 | INFO | tensorboard_path:
2024-02-05,08:00:21 | INFO | torchcompile: False
2024-02-05,08:00:21 | INFO | torchscript: False
2024-02-05,08:00:21 | INFO | trace: False
2024-02-05,08:00:21 | INFO | train_data: None
2024-02-05,08:00:21 | INFO | train_data_upsampling_factors: None
2024-02-05,08:00:21 | INFO | train_num_samples: None
2024-02-05,08:00:21 | INFO | use_bn_sync: False
2024-02-05,08:00:21 | INFO | use_bnb_linear: None
2024-02-05,08:00:21 | INFO | val_data: None
2024-02-05,08:00:21 | INFO | val_frequency: 1
2024-02-05,08:00:21 | INFO | val_num_samples: None
2024-02-05,08:00:21 | INFO | wandb: False
2024-02-05,08:00:21 | INFO | wandb_notes:
2024-02-05,08:00:21 | INFO | wandb_project_name: open-clip
2024-02-05,08:00:21 | INFO | warmup: 10000
2024-02-05,08:00:21 | INFO | wd: 0.2
2024-02-05,08:00:21 | INFO | workers: 4
2024-02-05,08:00:21 | INFO | world_size: 1
2024-02-05,08:00:21 | INFO | zeroshot_frequency: 2
2024-02-05,08:00:21 | INFO | Starting zero-shot imagenet.
2024-02-05,08:00:21 | INFO | Building zero-shot classifier
2024-02-05,08:01:13 | INFO | Using classifier
2024-02-05,08:02:09 | INFO | Finished zero-shot imagenet.
2024-02-05,08:02:09 | INFO | Eval Epoch: 0 imagenet-zeroshot-val-top1: 0.7889 imagenet-zeroshot-val-top5: 0.9546
Commit Summary:
- a6a2032 add npu support (https://github.com/mlfoundations/open_clip/pull/813/commits/a6a203262d419b32de94f6b0f6edb0aae1891172)

File Changes (5 files, https://github.com/mlfoundations/open_clip/pull/813/files):
- A requirements-npu.txt (7)
- M src/training/distributed.py (9)
- M src/training/main.py (10)
- M src/training/precision.py (5) (an illustrative autocast sketch follows after this list)
- M src/training/profiler.py (7)
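Among the touched files, src/training/precision.py is where open_clip decides which autocast context to use for --precision amp. The sketch below shows what NPU-aware mixed precision can look like; it is a hedged illustration rather than the actual diff in this PR, and it assumes torch_npu plus a PyTorch build where torch.autocast accepts device_type="npu" (older torch_npu releases expose torch.npu.amp.autocast instead).

```python
import contextlib

import torch


def get_autocast(precision: str, device_type: str = "npu"):
    """Illustrative autocast factory (not the PR's actual code).

    Assumes torch_npu is installed when device_type="npu"; falls back to a
    no-op context for full fp32 precision.
    """
    if precision in ("amp", "amp_bf16"):
        # CPU autocast traditionally supports bfloat16, so pick the dtype accordingly.
        dtype = torch.bfloat16 if (precision == "amp_bf16" or device_type == "cpu") else torch.float16
        return lambda: torch.autocast(device_type=device_type, dtype=dtype)
    return lambda: contextlib.nullcontext()


# Usage: pick a device type that is actually present on this machine.
device_type = (
    "npu" if getattr(torch, "npu", None) is not None and torch.npu.is_available()
    else "cuda" if torch.cuda.is_available()
    else "cpu"
)
autocast = get_autocast("amp", device_type=device_type)
with autocast():
    pass  # the model forward pass would run here under mixed precision
```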
Cool! How is the inference and training speed?
Your speed of reply is amazing! : )
As the screenshot below shows, it takes around 55 s to run ViT-L-14 inference on the ImageNet-1k validation set (with batch size 64 on a single NPU device).

(screenshot: https://github.com/mlfoundations/open_clip/assets/52243582/30356825-6496-4c4d-af64-79d4b31890d6)

So I think it is fast, but I haven't measured the exact FLOPS. Are FLOPS numbers required?
A metric we usually look at is samples/s per accelerator.

Some baselines, on one 3080 GPU:
- B/32 inference speed is about 1300 samples/s
- L/14 is about 300 samples/s

Usually increasing the batch size to values like 256 helps.

For training on one A100 it looks like:
- 250 samples/s for B/32 (can be more if using fewer accelerators, hence less interconnect bottleneck)
- 80 samples/s for L/14

Usually with batch sizes around 128 per GPU.

I think it would be very interesting to have similar numbers on NPU.
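For a rough sense of how the ~55 s reported above maps onto this metric: the ImageNet-1k validation set has 50,000 images, so under the assumption that the 55 s covers the full validation pass on a single NPU (and excludes building the zero-shot classifier), the implied throughput is roughly:

```python
# Back-of-envelope only; assumes the reported ~55 s is the complete
# 50,000-image validation pass on one NPU at batch size 64.
val_images = 50_000
elapsed_s = 55
print(f"~{val_images / elapsed_s:.0f} samples/s")
```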
Sorry for the late reply, and thanks for your explanation.
I've noticed that this metric is already implemented in the training pipeline, named samples_per_second_per_gpu in src/training/train.py.
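For readers following along, that metric is per-accelerator throughput measured over a training step. Paraphrased (a sketch, not a verbatim copy of src/training/train.py, and the example values are hypothetical), the bookkeeping looks roughly like this:

```python
import time

# Hypothetical per-step values; in open_clip these come from the CLI args.
batch_size = 128   # per-accelerator batch size
world_size = 1     # number of accelerators (GPUs/NPUs)
accum_freq = 1     # gradient accumulation steps

step_start = time.time()
# ... forward / backward / optimizer step would run here ...
batch_time = max(time.time() - step_start, 1e-9)  # guard against division by zero

samples_per_second = accum_freq * batch_size * world_size / batch_time
samples_per_second_per_gpu = accum_freq * batch_size / batch_time
print(samples_per_second, samples_per_second_per_gpu)
```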
I have tested the samples/s metric on NPU in the training pipeline, with the following results:
I'm a bit confused: does the inference speed you mentioned refer to evaluating the CLIP model, or to the inference process of using the CLIP model for zero-shot image classification?
B/32:
L/14:
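On the evaluation-versus-inference question above: the numbers earlier in this thread come from training.main's zero-shot ImageNet evaluation, while plain zero-shot classification uses open_clip's inference API directly. Below is a minimal, hedged sketch of the latter on an NPU device; it assumes torch_npu is installed, reuses the checkpoint path from the PR description, and example.jpg plus the prompt list are placeholders.

```python
import torch
import open_clip
from PIL import Image

# Assumption: torch_npu provides the "npu" device type; otherwise fall back.
try:
    import torch_npu  # noqa: F401
    device = "npu:0" if torch.npu.is_available() else "cpu"
except ImportError:
    device = "cuda:0" if torch.cuda.is_available() else "cpu"

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-L-14",
    pretrained="./models/CLIP-ViT-L-14-DataComp.XL-s13B-b90K/open_clip_pytorch_model.bin",
)
tokenizer = open_clip.get_tokenizer("ViT-L-14")
model = model.to(device).eval()

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
text = tokenizer(["a photo of a dog", "a photo of a cat"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(probs)  # per-prompt zero-shot probabilities for the example image
```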
@rom1504 Hi, a few weeks have passed; if there are any suggestions or concerns, please let me know and I'll address them as soon as possible.
Could anyone help with reviewing? Thanks 👍 @rom1504 @rwightman @gabrielilharco @bryant1410 @mitchellnw
Sorry to bother you. Could you help review this PR? @rwightman @gabrielilharco