microsoft / Cream

This is a collection of our NAS and Vision Transformer work.

TinyCLIP training log #215

Open Gumpest opened 5 months ago

Gumpest commented 5 months ago

In my reproduction of auto_weight_inherit_100to75.sh, the imagenet-zeroshot-val-top1 is 0.0010 at Train Epoch: 0 [2501/48828]. I wonder whether this is normal.

Gumpest commented 5 months ago

The wandb log.

Gumpest commented 5 months ago

It happened during the cross-modal distillation process.

wkcn commented 5 months ago

@Gumpest I observed --train-data synthetic in the training command.

Did you replace the dataloader with the one loading LAION-400M image-text pairs?

Gumpest commented 5 months ago

@wkcn Oh, I didn't do that. That step is not mentioned in the docs. Do you have more detailed information?

wkcn commented 5 months ago

Sorry about that. Regarding the data loader, you can refer to the OpenCLIP repo (https://github.com/mlfoundations/open_clip?tab=readme-ov-file#data).
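For reference, here is a minimal sketch of producing LAION-400M webdataset shards with img2dataset, the tool the OpenCLIP data README points to. The metadata path, output folder, and tuning values below are placeholders; please check the img2dataset documentation for the exact LAION-400M recipe.

# Convert the LAION-400M metadata (parquet files with URL/TEXT columns)
# into webdataset .tar shards of resized images and their captions.
pip install img2dataset

img2dataset \
 --url_list ./laion400m-meta \
 --input_format "parquet" \
 --url_col "URL" \
 --caption_col "TEXT" \
 --output_format webdataset \
 --output_folder ./laion400m-data \
 --image_size 256 \
 --processes_count 16 \
 --thread_count 64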

Gumpest commented 5 months ago

@wkcn Sorry to bother you. The link (https://github.com/mlfoundations/open_clip?tab=readme-ov-file#data) tells me how to download the LAION-400M dataset, but what does "replace the dataloader with the one loading LAION-400M image-text pairs" mean? 😂

Gumpest commented 5 months ago

@wkcn Or could you please provide the script to train with YFCC?

wkcn commented 5 months ago

@Gumpest Sorry for the late reply.

> @wkcn Sorry to bother you. The link (https://github.com/mlfoundations/open_clip?tab=readme-ov-file#data) tells me how to download the LAION-400M dataset, but what does "replace the dataloader with the one loading LAION-400M image-text pairs" mean? 😂

In our scripts, --train-data and --dataset-type are both set to synthetic. You need to replace them in order to load the LAION-400M or YFCC-15M datasets.
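As a concrete illustration (the shard pattern and sample count below are placeholders you would adjust to your own download), the change amounts to swapping these flags in the training command:

# Before (as shipped in auto_weight_inherit_100to75.sh):
#  --train-data synthetic \
#  --dataset-type synthetic \
# After (pointing at your LAION-400M webdataset shards;
# replace NNNNN with your last shard index and set the real sample count):
 --train-data "<your_laion400m_path>/{00000..NNNNN}.tar" \
 --dataset-type webdataset \
 --train-num-samples 400000000 \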

> @wkcn Or could you please provide the script to train with YFCC?

Here are the hyper-parameters on YFCC.

Training on YFCC-15M consists of two compression stages, each with 25 epochs: the first goes from 100% to 50% of the parameters, the second from 50% to 10%. We follow the hyper-parameters of CLIP, except that the learning rate is set to 10^-4 when using weight inheritance.

Fig. 7 in the Supplementary Material.

Stage 1: CLIP ViT-B/16 to TinyCLIP-ViT-39M-16-Text-19M (manual inheritance, 100% to 50%)

export NNODES=1
export GPUS_PER_NODE=8

DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE --nnodes $NNODES"
torchrun $DISTRIBUTED_ARGS src/training/main.py \
 --save-frequency 1 \
 --report-to wandb \
 --train-data <your_yfcc_path/> \
 --dataset-type webdataset \
 --imagenet-val ./ImageNet \
 --warmup 2000 \
 --batch-size 512 \
 --epochs 25 \
 --workers 8 \
 --model TinyCLIP-ViT-39M-16-Text-19M \
 --name exp_name \
 --seed 0 \
 --local-loss \
 --grad-checkpointing \
 --logs ./outputs/TinyCLIP-ViT-39M-16-Text-19M \
 --lr 0.0001 \
 --gather-with-grad \
 --pretrained-image-file ViT-B-16@openai \
 --pretrained-text-file ViT-B-16@openai \
 --distillation-teacher ViT-B-32@laion2b_e16 \
 --logit-scale 50 \
 --norm_gradient_clip 5 \
 --train-num-samples 15000000
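A practical note that is my own assumption rather than something stated above: with OpenCLIP-style training, checkpoints are written under <logs>/<name>/checkpoints/, so before Stage 2 you would point --pretrained-image-file and --pretrained-text-file at the final Stage 1 checkpoint, for example:

# Hypothetical handoff from Stage 1 to Stage 2; adjust the path if the
# fork stores checkpoints differently.
mkdir -p checkpoints
cp ./outputs/TinyCLIP-ViT-39M-16-Text-19M/exp_name/checkpoints/epoch_25.pt \
 checkpoints/TinyCLIP-ViT-39M-16-Text-19M-YFCC15M.pt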

Stage 2: TinyCLIP-ViT-39M-16-Text-19M to TinyCLIP-ViT-8M-16-Text-3M (manual inheritance, 50% to 10%)

export NNODES=1
export GPUS_PER_NODE=8

DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE --nnodes $NNODES"
torchrun $DISTRIBUTED_ARGS src/training/main.py \
 --save-frequency 1 \
 --report-to wandb \
 --train-data <your_yfcc_path/> \
 --dataset-type webdataset \
 --imagenet-val ./ImageNet \
 --warmup 2000 \
 --batch-size 512 \
 --epochs 25 \
 --workers 8 \
 --model TinyCLIP-ViT-8M-16-Text-3M \
 --name exp_name \
 --seed 0 \
 --local-loss \
 --grad-checkpointing \
 --logs ./outputs/TinyCLIP-ViT-8M-16-Text-3M \
 --lr 0.0001 \
 --gather-with-grad \
 --pretrained-image-file checkpoints/TinyCLIP-ViT-39M-16-Text-19M-YFCC15M.pt \
 --pretrained-text-file checkpoints/TinyCLIP-ViT-39M-16-Text-19M-YFCC15M.pt \
 --distillation-teacher ViT-B-32@laion2b_e16 \
 --logit-scale 50 \
 --norm_gradient_clip 5 \
 --train-num-samples 15000000
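To sanity-check the compressed model, here is a hedged sketch of zero-shot ImageNet evaluation following OpenCLIP's documented pattern, where running main.py without training data performs evaluation only. The checkpoint path is a placeholder, and the TinyCLIP fork may expect --pretrained-image-file / --pretrained-text-file instead of --pretrained.

# Zero-shot ImageNet evaluation of the Stage 2 checkpoint.
python src/training/main.py \
 --imagenet-val ./ImageNet \
 --model TinyCLIP-ViT-8M-16-Text-3M \
 --pretrained ./outputs/TinyCLIP-ViT-8M-16-Text-3M/exp_name/checkpoints/epoch_25.pt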