microsoft / Cream

This is a collection of our NAS and Vision Transformer work.
MIT License
1.69k stars 227 forks

Hello, I encountered four problems in total when running script/auto_weight_inherit_100to75.sh of TinyCLIP. I have solved all four, but I am not sure whether my fix for the fourth one will affect the accuracy of the model. Could you give me some advice? Thanks! #248

Open leo23ui opened 6 hours ago

leo23ui commented 6 hours ago

Hello, I faced four problems when running the script. I have solved the first three, but I am not sure about my fix for the fourth one. Could you please give me some suggestions? Thank you!


Problem 1: train.py's dataloader has no batch_size attribute (even though --batch-size 512 is set in the shell script), so I changed `batch_size = dataloader.batch_size` in train.py to `batch_size = getattr(dataloader, 'batch_size', 512)` and the error disappeared.
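For reference, the fallback pattern can be sketched in isolation (the `_NoBatchSizeLoader` stand-in below is hypothetical; 512 mirrors the --batch-size value from the shell script):

```python
class _NoBatchSizeLoader:
    """Stand-in for a dataloader object that has no .batch_size attribute."""
    pass

dataloader = _NoBatchSizeLoader()

# The original line `batch_size = dataloader.batch_size` would raise
# AttributeError here; getattr falls back to the default instead.
batch_size = getattr(dataloader, "batch_size", 512)
print(batch_size)  # → 512
```

One caveat: the hard-coded 512 silently diverges if --batch-size is changed in the shell script; falling back to args.batch_size instead would keep the two in sync.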


Problem 2: the error showed that args has no train_data_upsampling_factors attribute (--train-data-upsampling-factors is not set in the shell script), so in data.py I deleted the following branch together with its judgment statement:

```python
if resampled:
    pipeline = [ResampledShards2(
        input_shards,
        weights=args.train_data_upsampling_factors,
        deterministic=True,
        epoch=shared_epoch,
    )]
else:
    pipeline = [wds.SimpleShardList(input_shards)]
```

and replaced it with:

```python
pipeline = [wds.SimpleShardList(input_shards)]
```
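An alternative to deleting the branch, sketched under the assumption that a CLI flag that is never passed simply leaves the attribute off args: default it with getattr and keep both pipelines. `Namespace` below stands in for the parsed args; in data.py the resulting `weights` would then be passed to `ResampledShards2`, where `None` conventionally means uniform shard sampling.

```python
from argparse import Namespace

# args as parsed from the shell script, which never sets
# --train-data-upsampling-factors, so the attribute is absent
args = Namespace(
    train_data="/home/gg/gg/MQBench-main/test/model/tran1/batch_1.tar",
    batch_size=512,
)

# Default the missing attribute instead of deleting the resampled branch
weights = getattr(args, "train_data_upsampling_factors", None)
print(weights)  # → None
```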


Problem 3: the error showed that /home/gg/gg/MQBench-main/test/model/tran1 is not a tar file, so I changed the path in the shell script to /home/gg/gg/MQBench-main/test/model/tran1/batch_1.tar. But there is another problem: I have a lot of tar files, so referencing all of them is also an issue. I'll leave that for now.
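On the many-tar-files question: webdataset-style loaders (including open_clip's) generally accept brace patterns, expanded internally with the braceexpand library, so a single --train-data argument can name many shards. The shard range below is hypothetical; adjust it to your actual file names:

```shell
# Plain bash brace expansion shows what the pattern means; pass the quoted,
# unexpanded form to the trainer, e.g.
#   --train-data "/home/gg/gg/MQBench-main/test/model/tran1/batch_{1..3}.tar"
# and the loader expands it internally.
echo batch_{1..3}.tar
```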


Problem 4: now a new error appears, KeyError: 'fname'. Could you please give me some advice? Thank you!!

```
[rank0]: File "/home/gg/miniconda3/envs/a/lib/python3.10/site-packages/webdataset/filters.py", line 221, in _shuffle
[rank0]:   buf.append(next(data))  # skipcq: PYL-R1708
[rank0]: File "/home/gg/gg/MQBench-main/test/model/TinyCLIP/src/training/data.py", line 212, in group_by_keys_nothrow
[rank0]:   fname, value = filesample["fname"], filesample["data"]
[rank0]: KeyError: 'fname'
```

I solved it by adding `if "fname" not in filesample or "data" not in filesample: continue` at the top of the loop in `group_by_keys_nothrow`. Will this affect the accuracy of the model? I hope to get your reply, thanks!!
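To reason about whether that guard can hurt accuracy, here is a minimal sketch of what a group_by_keys-style function does with the guard in place (my own simplified reconstruction, not the actual webdataset/TinyCLIP code): malformed entries are dropped, while every well-formed .jpg/.txt pair still comes through intact, so only the broken samples are lost.

```python
def group_by_keys_nothrow(data):
    """Group tar members with the same basename (x.jpg + x.txt) into one
    sample dict, skipping malformed entries instead of raising KeyError."""
    current_sample = None
    for filesample in data:
        # the added guard: drop entries missing 'fname' or 'data'
        if "fname" not in filesample or "data" not in filesample:
            continue
        prefix, suffix = filesample["fname"].rsplit(".", 1)
        if current_sample is None or prefix != current_sample["__key__"]:
            if current_sample is not None:
                yield current_sample
            current_sample = {"__key__": prefix}
        current_sample[suffix] = filesample["data"]
    if current_sample is not None:
        yield current_sample

samples = list(group_by_keys_nothrow(iter([
    {"fname": "10000000_106b46b0a6.jpg", "data": b"image-bytes"},
    {"fname": "10000000_106b46b0a6.txt", "data": b"caption"},
    {"broken": True},  # malformed entry: skipped by the guard
    {"fname": "10000000_106b46b078.jpg", "data": b"image-bytes"},
    {"fname": "10000000_106b46b078.txt", "data": b"caption"},
])))
print(len(samples))  # → 2
```

If many samples trip the guard, though, the effective dataset shrinks silently, so it is worth counting how often the `continue` fires before trusting a training run.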

My dataset inside the tar has the following format:

```
10000000_106b46b0a6.jpg
10000000_106b46b0a6.txt
10000000_106b46b078.jpg
10000000_106b46b078.txt
10000000_106b46b0a9.jpg
10000000_106b46b0a9.txt
```
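One thing worth checking, since group_by_keys groups by adjacency inside the tar: members of the same sample must sit next to each other in archive order, or the grouping produces incomplete samples. A sketch of building a shard that guarantees this (file names are placeholders; --sort=name is GNU-tar-only):

```shell
# Create a toy .jpg/.txt pair and pack it; --sort=name keeps same-basename
# members adjacent, which webdataset-style grouping relies on.
mkdir -p /tmp/wds_demo && cd /tmp/wds_demo
printf 'image-bytes' > a.jpg
printf 'caption'     > a.txt
tar --sort=name -cf batch_1.tar a.jpg a.txt
tar -tf batch_1.tar
```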

The following is my sh file:

```shell
export NNODES=1
export GPUS_PER_NODE=1
export WANDB__SERVICE_WAIT=60

DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE --nnodes $NNODES"
torchrun $DISTRIBUTED_ARGS src/training/main.py \
    --save-frequency 1 \
    --report-to wandb \
    --train-data /home/gg/gg/MQBench-main/test/model/tran1/batch_1.tar \
    --dataset-type webdataset \
    --imagenet-val ./ImageNet \
    --warmup 2000 \
    --batch-size 512 \
    --epochs 25 \
    --workers 1 \
    --model TinyCLIP-ViT-39M-16-Text-19M \
    --name exp_name \
    --seed 0 \
    --local-loss \
    --grad-checkpointing \
    --output ./outputs/TinyCLIP-ViT-39M-16-Text-19M \
    --lr 0.0001 \
    --gather-with-grad \
    --pretrained-image-file ViT-B-16@openai \
    --pretrained-text-file ViT-B-16@openai \
    --distillation-teacher ViT-B-32@laion2b_e16 \
    --norm_gradient_clip 5 \
    --train-num-samples 100000 \
    --logit-scale 50
```

wkcn commented 3 hours ago

You can check whether the input of the model is correct.

leo23ui commented 2 hours ago

> You can check whether the input of the model is correct.

Thanks for your reply!! I solved the above problem by adding `if "fname" not in filesample or "data" not in filesample: continue` in `group_by_keys_nothrow`.
But now I find there is no model saved in checkpoints. Could you please give me some advice? Thanks!