mlfoundations / open_clip

An open source implementation of CLIP.

Help With Training Bottleneck #899

Closed ShijianXu closed 2 weeks ago

ShijianXu commented 2 weeks ago

Hi,

I am using my own webdataset, which I created myself, for CLIP training, but it seems I have hit a training bottleneck: the GPU utilization is very low (wandb report).

Below are my training params:

python -m training.main \
    --save-frequency 1 \
    --zeroshot-frequency 1 \
    --report-to wandb \
    --dataset-type webdataset \
    --train-data '/webdataset/dataset-train-{000000..000078}.tar'  \
    --train-num-samples 788603 \
    --warmup 10000 \
    --batch-size=256 \
    --lr=1e-3 \
    --wd=0.1 \
    --epochs=200 \
    --workers=8 \
    --model RN18-1d

I am training on my local workstation with an NVIDIA 2080 Ti and 32 GB of RAM.

Does anyone know what's wrong with my training? Thanks a lot! Best regards.

ShijianXu commented 2 weeks ago

Found the problem: my data is stored on a slow HDD, so the dataloader could not keep up with the GPU.
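
A quick way to confirm this kind of I/O bottleneck is to iterate the shards without the model and measure samples per second. Below is a minimal sketch (not from the thread) using the webdataset library; the shard pattern is copied from the training command, while the "jpg"/"txt" sample keys are assumptions and should be adjusted to match your actual dataset.

import time
import webdataset as wds

# Same shard pattern as in the training command above
urls = "/webdataset/dataset-train-{000000..000078}.tar"

dataset = (
    wds.WebDataset(urls)
    .decode("pil")           # decode images with PIL; drop this line to measure pure read speed
    .to_tuple("jpg", "txt")  # assumed sample keys; adjust to your dataset
)

start = time.time()
n = 0
for image, text in dataset:
    n += 1
    if n == 10_000:
        break

elapsed = time.time() - start
print(f"{n} samples in {elapsed:.1f}s -> {n / elapsed:.1f} samples/s")

If the resulting samples/s is well below what the GPU needs to sustain your batch size, the disk (or decoding) is the limiting factor; moving the shards to an SSD, as the resolution here implies, or raising --workers are the usual remedies.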