hkchengrex / Cutie

[CVPR 2024 Highlight] Putting the Object Back Into Video Object Segmentation
https://hkchengrex.com/Cutie/
MIT License

Training low perf #58

Open · bhack opened this issue 2 months ago

bhack commented 2 months ago

Can I ask for some details about your performance numbers for base-model training? How much time does a forward and backward pass take, and how much time does the dataloader take?

I find it very hard to get even a minimally decent GPU load, even when using a local SSD for the data. I've also tested with a setup similar to the one in your paper, with A100 GPUs.

Have you tested it with PyTorch 2.x?
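
For a reference point, this is roughly how one could split the per-iteration time between waiting on the dataloader and the forward/backward pass. The model and dataset below are placeholders, not Cutie's actual training code, so only the measurement pattern is meant to carry over:

```python
import time
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder model and random data; Cutie's real network and VOS dataset would go here.
model = torch.nn.Conv2d(3, 3, 3, padding=1).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataset = TensorDataset(torch.randn(256, 3, 384, 384), torch.randn(256, 3, 384, 384))
loader = DataLoader(dataset, batch_size=16, num_workers=8, pin_memory=True)

data_time = compute_time = 0.0
end = time.time()
for x, y in loader:
    data_time += time.time() - end            # time spent waiting on the dataloader
    step_start = time.time()
    x, y = x.cuda(non_blocking=True), y.cuda(non_blocking=True)
    loss = torch.nn.functional.mse_loss(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    torch.cuda.synchronize()                   # make the GPU work visible to the host timer
    compute_time += time.time() - step_start   # forward + backward + optimizer step
    end = time.time()

print(f'dataloader: {data_time:.1f}s, forward/backward: {compute_time:.1f}s')
```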

hkchengrex commented 2 months ago

I didn't time each component specifically. For pre-training/main training, each iteration took around 0.28/0.66 seconds. The GPU load should be near 99% most of the time -- which means the GPU should not have to wait for the dataloader at all. Sometimes the CPUs can be a bottleneck, but that really depends on the hardware. I have tested this with PyTorch 2.0 and didn't observe any significant slowdown.
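
A quick way to sanity-check whether the GPU is being starved is to log its utilization while training runs. A minimal sketch using torch.cuda.utilization(), which reads the same counter as nvidia-smi and requires the nvidia-ml-py (pynvml) package to be installed:

```python
import threading
import time
import torch

def log_gpu_utilization(interval: float = 5.0) -> None:
    """Print GPU utilization periodically; meant to run in a background thread."""
    while True:
        # torch.cuda.utilization() reports the nvidia-smi utilization percentage
        # and needs nvidia-ml-py installed.
        print(f'GPU utilization: {torch.cuda.utilization()}%')
        time.sleep(interval)

threading.Thread(target=log_gpu_utilization, daemon=True).start()
# ... start the training loop here; sustained readings far below ~99% usually
# mean the GPU is waiting on the dataloader or other CPU-side work.
```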

bhack commented 2 months ago

Thanks, it is important to have a reference. In my tests we are barely under 20% GPU occupancy... I am investigating.
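
One way to investigate is the PyTorch profiler, which separates CPU-side dataloading/augmentation time from CUDA kernel time. A minimal sketch with a placeholder model and dataset standing in for the real training step:

```python
import torch
from torch.profiler import profile, ProfilerActivity
from torch.utils.data import DataLoader, TensorDataset

# Placeholder model and data; in practice this would wrap the actual training loop.
model = torch.nn.Conv2d(3, 16, 3, padding=1).cuda()
loader = DataLoader(TensorDataset(torch.randn(256, 3, 384, 384)),
                    batch_size=16, num_workers=8)

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for i, (x,) in enumerate(loader):
        model(x.cuda(non_blocking=True)).sum().backward()  # stand-in for the real step
        if i == 10:                                        # profile a handful of iterations
            break

# If most wall-clock time sits in CPU-side dataloader/augmentation ops rather than
# CUDA kernels, the input pipeline (not the GPU) is the bottleneck.
print(prof.key_averages().table(sort_by='self_cuda_time_total', row_limit=15))
```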

bhack commented 2 months ago

I've tested both the original and the compiled code with the latest stable PyTorch and with PyTorch nightly, on A100 and H100, with different numbers of workers, different numbers of GPUs, with a larger batch size, with a local SSD, using larger images like DAVIS full-res, and with larger crops to fill the memory.

In any of these configurations I've achieved a decent GPU load with the base model.

hkchengrex commented 2 months ago

You have (all good then)? Or you haven't...?

bhack commented 2 months ago

No, in the best combo the load is always around 20%.

hkchengrex commented 2 months ago

I see. I think there is a typo in your previous comment. How is the CPU usage (like, with top)?

bhack commented 2 months ago

It is quite high; of course it depends on the number of workers. E.g., the H100 instance has 207 cores; with 98 workers and batch size 32 we have an average CPU load of 50-55%.

hkchengrex commented 2 months ago

I just tried with the latest code and PyTorch (small model). This is on a different machine, and I had to increase the number of workers in the pre-training stage to 32. I couldn't get it to 90%+ utilization on average, but it is a lot better than 20%. With this utilization the avg_time is similar -- 0.283/0.801 for pre-training/main training after warm-up. The pre-training stage is more CPU-intensive and has lower GPU utilization.

For reference, below are the screenshots during pre-training and main training respectively. It is likely that with better GPUs like the H100, the CPUs would need to work extra hard to keep the GPUs fed, but in any case they should not be slower than the 0.283/0.801 avg_time.

Pre-training: Screenshot from 2024-04-13 14-26-56

Main training: Screenshot from 2024-04-13 14-18-47
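
The worker count is only one of the DataLoader settings that affect how well the GPU stays fed. A minimal sketch of the knobs typically worth sweeping; the dataset is a placeholder and the values are illustrative, not Cutie's actual configuration:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset; in practice this would be the VOS training dataset.
dataset = TensorDataset(torch.randn(1024, 3, 384, 384))

loader = DataLoader(
    dataset,
    batch_size=16,
    num_workers=32,           # more workers help when CPU-side augmentation is the bottleneck
    pin_memory=True,          # page-locked host memory speeds up host-to-device copies
    persistent_workers=True,  # keep workers alive between epochs (avoids re-spawn cost)
    prefetch_factor=4,        # batches pre-fetched per worker (PyTorch default is 2)
    drop_last=True,
)
```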

hkchengrex commented 2 months ago

Are you getting "good" avg_time?

bhack commented 2 months ago

Currently I am testing only the main_training stage. With more RAM on the H100 I've increased the batch size to 32 and num_workers to 64, but I've also tested up to 128 workers with 8 GPUs. To also check the balance between file transfer, processing, and network load, I've tried using DAVIS full-res instead of DAVIS 480p, doubling the crop_size.

With batch_size 32 and a doubled crop_size, we have avg_time ~3.2.