iterative / studio-support

❓ DVC Studio Issues, Question, and Discussions
https://studio.iterative.ai

Streaming live metrics to DVC studio severely limits training speed #95

Closed NiklasKappel closed 4 months ago

NiklasKappel commented 4 months ago

Consider this example project, complete with an MWE and a conda environment.

The MWE trains a dummy model with the Lightning framework and logs metrics with the DVCLiveLogger. I ran python test.py without logging in to DVC Studio. This is the output I got:

GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs

  | Name  | Type   | Params
---------------------------------
0 | layer | Linear | 4     
---------------------------------
4         Trainable params
0         Non-trainable params
4         Total params
0.000     Total estimated model params size (MB)
Epoch 99: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 32/32 [00:00<00:00, 92.19it/s, v_num=_run]
`Trainer.fit` stopped: `max_epochs=100` reached.
Epoch 99: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 32/32 [00:00<00:00, 92.03it/s, v_num=_run]

The whole training took 37.90 seconds to complete. Then I ran dvc studio login and ran python test.py again. This time the training took 32.86 MINUTES to complete (roughly 52 times longer) and the output changed slightly:

GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs

  | Name  | Type   | Params
---------------------------------
0 | layer | Linear | 4     
---------------------------------
4         Trainable params
0         Non-trainable params
4         Total params
0.000     Total estimated model params size (MB)
Epoch 17:  97%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏    | 31/32 [00:20<00:00,  1.53it/s, v_num=_run]
WARNING:dvc_studio_client:Failed to post to Studio: {"code": 429, "detail": "You have exceeded your rate limits."}
WARNING:dvclive:`post_to_studio` `data` failed.
Epoch 18:  97%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏    | 31/32 [00:18<00:00,  1.67it/s, v_num=_run]
WARNING:dvc_studio_client:Failed to post to Studio: {"code": 429, "detail": "You have exceeded your rate limits."}
WARNING:dvclive:`post_to_studio` `data` failed.
Epoch 99: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 32/32 [00:19<00:00,  1.62it/s, v_num=_run]
`Trainer.fit` stopped: `max_epochs=100` reached.
Epoch 99: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 32/32 [00:19<00:00,  1.62it/s, v_num=_run]

Note the intermittent "You have exceeded your rate limits" warnings. From what I can tell, training while logged in to DVC Studio is consistently slow, even before the first rate-limit warning appears. If I run dvc studio logout again, the training speed goes back to normal (about 40 seconds per run).
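As a rough sanity check on these numbers, the sketch below uses only figures quoted in the logs above (100 epochs, 32 batches per epoch, about 92.19 it/s logged out and about 1.62 it/s logged in). It assumes the progress-bar rates are representative of the whole run, which the logs do not guarantee. It estimates how much extra time each training step costs while logged in and whether that alone accounts for the measured wall-clock time.

```python
# Back-of-the-envelope estimate from the figures quoted in the logs.
# Assumption: the it/s values shown on the progress bars hold for the whole run.

EPOCHS = 100                 # max_epochs=100
BATCHES_PER_EPOCH = 32       # "32/32" on the progress bar
RATE_LOGGED_OUT = 92.19      # it/s without a Studio login
RATE_LOGGED_IN = 1.62        # it/s with a Studio login
WALL_LOGGED_OUT_S = 37.90    # measured, seconds
WALL_LOGGED_IN_S = 32.86 * 60  # measured, minutes converted to seconds

total_steps = EPOCHS * BATCHES_PER_EPOCH

# Time per step in each mode, and the difference between them.
step_logged_out_s = 1.0 / RATE_LOGGED_OUT
step_logged_in_s = 1.0 / RATE_LOGGED_IN
extra_per_step_s = step_logged_in_s - step_logged_out_s

# Wall-clock time predicted from the logged-in step rate alone.
predicted_logged_in_s = total_steps * step_logged_in_s

# Overall slowdown between the two measured runs.
slowdown = WALL_LOGGED_IN_S / WALL_LOGGED_OUT_S

if __name__ == "__main__":
    print(f"total steps:            {total_steps}")
    print(f"extra time per step:    {extra_per_step_s:.3f} s")
    print(f"predicted logged-in:    {predicted_logged_in_s / 60:.2f} min")
    print(f"measured logged-in:     {WALL_LOGGED_IN_S / 60:.2f} min")
    print(f"measured slowdown:      {slowdown:.1f}x")
```

Under these assumptions the extra cost comes to about 0.6 seconds per step, and 3200 steps at the logged-in rate predicts about 33 minutes, close to the measured 32.86 minutes. That pattern would be consistent with some fixed per-step delay, such as one blocking network call on every logged step, but this is a hypothesis rather than something the logs confirm.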

I really like DVC for its data versioning and pipeline management capabilities, and I would like to use DVC Studio for live metrics monitoring, since it understands when I associate different pipeline stages with different dvclive/ outputs. So it would be very nice if we could figure out what is causing the training delay here. :)