microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

Generic Tensorboard Utility in Deepspeed #327

Closed rsn870 closed 1 year ago

rsn870 commented 4 years ago

Hi,

The tensorboard config present by default in DeepSpeed only provides logging for train/loss and train/lr. This feature, however, is quite limited. For example, in my training:

```python
inputs = data.to(model_engine.device)
mean_style = mean_style.to(model_engine.device)
latent_rnd = torch.randn(micro_batch_size, 512).to(model_engine.device)
losses, encode_lst = model_engine(inputs, mean_style, latent_rnd, mode='train')
total = torch.sum(torch.stack(losses))
```

It is `total` that I pass to backward as the loss, so I would only get graphs for it. However, I am also interested in viewing graphs for each individual component of the losses list, to fine-tune the optimization if necessary; this currently requires me to use an external tensorboard.
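A minimal sketch of that external workaround, assuming hypothetical names for the individual loss components; it uses a plain torch.utils.tensorboard.SummaryWriter on global rank 0 rather than anything provided by DeepSpeed:

```python
import torch.distributed as dist
from torch.utils.tensorboard import SummaryWriter

# Hypothetical names for the individual loss components; adjust to the model.
LOSS_NAMES = ["content", "style", "latent"]

# Create a writer only on global rank 0 to avoid duplicate event files
# (assumes torch.distributed has been initialized, e.g. by deepspeed.initialize).
writer = SummaryWriter(log_dir="runs/style_transfer") if dist.get_rank() == 0 else None

def log_losses(losses, step):
    """Log each loss component plus their sum as separate scalars."""
    if writer is None:
        return
    for name, loss in zip(LOSS_NAMES, losses):
        writer.add_scalar(f"train/loss_{name}", loss.item(), step)
    writer.add_scalar("train/loss_total", sum(l.item() for l in losses), step)
```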

Instead of using the default config, could a utility be provided to insert a 'customizable' tensorboard?

tjruwase commented 4 years ago

@rsn870 thanks so much for identifying these great improvements to DeepSpeed. Yes, our current tensorboard support is very limited, but we are interested in improvements that users will find useful. Can you share a more detailed description of your model?

rsn870 commented 4 years ago

My model is used for style transfer. Most style transfer scenarios deal with multiple losses, and it is quite necessary to have good visualisations of the landscape of each individual loss.

A generic tensorboard utility integrated into deepspeed would save some effort.

I would also love to see some utilities for fine-tuning more complex optimization scenarios involving multiple models and losses at the same time.

ShadenSmith commented 4 years ago

A simple first step might be to give users an access point to our SummaryWriter object, allowing arbitrary logging from client code. We could also provide a simple log_scalar() interface so folks can do some simple logging without diving into TensorBoard.

I can see some confusion coming from the asymmetry of the SummaryWriter only being present on global rank 0.
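To make the suggestion concrete, here is a rough sketch of what such an access point and helper could look like; the method names are hypothetical and not an existing DeepSpeed API:

```python
import torch.distributed as dist

class EngineLoggingSketch:
    """Hypothetical engine methods illustrating the proposal above; not part
    of DeepSpeed. Assumes self.summary_writer is the engine's SummaryWriter,
    created only on global rank 0."""

    def tensorboard_writer(self):
        # Expose the underlying SummaryWriter; None on non-zero ranks.
        return getattr(self, "summary_writer", None)

    def log_scalar(self, tag, value, step):
        # No-op on ranks other than 0, hiding the rank-0 asymmetry
        # from client code.
        if dist.get_rank() == 0 and getattr(self, "summary_writer", None) is not None:
            self.summary_writer.add_scalar(tag, value, step)

# Hypothetical client usage:
#   model_engine.log_scalar("train/style_loss", style_loss.item(), step)
```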

lekurile commented 1 year ago

Hi @rsn870,

Since the logging capabilities were expanded in GH-2013 to support TensorBoard, WandB, and CSV logging formats (see documentation here), I'll close the issue for now.
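For reference, a sketch of the monitoring section of a DeepSpeed config enabling the TensorBoard backend, written as a Python config dict; the batch size, output path, and job name are placeholders, and the exact keys are described in the monitoring documentation linked above:

```python
import deepspeed

# Placeholder values; keys follow the DeepSpeed monitoring config sections.
ds_config = {
    "train_batch_size": 16,  # placeholder
    "tensorboard": {
        "enabled": True,
        "output_path": "output/ds_logs/",    # placeholder output directory
        "job_name": "style_transfer_train"   # placeholder job name
    },
    "wandb": {"enabled": False},
    "csv_monitor": {"enabled": False}
}

# model_engine, optimizer, _, _ = deepspeed.initialize(
#     model=model, model_parameters=model.parameters(), config=ds_config)
```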

Feel free to open another issue if there are additional requests for expanded logging capabilities.

Thanks, Lev