wandb / wandb

🔥 A tool for visualizing and tracking your machine learning experiments. This repo contains the CLI and Python API.
https://wandb.ai
MIT License

How to use the early terminate module in wandb? I do not quite understand the explanation in the docs. #4481

Open zlq147 opened 1 year ago

zlq147 commented 1 year ago

I am trying to use a wandb sweep to tune the hyperparameters of a model, and I am also trying to use the Hyperband early terminate method to speed it up.

However, I don't understand how this mechanism works from reading the docs https://docs.wandb.ai/guides/sweeps/define-sweep-configuration#early_terminate and the paper https://arxiv.org/abs/1603.06560.

In this paper, the authors propose the concept of a "resource". In my opinion, in the wandb setting, the "resource" should be the number of training epochs. However, in the "early_terminate" configuration, I can only see the parameters "s", "eta", "min_iter" and "max_iter", and from the explanation in the docs I do not understand what they really mean.

In the GitHub examples, it is hard to see whether early termination takes effect, so I hope there will be a simple piece of code that shows how it works. I also wonder whether the logged metric should be "valid_acc".

I would appreciate it if anyone could help me understand what the early terminate mechanism in wandb sweeps actually does, especially the meaning of the parameters and how to change the training code.
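To make the question concrete, here is a minimal sketch of the kind of example I have in mind. It is not taken from the wandb docs or examples. It assumes that each `wandb.log` call of the sweep metric counts as one Hyperband iteration (so logging once per epoch makes epochs the "resource"), and that the metric name in the sweep config must match the logged key. The search space, the values `min_iter=3` and `eta=3`, and the helper `fake_validation_accuracy` are placeholders invented for illustration.

```python
import math

# Sweep configuration as a plain dict. The parameter ranges and the
# early_terminate values are illustrative, not recommendations.
sweep_config = {
    "method": "random",
    "metric": {"name": "valid_acc", "goal": "maximize"},
    "parameters": {"lr": {"min": 0.0001, "max": 0.1}},
    "early_terminate": {"type": "hyperband", "min_iter": 3, "eta": 3},
}


def fake_validation_accuracy(lr, epoch):
    """Hypothetical stand-in for a real train-and-validate step."""
    return 1.0 - math.exp(-10.0 * lr * (epoch + 1))


def train():
    """One sweep run: log the sweep metric exactly once per epoch."""
    # Imported here so the config above can be inspected without wandb installed.
    import wandb

    with wandb.init() as run:
        lr = run.config["lr"]
        for epoch in range(20):
            valid_acc = fake_validation_accuracy(lr, epoch)
            # Assumption: one log call of the sweep metric = one iteration.
            run.log({"valid_acc": valid_acc, "epoch": epoch})
```

Such a sweep would be started with `sweep_id = wandb.sweep(sweep_config, project="my-project")` followed by `wandb.agent(sweep_id, function=train, count=10)`, where the project name and run count are again placeholders.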

luisbergua commented 1 year ago

Hi @zlq147, thanks for your question! Here you have a more detailed explanation of early termination. Please let me know if this is useful or if you have any other questions.

luisbergua commented 1 year ago

Hi @zlq147, I wanted to follow up here! Was this information useful? Do you have any other questions about this?

rbracco commented 1 year ago

I would like to add that while the current documentation is filled out and reasonably complete, I find it very hard to understand. I don't believe there are enough examples to understand what is going on. Here's precisely what I find unclear:

  1. Given that there is no simple explanation of Hyperband (just a link to the paper), I can't tell if it can be executed in a way that isn't exponential. What if I want to check for early stopping every 2nd epoch (i.e. linearly): is that possible, or is it only every 2^n-th epoch? What if my runs are 20 epochs and I want my min_iter to be 3: is my only option for multiple brackets to set eta=2 and stop at epoch 3 or 9?
  2. I understand now, but it took several rereadings to really grasp the relationship between min_iter, brackets, iterations (steps, epochs or something else) and eta. I guess there's no way to make it much simpler, as there are just a lot of variables and names here, but I do think more examples would help. Also, if the examples reiterated that the brackets correspond to the logging interval (steps, epochs), it would be clearer, because it makes things concrete in the user's head for whatever their particular use case is.
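As an illustration of the kind of example that would help, here is a rough sketch of the bracket arithmetic as I read it from the docs: with `min_iter`, brackets fall at `min_iter * eta**k`; with `max_iter` and `s`, they fall at `max_iter / eta**k` for `k = 1..s`. The function names, the `count` cutoff and the `run_length` filter are my own additions for illustration and are not part of wandb. The schedule is geometric by construction, and I am not aware of a way to express a linear one (e.g. every 2nd epoch) with these parameters.

```python
def brackets_from_min_iter(min_iter, eta=3, run_length=None, count=4):
    """Brackets at min_iter * eta**k for k = 0..count-1.

    If run_length is given, keep only brackets that a run logging
    run_length iterations could actually reach.
    """
    brackets = [min_iter * eta**k for k in range(count)]
    if run_length is not None:
        brackets = [b for b in brackets if b <= run_length]
    return brackets


def brackets_from_max_iter(max_iter, s, eta=3):
    """Brackets at max_iter / eta**k for k = 1..s (integer division)."""
    return [max_iter // eta**k for k in range(1, s + 1)]


print(brackets_from_min_iter(3))                        # [3, 9, 27, 81]
print(brackets_from_min_iter(3, eta=3, run_length=20))  # [3, 9]
print(brackets_from_min_iter(3, eta=2, run_length=20))  # [3, 6, 12]
print(brackets_from_max_iter(27, s=2))                  # [9, 3]
```

Under this reading, a 20-epoch run with min_iter=3 would be checked at epochs 3 and 9 when eta=3, and at epochs 3, 6 and 12 when eta=2.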

I am currently traveling and am not set up for a PR but could do this starting next week if there's interest and someone could review. Thank you.

luisbergua commented 1 year ago

Hi @rbracco, sorry for the long delay here! Just wanted to let you know that I submitted your feedback internally to improve our docs and better explain how the early terminate module works. Thanks a lot for sharing the detailed explanation!

JackCai1206 commented 1 year ago

I'd also love to see some concrete examples without having to dive into the paper!

luisbergua commented 1 year ago

Hi @JackCai1206, thanks for sharing the feedback! I'll share this with our team!

royvelich commented 9 months ago

Same here. The docs are not clear IMHO.

ziimiin14 commented 6 months ago

Hi @luisbergua, just following up. Is there any progress on this issue? I'm facing the same issue as stated above.

SuroshAhmadZobair commented 6 months ago

Hi @luisbergua, any updates?

luisbergua commented 4 months ago

Hi @SuroshAhmadZobair @ziimiin14, apologies for the delay. I'll bump the priority of this with our Docs Team.