syne-tune / syne-tune

Large scale and asynchronous Hyperparameter and Architecture Optimization at your fingertips.
https://syne-tune.readthedocs.io
Apache License 2.0
391 stars 51 forks

RTD Search broken + Question Regarding Status #601

Closed wistuba closed 1 year ago

wistuba commented 1 year ago

Search on RTD is broken, so I have to ask here whether documentation exists.

I'm interested to understand when this line is reached: https://github.com/awslabs/syne-tune/blob/97cefe99686397b10a1e4a4e7e3ca6f66071cb95/syne_tune/tuner.py#L585

Is there documentation for Status and SchedulerDecision?

wesk commented 1 year ago

Thanks for the report. I'm rebuilding the docs manually now to see if that fixes the search.

wesk commented 1 year ago

Looks like a regression was introduced into the docs that broke search; I'm looking into this. As an immediate mitigation, you can search using this older docs build: https://syne-tune--587.org.readthedocs.build/en/587/

See for example this search result for SchedulerDecision: https://syne-tune--587.org.readthedocs.build/en/587/_apidoc/syne_tune.optimizer.scheduler.html#syne_tune.optimizer.scheduler.SchedulerDecision

mseeger commented 1 year ago

Hello, SchedulerDecision is what is returned by scheduler.on_trial_result, which is triggered by a reported result being received. The value STOP means the running trial should be stopped.

Now, the trial may already have finished on its own: if the report came from its final epoch, it can end right after sending it. In this case, status == Status.completed, and we don't have to stop it.

So, the line you ask about is reached when the scheduler, in reaction to a reported result, asks to stop the trial, but the trial has not yet finished on its own. We then ask the backend to stop it.
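The condition described above can be sketched as follows. This is a minimal illustration, not the code in tuner.py: the two enums are simplified stand-ins defined here so the snippet is self-contained, their members and values are assumptions, and should_ask_backend_to_stop is a hypothetical helper name.

```python
from enum import Enum


class SchedulerDecision(Enum):
    # Stand-in for the scheduler's verdict on a reported result (assumed members).
    CONTINUE = "CONTINUE"
    PAUSE = "PAUSE"
    STOP = "STOP"


class Status(Enum):
    # Stand-in for trial states; only the ones mentioned in this thread.
    in_progress = "InProgress"
    completed = "Completed"
    stopping = "Stopping"
    stopped = "Stopped"


def should_ask_backend_to_stop(decision: SchedulerDecision, status: Status) -> bool:
    """Hypothetical helper: True only if the scheduler wants the trial
    stopped and the trial has not already finished on its own."""
    return decision is SchedulerDecision.STOP and status is not Status.completed
```

With these stand-ins, a STOP decision for a trial that is still in progress leads to a stop request, while a STOP decision for a trial that already reports completed does not.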

mseeger commented 1 year ago

SchedulerDecision is really simple, but Status is a bit more tricky, because there are both stopped and stopping. David did this. I tend to ignore stopping; it may never really be used. For a SageMaker job, stopping is the state between active and stopped.

mseeger commented 1 year ago

@wesk Why would search in the docs be broken?

wistuba commented 1 year ago

Are you saying that a Python process will not be killed as soon as it reports for the final epoch? Or are you saying it won't be killed if it is already dead? (local backend)

mseeger commented 1 year ago

The second, I think. We only observe the Status.completed value when the job really ends, and in that case we do not have to stop it again. David wrote this, so I am guessing a bit, but I am pretty sure.
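As a generic illustration of "don't stop what is already dead", here is a sketch using only the Python standard library. It is not Syne Tune's local backend code; stop_if_running is a hypothetical helper, and the two child processes exist only for the demonstration.

```python
import subprocess
import sys


def stop_if_running(proc: subprocess.Popen) -> bool:
    """Terminate proc only if it is still alive.

    Returns True if a stop was sent, False if the process had already exited.
    """
    if proc.poll() is not None:
        # The process already exited on its own, so there is nothing to stop.
        return False
    proc.terminate()
    proc.wait()
    return True


# Demonstration: one child that has already finished, one that is still running.
finished = subprocess.Popen([sys.executable, "-c", "pass"])
finished.wait()
running = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
```

Calling stop_if_running(finished) sends nothing, because the child has already exited, whereas stop_if_running(running) terminates the sleeping child.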

mseeger commented 1 year ago

@wesk This issue is now primarily about the search not working in our docs. It works with old Sphinx dependencies. I am checking what happens locally when the dependencies are re-installed.

mseeger commented 1 year ago

OK. I can confirm that:

mseeger commented 1 year ago

I can't find any recent reports of search failing in Sphinx. Dropping the ball here.

mseeger commented 1 year ago

OK, broken search is fixed by #602 (thanks, Martin!), and docs improved in #603