Closed: charlesfrye closed this pull request 1 month ago.
🚀 The docs preview is ready! Check it out here: https://modal-labs-examples--frontend-preview-6a2354c.modal.run
🚀 The docs preview is ready! Check it out here: https://modal-labs-examples--frontend-preview-b707869.modal.run
Thanks for updating!
> That means we can't use the `@decorator` style we normally use for Functions, but it's a small price to pay. I think we might want to use this alternative style more in our examples to make it more obvious that code can be brought in from elsewhere, without the Modal integration being defined "in-line".
My personal preference here is to still use a thin, decorated function that calls the `train` function defined on top. Although decorators are just higher-order functions, IMO we should be consistent with mostly using them as decorators so it doesn't confuse users who don't know much Python (which, sadly, is a big group).
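Concretely, that preference might look something like the sketch below; the app name and resource parameters are illustrative, not taken from the example.

```python
import modal

app = modal.App("example-train")  # hypothetical app name


def train(num_epochs: int = 10):
    """Plain training logic, no Modal-specific code."""
    ...


@app.function(gpu="any", timeout=60 * 60)  # illustrative resources
def train_remote(num_epochs: int = 10):
    # Thin wrapper: the Modal integration lives entirely in the decorator,
    # while the actual work stays in the plain train() defined above.
    return train(num_epochs=num_epochs)
```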
Turns out `--detach`ing is incompatible with the `try`/`except` style the previous example used. With that pattern, the local environment was providing the durability, so the local process terminating would lead to the training "job" terminating.

Following a suggestion from @aksh-at, I reworked the example to use `modal.Retries` instead. This gives you up to 10 days of execution without manual interruption, which ought to be enough for anybody.
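For reference, a minimal sketch of that pattern, with illustrative names and parameter values rather than the example's actual ones:

```python
import modal

app = modal.App("resilient-training")  # hypothetical app name


@app.function(
    timeout=24 * 60 * 60,  # each attempt may run for up to a day
    retries=modal.Retries(
        max_retries=10,  # ten day-long attempts: roughly ten days of runtime
        backoff_coefficient=1.0,  # restart promptly rather than backing off
    ),
)
def train():
    ...  # resume from the latest checkpoint on each attempt
```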
I also made some changes to improve the flow.
I now define the `train` function right at the top -- it roughly resembles the kind of function someone might bring to Modal, so it makes a great starting point. That means we can't use the `@decorator` style we normally use for Functions, but it's a small price to pay. I think we might want to use this alternative style more in our examples to make it more obvious that code can be brought in from elsewhere, without the Modal integration being defined "in-line".
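A rough sketch of that alternative style, with illustrative names and resources; since a decorator is just a function, the plain `train` can be registered after the fact:

```python
import modal

app = modal.App("example-train")  # hypothetical app name


def train(num_epochs: int = 10):
    """The kind of plain function someone might bring to Modal."""
    ...


# Because a decorator is just a higher-order function, the plain train()
# can be registered as a Modal Function after the fact, leaving its
# definition untouched at the top of the file.
remote_train = app.function(gpu="any", timeout=60 * 60)(train)
```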
I dropped the `nvidia/cuda` image. The `-base` image doesn't bring in anything we don't already have or that PyTorch doesn't install on its own. We only need it for things like host-managed JIT kernel compilation with `nvrtc` or libraries that expect CUDA to be installed system-wide.
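As a rough sketch of the simplification (the package list and Python version are illustrative), the image can be built on a plain Debian base with everything installed from pip:

```python
import modal

# A plain Debian base is enough here: the torch wheels from PyPI bundle the
# CUDA libraries they need, so the nvidia/cuda "-base" image adds nothing.
image = modal.Image.debian_slim(python_version="3.11").pip_install(
    "torch",
    "lightning",
)
```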
I split the definition of the resources used in the `app.function` call out from the call itself. This gives us more room to explain them, which is nice in cases like this one where we need to explain several (`timeout`, `retries`, and `volumes`).
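A minimal sketch of that split, with hypothetical names and values:

```python
import modal

app = modal.App("resilient-training")  # hypothetical app name

# Resources defined up front, where each one gets room for an explanation.
TIMEOUT = 24 * 60 * 60  # one day per attempt
retries = modal.Retries(max_retries=10)  # keep going through crashes and preemptions
volume = modal.Volume.from_name("training-data", create_if_missing=True)  # hypothetical volume


@app.function(timeout=TIMEOUT, retries=retries, volumes={"/data": volume})
def train():
    ...
```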
I moved all of the Lightning code to the bottom of the file, after the `local_entrypoint` -- out of sight without requiring a separate file.

Finally, I added experiment IDs so that our re-execution of the example in synmon would actually exercise most of the logic (not all: the data on the volume is persistent).