Closed: charlesfrye closed this pull request 1 month ago.
🚀 The docs preview is ready! Check it out here: https://modal-labs-examples--frontend-preview-6a2354c.modal.run
🚀 The docs preview is ready! Check it out here: https://modal-labs-examples--frontend-preview-b707869.modal.run
Thanks for updating!
> That means we can't use the `@decorator` style we normally use for Functions, but it's a small price to pay. I think we might want to use this alternative style more in our examples to make it more obvious that code can be brought in from elsewhere, without the Modal integration being defined "in-line".
My personal preference here is to still use a thin, decorated function that calls the `train` function defined on top. Although decorators are just higher-order functions, IMO we should be consistent with mostly using them as decorators so it doesn't confuse users who don't know much Python (which, sadly, is a big group).
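Concretely, that preference might look something like the sketch below; the app name and resource parameters are illustrative, not taken from the example.

```python
import modal

app = modal.App("example-train")  # hypothetical app name


def train(num_epochs: int = 10):
    """Plain training logic, no Modal-specific code."""
    ...


@app.function(gpu="any", timeout=60 * 60)  # illustrative resources
def train_remote(num_epochs: int = 10):
    # Thin wrapper: the Modal integration lives entirely in the decorator,
    # while the actual work stays in the plain train() defined above.
    return train(num_epochs=num_epochs)
```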
Turns out `--detach`ing is incompatible with the `try`/`except` style the previous example used. With that pattern, the local environment was providing the durability, so the local process terminating would lead to the training "job" terminating.

Following a suggestion from @aksh-at, I reworked the example to use `modal.Retries` instead. This gives you up to 10 days of execution without manual interruption, which ought to be enough for anybody.
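For reference, a minimal sketch of that pattern, with illustrative names and parameter values rather than the example's actual ones:

```python
import modal

app = modal.App("resilient-training")  # hypothetical app name


@app.function(
    timeout=24 * 60 * 60,  # each attempt may run for up to a day
    retries=modal.Retries(
        max_retries=10,  # ten day-long attempts: roughly ten days of runtime
        backoff_coefficient=1.0,  # restart promptly rather than backing off
    ),
)
def train():
    ...  # resume from the latest checkpoint on each attempt
```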
I also made some changes to improve the flow.
I now define the `train` function right at the top -- it roughly resembles the kind of function someone might bring to Modal, so it makes a great starting point. That means we can't use the `@decorator` style we normally use for Functions, but it's a small price to pay. I think we might want to use this alternative style more in our examples to make it more obvious that code can be brought in from elsewhere, without the Modal integration being defined "in-line".
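A rough sketch of that alternative style, with illustrative names and resources; since a decorator is just a function, the plain `train` can be registered after the fact:

```python
import modal

app = modal.App("example-train")  # hypothetical app name


def train(num_epochs: int = 10):
    """The kind of plain function someone might bring to Modal."""
    ...


# Because a decorator is just a higher-order function, the plain train()
# can be registered as a Modal Function after the fact, leaving its
# definition untouched at the top of the file.
remote_train = app.function(gpu="any", timeout=60 * 60)(train)
```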
I dropped the `nvidia/cuda` image. The `-base` image doesn't bring in anything we don't already have or that PyTorch doesn't install on its own. We only need it for things like host-managed JIT kernel compilation with `nvrtc` or libraries that expect CUDA to be installed system-wide.
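As a rough sketch of the simplification (the package list and Python version are illustrative), the image can be built on a plain Debian base with everything installed from pip:

```python
import modal

# A plain Debian base is enough here: the torch wheels from PyPI bundle the
# CUDA libraries they need, so the nvidia/cuda "-base" image adds nothing.
image = modal.Image.debian_slim(python_version="3.11").pip_install(
    "torch",
    "lightning",
)
```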
I split the definition of the resources used in the `app.function` call out from the call itself. This gives us more room to explain them, which is nice in cases like this one where we need to explain several (`timeout`, `retries`, and `volumes`).
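A minimal sketch of that split, with hypothetical names and values:

```python
import modal

app = modal.App("resilient-training")  # hypothetical app name

# Resources defined up front, where each one gets room for an explanation.
TIMEOUT = 24 * 60 * 60  # one day per attempt
retries = modal.Retries(max_retries=10)  # keep going through crashes and preemptions
volume = modal.Volume.from_name("training-data", create_if_missing=True)  # hypothetical volume


@app.function(timeout=TIMEOUT, retries=retries, volumes={"/data": volume})
def train():
    ...
```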
I moved all of the Lightning code to the bottom of the file, after the `local_entrypoint` -- out of sight without requiring a separate file.

Finally, I added experiment IDs so that our re-execution of the example in synmon would actually exercise most of the logic (not all: the data on the volume is persistent).