Open azayz opened 1 week ago
cc @hongpeng-guo
I believe async checkpointing can be done out of band using torch's built-in abilities, and the artifact upload can run on an extra thread so it doesn't block the training loop. Ideally, though, this would be provided as a generic Ray interface built on the object store, enabling distributed checkpointing, merging, and layered storage. I can propose a REP design for that, since we have already experimented a bit in this direction.
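To make the "extra thread for uploading" idea concrete, here is a minimal sketch of a fire-and-forget background upload. The helper name is hypothetical, and `shutil.copytree` stands in for the real S3 transfer (e.g. `pyarrow.fs.copy_files` against an S3 filesystem); this is not a Ray API.

```python
import shutil
import threading
from pathlib import Path

def upload_in_background(local_dir: Path, remote_dir: Path) -> threading.Thread:
    """Hypothetical helper: run the slow artifact upload on a separate
    thread so the training loop can continue immediately.

    shutil.copytree stands in for the real S3 transfer here.
    """
    t = threading.Thread(
        target=shutil.copytree, args=(local_dir, remote_dir), daemon=True
    )
    t.start()
    # Caller should join() before process exit so the upload completes.
    return t
```

The daemon flag keeps a hung upload from blocking interpreter shutdown, at the cost of requiring an explicit `join()` if you need a delivery guarantee.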
A best-practices section somewhere in the docs would probably also help Train users.
@Superskyyy That would be great! Maybe you could start with a quick sketch of your proposal as a github issue?
Cool, you mean in the main Ray repo, right? Not in the REP repo.
@Superskyyy Yep, just in the Ray repo for now.
Description
Hello Ray team, my team and I are using Ray for training. The model we save is 13 GB, and it takes around 20 minutes to upload to S3 storage; in the meantime, the GPU workers sit idle.
To maximize GPU utilization, we want to do this upload in the background, asynchronously.
What is the recommended Ray way to do this? If Ray doesn't support it, could you add support? If it's not on the Ray side, that's fine too.
Below is a sample of our code:
```python
custom_fs = pyarrow.fs.PyFileSystem(pyarrow.fs.FSSpecHandler(s3_fs))
```

in the train_func:
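For the pattern asked about here, one hedged sketch: inside the training function, block only for the fast local save, and hand the slow remote transfer to a single background thread, waiting on the previous upload so at most one is in flight. All helper names are hypothetical, and `shutil.copytree` stands in for the actual S3 transfer (e.g. `pyarrow.fs.copy_files` with a filesystem like `custom_fs` above).

```python
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Single-worker pool: uploads stay ordered and at most one is in flight.
_pool = ThreadPoolExecutor(max_workers=1)
_pending = None  # future for the previous checkpoint's upload

def save_checkpoint_async(state: bytes, local_root: Path,
                          remote_root: Path, step: int) -> None:
    """Hypothetical helper: block only for the fast local save,
    then push the slow remote transfer to the background thread."""
    global _pending
    local_dir = local_root / f"step_{step}"
    local_dir.mkdir(parents=True)
    (local_dir / "model.pt").write_bytes(state)  # fast local save

    if _pending is not None:
        _pending.result()  # backpressure: wait for the previous upload

    # shutil.copytree stands in for the real S3 upload.
    _pending = _pool.submit(
        shutil.copytree, local_dir, remote_root / f"step_{step}"
    )

def finish_uploads() -> None:
    """Call once at the end of training so no checkpoint is lost."""
    if _pending is not None:
        _pending.result()
    _pool.shutdown()
```

With this shape, the training loop pays only the local-disk write per checkpoint; the 20-minute S3 transfer overlaps with the next training steps, and the `result()` call bounds memory and disk pressure if uploads fall behind.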
Thank you!
Use case
No response