skypilot-org / skypilot

SkyPilot: Run LLMs, AI, and Batch jobs on any cloud. Get maximum savings, highest GPU availability, and managed execution—all with a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0

[Feature Request] Azure Blob Storage for Sky Storage #1271

Open michaelzhiluo opened 1 year ago

michaelzhiluo commented 1 year ago

During a conversation with one of our potential SkyPilot users, he mentioned that support for Azure Blob Storage would be nice to have, since he only has credits for Azure.

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.

github-actions[bot] commented 1 year ago

This issue was closed because it has been stalled for 10 days with no activity.

zaptrem commented 7 months ago

Seconded. When training on Azure, storing checkpoints within their ecosystem reduces bandwidth costs. Thanks!

Michaelvll commented 7 months ago

Thanks for requesting this feature @zaptrem! I re-opened the issue.

rafafael03 commented 5 months ago

Yes, this feature would be very useful in some situations. Just to mention some:

  1. The one already mentioned, when someone has credits on Azure.
  2. To work on an Azure-only project. This may be a requirement for some companies.
  3. So that someone who is used to Azure doesn't have to learn an entirely new cloud just to use SkyPilot.

SkyPilot is an awesome tool, and it would be awesome to democratize it even more ❤️

Uelfol commented 5 months ago

Please add support for Microsoft Azure Blob. I would also very much like to use SkyPilot for my tasks.

landscapepainter commented 5 months ago

@rafafael03 @Uelfol Glad to support the feature. Azure Blob is definitely within our scope, and we'll start working on it!

landscapepainter commented 5 months ago

Hello @rafafael03 @Uelfol, I have a question regarding resource groups and storage accounts. Is it important for you to be able to specify the names of the resource group and storage account when you create the container (the blob storage) through SkyPilot? I was wondering whether it would affect your workflow if we kept the resource group, storage account, and container names all identical, so that you provide only the container name (or storage account name).

rafafael03 commented 5 months ago

Hello @landscapepainter! Yes, for my use case it would be important to specify each field separately. Thank you for caring about these details :)

landscapepainter commented 5 months ago

@rafafael03 Thanks for the feedback. This is really helpful! I have another clarifying question: when using file mounts, there are two ways to use containers. One is to use a container that was created externally (not through SkyPilot), and the other is to create a new container while running sky launch.

For the case of creating a new container while running sky launch, is it important for you to be able to specify multiple different storage accounts? In other words, do you have to create containers under different storage accounts when running sky launch? Specifying different storage accounts is implemented for externally created containers, but not for containers newly created by sky launch.

rafafael03 commented 4 months ago

@landscapepainter I'm happy to help!

For my use case, reusing the same storage account would be fine. Since I have a project-based structure, separating projects into different containers seems to be enough for me.

But I could imagine a situation where it would be useful. Let's say that someone has a storage account for reusable data. Then for this specific case, it would be interesting to be able to use different storage accounts (one for the reusable things, the other for project-specific things).

To meet all needs, it may be a good idea to have this option. Or not, if it adds lots of extra work 😅

pompeuesilvaGuilherme commented 3 months ago

Could you please incorporate support for Microsoft Azure Blob? It would be greatly beneficial for me to utilize SkyPilot in my tasks!

landscapepainter commented 2 months ago

@zaptrem @Uelfol @rafafael03 @pompeuesilvaGuilherme Thanks for your interest and patience! Azure Blob Storage support is now fully implemented in this branch, and it will be merged into our master branch after review. Please feel free to test it out. I'd be grateful for any feedback!

rafafael03 commented 2 months ago

Oh, thank you so much @landscapepainter! I'll start some experiments here from my side 🚀

rafafael03 commented 2 months ago

Hi @landscapepainter!

I ran some experiments, and it worked flawlessly for me!

As for the storage, I can't speak to COPY mode because I only used MOUNT mode. Everything worked fine, and I actually got better training times (with a T4 16GB GPU) than on my local GPU (RTX 2060 12GB).

I just have two questions that aren't related to the Azure Blob Storage feature itself, so I don't know if this is the right place to ask. I'll add them here, but I can edit this comment to remove them if needed.

The first question is whether there is a way to pass parameters to the sky launch or sky exec commands. I would like to do so in order to change things like the learning rate without editing the task YAML file.

The second question is whether the autostop idle time starts counting when the job is submitted or only when the job finishes (succeeded or failed). In other words, do I have to set the idle time to minutes_after_training_finished or to training_time + minutes_after_training_finished?

Thank you for this amazing feature of adding support for Azure Blob in SkyPilot! And thank you in advance for the answers!!

landscapepainter commented 2 months ago

Hi @rafafael03, thanks for trying it out! If you have general questions, the fastest way to get an answer is to ask in our Slack (http://slack.skypilot.co) in the #skypilot-users channel.

rafafael03 commented 2 months ago

Oh, I didn't know about the Slack community. Thank you for referencing it @landscapepainter!

In the meantime, I found a "bug", though I'm not sure I can call it that. When I launch two (or more) tasks that mount the same blob, one right after the other, the second one fails because the blob is blocked while it is being mounted. Waiting about 1 minute after launching the first task solves it: the second task then launches with no errors.

It's not really an issue in SkyPilot itself, but probably a limitation of Azure Blob Storage. But SkyPilot could try to avoid this error by (1) tracking all launched tasks and, if more than one tries to mount the same blob, waiting until the previous task finishes its mounting/launch process, and/or (2) adding a retry to the mounting process (one that waits about 1 minute).

With that, someone could launch as many tasks as they want without any failures and without needing to wait 1 minute between launching parallel tasks 🙂
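To make option (2) concrete, here is a minimal Python sketch of a retry wrapper around a mount step. It is only an illustration, not SkyPilot's implementation: MountBusyError, mount_with_retry, and the mount_fn callable are hypothetical names, and the 60-second wait simply mirrors the "about 1 minute" observed above.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class MountBusyError(Exception):
    """Hypothetical error: the blob is still blocked by another mount."""


def mount_with_retry(
    mount_fn: Callable[[], T],
    max_attempts: int = 3,
    wait_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call mount_fn, waiting wait_seconds between attempts while it is busy.

    mount_fn stands in for whatever actually performs the mount (assumption).
    Errors other than MountBusyError are not retried and propagate at once.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return mount_fn()
        except MountBusyError:
            if attempt == max_attempts:
                # Out of attempts: surface the original error to the caller.
                raise
            sleep(wait_seconds)
    raise AssertionError("unreachable")
```

The sleep function is a parameter only so that the wrapper can be exercised without actually waiting; a caller would normally leave it at the default.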

This is not a big pain, and for sure SkyPilot is already helping a lot!!! Thank you again for everything you've done, @landscapepainter!

landscapepainter commented 2 months ago

@rafafael03 Thanks for sharing the issue and solution! If you still happen to have the script that caused this problem, please share it so I can reproduce and fix your scenario for Azure Blob Storage along with the other storage mounts (S3, GCS). Feel free to leave a comment describing the issue, with the script, on the PR: https://github.com/skypilot-org/skypilot/pull/3032

WilliamGazeley commented 2 weeks ago

Any chance this will be merged into the nightly build soon?

landscapepainter commented 3 days ago

Hey @WilliamGazeley, thanks for checking in. This is expected to be merged into the nightly build very soon.