googlecolab / colabtools

Python libraries for Google Colaboratory
Apache License 2.0

Sessions running in the background with Colab Pro+ cannot be continued, though they still run. #3764

Closed aegonwolf closed 1 year ago

aegonwolf commented 1 year ago

Hi,

I've run into an issue: I have a session that has been executing in the background for about 10 hours, and I've tried to open it again. It shows that it's still running. When I click on it in Manage sessions I get a window as usual, but it's stuck somewhere in between. It asks me to connect, and if I click that, it starts a new instance. At the same time, before connecting I can click on Disconnect and delete runtime (I haven't tried that).

This is very frustrating. For one, this is an A100 session, so it uses up significant resources, and I am paying for this.

For another, I am in a time crunch and I need this done; losing 10-12 hours is not acceptable. I have tried lambdalabs.ai and am likely moving more work there now. The lack of real support is one of the main reasons.

yosuaputra1 commented 1 year ago

I'm having this issue as well. It cost me 200 compute units, and I can't reopen the active session.

giantvision commented 1 year ago

Same issue. I cannot stop it. I only opened one session for training work, but 4 sessions are displayed in the background. I have been trying to close the unused sessions, but every attempt failed, and now the remaining three unused sessions are consuming my compute units.

aegonwolf commented 1 year ago

I tried again today; in total I've lost 500+ compute units. It has basically wasted everything I paid for. Support is only for billing, and they say this isn't billing. But the only reason this is an issue is that it's "billing" compute units. Why would you pay for Pro+ and Pro if you can't do background execution? That's false advertising, and the continuing deduction of compute units is borderline fraud.

EvanWiederspan commented 1 year ago

Tracking internally as b/287473576

aegonwolf commented 1 year ago

Any update on this? This still keeps happening, and I've also experienced the other people's issue where you can't disconnect and delete a runtime anymore. I think this is quite an urgent issue, as it wastes resources (Colab's) and money (users').

yosuaputra1 commented 1 year ago

Looking forward to the update. I believe this is a serious issue that needs to be handled quickly, as background execution is the main reason a lot of people subscribe to Colab Pro+.

hosungs commented 1 year ago

To diagnose this issue further, could you please send in-product feedback (Help > Send feedback from https://colab.research.google.com/) with a reference to this GitHub issue in your feedback?

hosungs commented 1 year ago

We investigated this issue further (after receiving a user's in-product feedback about this same issue), and we now believe it is highly likely caused by the executed notebook being publicly shared (meaning read-only for all users except the original author). This is such an example notebook. The workaround in that case is to make a copy of the publicly shared read-only notebook BEFORE you change any notebook setting (e.g., GPU type to A100), and to run the code cells on your copy rather than on the shared read-only notebook. This should allow Pro+ customers to reconnect to background-executing sessions without any problem. It would be great if OP could confirm whether this was the case and whether the workaround works. We'll discuss how to handle this situation better and follow up. Thanks.

aegonwolf commented 1 year ago

None of my notebooks are publicly shared. It happens with multiple notebooks from unrelated projects. I've sent feedback before without getting any reply, and I've done so again. Curiously, the GPU type is another issue I've been having in the past month: you can choose A100 and it shows A100, but it's the smaller GPU that actually runs and gets connected. Also, this workaround is something I tried myself the first time this happened, because I was hoping to keep working on a copy and maybe be able to reconnect to the original in the meantime.

hosungs commented 1 year ago

Thanks @aegonwolf for your response here and also for your in-product feedback. I looked at it and confirmed that your notebooks are not public, so you are experiencing another, possibly very tricky issue. The issue here is that your updated GPU type setting (A100) in the Notebook settings dialog (Runtime > Change runtime type) is somehow lost in the Google Drive notebook file. When you then try to reconnect to the running session for the notebook from a new browser session (e.g., about 10 hours later, as you mentioned originally), Colab doesn't see the running session, because it needs to match the running session's GPU type with the saved notebook settings, and thus it can't reconnect.

A situation like this can happen if you have multiple tabs open for the same notebook, run the notebook from one tab after updating the GPU type setting, and then somehow overwrite that setting from another tab, likely through a notebook autosave failure caused by a merge conflict followed by manually saving the original version. I understand this might be a highly unlikely corner case, but it is the first thing that comes to my mind, and I'm wondering whether you can confirm it. Also, does this issue occur consistently on certain notebooks and/or for specific users, or does it happen only some of the time? We'll keep looking into whether this issue might be caused by some other bug on our side.
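
As a rough illustration of the matching described above, here is a minimal Python sketch that reads the GPU type saved in a downloaded `.ipynb` file and compares it with the GPU type of a running session. It assumes the saved setting lives under `metadata.accelerator` and `metadata.colab.gpuType`, which is how Colab-exported notebooks appear to store it; the key names, the helper functions, and the comparison itself are illustrative assumptions, not Colab's actual reconnect logic.

```python
import json
from typing import Optional


def load_notebook(path: str) -> dict:
    """Load a downloaded .ipynb file as a plain dictionary."""
    with open(path, "r", encoding="utf-8") as handle:
        return json.load(handle)


def saved_gpu_type(notebook: dict) -> Optional[str]:
    """Return the GPU type recorded in the notebook metadata, if any.

    Assumption: the GPU choice is stored under metadata.colab.gpuType
    and metadata.accelerator is "GPU" when a GPU runtime is selected.
    """
    metadata = notebook.get("metadata", {})
    if metadata.get("accelerator") != "GPU":
        return None
    return metadata.get("colab", {}).get("gpuType")


def settings_match(notebook: dict, running_gpu_type: str) -> bool:
    """Hypothetical check: does the saved setting match the running session?"""
    return saved_gpu_type(notebook) == running_gpu_type


# Example: a notebook whose A100 setting was saved, and one where it was lost.
saved = {"metadata": {"accelerator": "GPU", "colab": {"gpuType": "A100"}}}
lost = {"metadata": {"colab": {"provenance": []}}}

print(settings_match(saved, "A100"))
print(settings_match(lost, "A100"))
```

If the second case applies to a notebook with a background session, the saved file no longer records the A100 choice, which is the situation described above.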

In the meantime, to avoid this issue and also to help us debug this better, may I ask the following?

I also wanted to mention A100 vs. V100. A100 demand in Colab is quite high these days, and when an A100 can't be assigned quickly enough, Colab falls back to a V100 with a notification message saying "Selected GPU type is not available. Trying another GPU type. Check the GPU type setting in Runtime > Change runtime type." If this happens, the GPU type should be updated to V100 in the notebook settings dialog. If you'd like to retry to get an A100, you'll need to disconnect and delete the current runtime (V100), change the GPU type to A100 in the notebook settings dialog again, save, and click Connect again. Unfortunately, we still can't guarantee an A100 on a retry, so you may still get a V100. Deleting V100 runtimes immediately should not cost you any compute units if done quickly enough. Hope this helps, and thanks for using Colab and reporting this issue to us.
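
Since the fallback notification is easy to miss, one way to confirm which GPU was actually attached is to query `nvidia-smi` at the top of the notebook before starting a long run. The following is a minimal sketch, assuming `nvidia-smi` is available on the runtime; the helper names and the sample device strings are illustrative assumptions.

```python
import subprocess
from typing import List


def parse_gpu_names(nvidia_smi_output: str) -> List[str]:
    """Split `nvidia-smi --query-gpu=name` output into one name per GPU."""
    return [line.strip() for line in nvidia_smi_output.splitlines() if line.strip()]


def attached_gpu_names() -> List[str]:
    """Return the names of the attached GPUs, or an empty list if none is found."""
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            check=True,
        )
    except (FileNotFoundError, subprocess.CalledProcessError):
        # No driver or no GPU on this runtime.
        return []
    return parse_gpu_names(result.stdout)


def matches_requested(requested: str, names: List[str]) -> bool:
    """True if any attached GPU name contains the requested type, e.g. "A100"."""
    return any(requested.lower() in name.lower() for name in names)


names = attached_gpu_names()
if not matches_requested("A100", names):
    print("Requested A100 but got:", names or "no GPU")
```

Running this check first makes it possible to disconnect and delete a fallback runtime right away, rather than discovering the smaller GPU through an out-of-memory error later.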

aegonwolf commented 1 year ago

@hosungs thank you for the reply. I wouldn't have more than one tab open intentionally. Autosave fails a lot these days, so it's possible that is the issue. I do often see that I have a V100 instead of an A100, though I never see the notification; I just run out of memory and then do the disconnect and delete (which, by the way, often also doesn't work these days; I think I've commented on that issue reported by another user).

I will try to remember, though I am somewhat afraid of doing this. I've wasted 1000 or more units on this, because the only time I would use background execution was overnight. I appreciate that you (hopefully) found the issue, and I may try again, though my budget isn't unlimited and these units cost money.

aegonwolf commented 1 year ago

@hosungs You probably aren't the right person for this. I apologize, this is just extremely frustrating. After this happened a few times, I also contacted support and asked for a refund; I have records of purchasing an extra 500 compute units twice. Billing cancelled my subscription and refunded the units I had left (which wasn't much) after the rest had been wasted. I didn't expect a refund, but this logic is completely insane: the more units are wasted, the less you refund, and on top of that you cancel the subscription.

colaboratory-team commented 1 year ago

@aegonwolf Apologies for making you go through this again, but please send in-product feedback once more, referencing this GitHub thread and your GitHub ID, and we'll verify the info and try to refund you for the wasted compute units.