googlecolab / colabtools

Python libraries for Google Colaboratory
Apache License 2.0

OSError: [Errno 107] Transport endpoint is not connected #3441

Open ioritree opened 1 year ago

ioritree commented 1 year ago

Hello, I am a Colab Pro+ user (Chrome). For a while now I have been hitting "OSError: [Errno 107] Transport endpoint is not connected" while training my model. It happens almost once a day. How can I solve it?

While this is happening, the training process keeps running but no new model progress is saved, and compute units keep being deducted. If you don't pay attention, the whole day's compute units are used up.

[Video attached to the original issue; click the picture to watch.]

Every day after the reset I start training, and about ten hours in this problem appears. I'm going crazy; each epoch takes at least 3 to 4 hours!

This has been happening every day since 2023/2/17, when Colab updated something. The picture shows training at 8x% progress, which took 3 hours (so at least 3 hours of training time per day is lost).

Mount method:

    from google.colab import drive
    drive.mount('/content/drive')

CathySunshine commented 1 year ago

I had the same issue. I tried several ways to report this issue but have not received a reply.

ioritree commented 1 year ago

> I had the same issue. I tried several ways to report this issue but have not received a reply.

Every day after the reset, this problem appears about ten hours into training.

anthonyromyn commented 1 year ago

looks like I'm getting the same problem

yanavdsande commented 1 year ago

I have the same problem - my units are deducted but the training was too short to be useful. I have saved my checkpoints and started training from there but my units are still gone.. @CathySunshine have you had a reply yet?

ioritree commented 1 year ago

> I have the same problem - my units are deducted but the training was too short to be useful. I have saved my checkpoints and started training from there but my units are still gone.. @CathySunshine have you had a reply yet?

No reply at all :(

anthonyromyn commented 1 year ago

Small update on my end: looks to happen to me after only 4.5 hours. Looking into it more today

CathySunshine commented 1 year ago

> I have the same problem - my units are deducted but the training was too short to be useful. I have saved my checkpoints and started training from there but my units are still gone.. @CathySunshine have you had a reply yet?

Yes. It seems to be a consequence of the mounted drive getting disconnected (#3451). However, they are still investigating the Drive disconnection issue.

ioritree commented 1 year ago

> > I have the same problem - my units are deducted but the training was too short to be useful. I have saved my checkpoints and started training from there but my units are still gone.. @CathySunshine have you had a reply yet?
>
> Yes. It seems to be a consequence of the mounted drive getting disconnected (#3451). However, they are still investigating the Drive disconnection issue.

At least that sounds like good news. I hope they can fix it as soon as possible.

LeapGamer commented 1 year ago

I'm getting the same error in training, but colab runtime is still connected. My epochs take ~4hours, and I get this error when trying to save the data for the 3rd epoch during the run at the 12th hour:

FailedPreconditionError: ...drive_path... Transport endpoint is not connected

ranjitkathiriya commented 1 year ago

looks like I'm getting the same problem

panyan7 commented 1 year ago

I'm getting the same problem. Waiting for a solution.

Great-Bucket commented 1 year ago

To test this problem more thoroughly, I wrote a very simple script that makes a calculation with the GPU and then posts a 'dummy_file' to my Google Drive every 10 minutes. Though I expected it to run for 24 hours, the dummy files were only written for 9.7 hours. The final error message was 'Transport endpoint is not connected'.

More details: #3562
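A probe like the one described in this comment could be sketched as follows. This is not the script from the report, only a minimal illustration: the function name `write_heartbeats`, the file naming scheme, and the default of 144 writes (24 hours at 10-minute intervals) are assumptions, and the GPU calculation is left out.

```python
import errno
import time
from datetime import datetime, timezone
from pathlib import Path


def write_heartbeats(target_dir, interval_s=600, max_writes=144, sleep=time.sleep):
    """Write a small timestamped file every `interval_s` seconds.

    Returns the number of files written before the first OSError
    (for example ENOTCONN once the Drive mount has dropped).
    """
    target = Path(target_dir)
    written = 0
    for i in range(max_writes):
        stamp = datetime.now(timezone.utc).isoformat()
        try:
            target.mkdir(parents=True, exist_ok=True)
            (target / f"dummy_file_{i:04d}.txt").write_text(stamp + "\n")
        except OSError as exc:
            # Report the symbolic errno name alongside the message, then stop.
            name = errno.errorcode.get(exc.errno, "unknown")
            print(f"write {i} failed at {stamp}: [{name}] {exc}")
            break
        written += 1
        sleep(interval_s)
    return written
```

In a notebook this might be called as `write_heartbeats('/content/drive/MyDrive/heartbeat')`, where the folder name is a placeholder. The timestamp of the last file written then shows roughly when the mount dropped.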

dcloveUIUC commented 1 year ago

Getting this same issue - any updates?

nathanmoura commented 1 year ago

Same problem. Waiting for a solution

noirmist commented 1 year ago

Same Problem. Any updates?

Amosharid commented 1 year ago

Same problem. Any updates?

huangeddie commented 1 year ago

Same problem. Any updates?

TeoCavi commented 1 year ago

Same problem, Any updates?

MateuszPuto commented 1 year ago

For me it was 4 hours. Pretty annoying bug for a deep learning environment.

olaviinha commented 1 year ago

Facing the same problem recently.

So far with only one particular, oldish notebook. So not sure if related to notebook or just coincidentally occurred now with that notebook. Was suspecting from the error messages that after a couple of hours Colab is for some reason unable to continue reading my files from Drive.

GabrielOlem commented 1 year ago

Same problem, Any updates?

mmarr96 commented 1 year ago

same issue, this is absurd

Warajet commented 1 year ago

I have been facing this problem for a few months now, and it has left me unable to finish my thesis on time! Please fix it ASAP! This problem is really annoying! What a waste of money paying for Google Colab Pro+.

Great-Bucket commented 1 year ago

Update: starting 30 June my colab connection is dropping after <4hours, prior to that it was <10hours. I am experiencing this with 2 different google accounts which exhibit the same behaviour.

mmarr96 commented 1 year ago

still facing the cut off at 4 hours, might be time to give up with colab

filipemesquita commented 1 year ago

This issue has been happening in less than 4 hrs now.

ackrds commented 1 year ago

Same issue has happened with me. Though the running time was less than 10 hours in my case.

WALLERR commented 1 year ago

The same issue, the session can never run over 4 hours.

ackrds commented 1 year ago

I have figured it out. You can import Google's API to download whatever data you want into your session. Drive can be mounted for only so long, but downloaded data remains as long as the session runs.

Archicyr commented 1 year ago

Same issue here. Cut off around 3-4 hrs running. Colab Pro subscription.

asal9122 commented 1 year ago

Same issue here after 4-5 hours I get disconnected in middle of epochs

Colab pro plus!!!

olaviinha commented 1 year ago

Still facing this issue after a month. Just adding that:

  1. it happens also with freshly created notebooks (not just "oldish" ones, as was my experience a month ago).

  2. Since devs are referencing https://research.google.com/colaboratory/faq.html#drive-timeout in other similar reported issues, I want to state that this same old Drive disconnection happened just now after 4 hours while I was accessing 2 folders in Drive during the session, both of which had 0-5 files.

fcx245317735 commented 1 year ago

I see! If I train my model for more than about 4 to 5 hours, Colab keeps giving the same error like yours.

luccahuguet commented 1 year ago

same problem, please solve this!

Ruoqi277 commented 1 year ago

Same problem, any solution??

gallen881 commented 1 year ago

Same issue, waiting for the solution.

aaronlifenghan commented 1 year ago

I have the same error. I have the Pro version of Colab. After about 6 hours, the fine-tuning finished its 30 iterations as designed, but the error was reported when the notebook was supposed to carry out a prediction task. That happened yesterday. Today I remounted Drive and reran the cell, and it ended with the same error, so the remount did not solve my problem.

[Screenshot 2023-08-17 at 18 49 44]

aaronlifenghan commented 1 year ago

> I have the same error. I have the Pro version of Colab. After about 6 hours, the fine-tuning finished its 30 iterations as designed, but the error was reported when the notebook was supposed to carry out a prediction task. That happened yesterday. Today I remounted Drive and reran the cell, and it ended with the same error, so the remount did not solve my problem.

Update on the second day (today): the error disappeared when I changed the iteration count from 30 to 2 for the fine-tuning step. The run output is available on our project page (https://github.com/HECTA-UoM/M3/tree/main) in case anyone is interested. I still do not know why the error happened yesterday with iteration=30, since I have the Pro version and the run only took 5 hours.

Nayrouzzz commented 1 year ago

it happened after running it for only 4 hours !

olaviinha commented 1 year ago

I don't think this problem is related to which plan you are subscribed to or what exactly you are doing in Colab, as long as you are accessing/writing files on Google Drive. I haven't had free plan for years and am experiencing this problem with various notebooks all summer...

LiuBodan commented 1 year ago

I'm having similar issue, not sure how to solve it

mainguyenanhvu commented 1 year ago

I'm having the same issue. I just reset my runtime and lost all variables.

mmarr96 commented 1 year ago

seeing this error every single day

brunotakara commented 1 year ago

I got the same issue after somewhere between 3h30 and 4h. Pretty frustrating for deep learning applications.

Waynekagawa commented 1 year ago

I have the same problem on running a PyMC script and a separate PyTorch script. Both of them take >4 hrs to finish. Please fix this :( I have to finish my dissertation.

simonep1052 commented 1 year ago

I have faced the same issue for weeks. My notebooks hit this error after approximately 4 hours of usage, and I'm also a Colab+ user.

While I haven't yet discovered a definitive solution to this problem, I did find a workaround that allows me to complete all my training epochs. The method is rather straightforward: I refrain from mounting Google Drive when running my notebooks. To be more specific, I copied all my scripts and files directly to the notebook's local drive, placing them in a directory like /myscripts. This is just a new directory located at the same level as /content; you could use another directory name. Just make sure Google Drive is not mounted while training runs. This approach has proven effective for me.

Here are the statistics of my training run:

  • GPU used: V100 High-RAM
  • Training time (including setting up the notebook environment): 14 hours.

Sara-hub18 commented 1 year ago

> I have faced the same issue for weeks. My notebooks hit this error after approximately 4 hours of usage, and I'm also a Colab+ user.
>
> While I haven't yet discovered a definitive solution to this problem, I did find a workaround that allows me to complete all my training epochs. The method is rather straightforward: I refrain from mounting Google Drive when running my notebooks. To be more specific, I copied all my scripts and files directly to the notebook's local drive, placing them in a directory like /myscripts. This is just a new directory located at the same level as /content; you could use another directory name. Just make sure Google Drive is not mounted while training runs. This approach has proven effective for me.
>
> Here are the statistics of my training run:
>
>   • GPU used: V100 High-RAM
>   • Training time (including setting up the notebook environment): 14 hours.

Can you post an image to explain more, please? @simonep1052

olaviinha commented 1 year ago

> Can you post an image to explain more, please?

As a workaround, you can manually upload files to the Colab runtime instead of using the Drive mount, by right-clicking in the Files tab or using the upload button at the top:

image

If you upload a bunch of files as ZIP, you can run unzip /content/<filename>.zip in terminal or !unzip /content/<filename>.zip in a cell after upload.

Downloading works in similar fashion, by right-click on file:

image

Or if you want to zip a bunch of files first for easier download: !zip /content/<filename>.zip /content/path/to/files/* then just download the zip.

Uploading/downloading manually is very slow in my experience though. Painful if working with large files.
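The same zip and unzip steps can also be done from a notebook cell in Python with the standard library's `shutil`. This is a minimal sketch; the helper names `zip_folder` and `unzip_to` are made up for illustration and are not something Colab provides.

```python
import shutil
from pathlib import Path


def zip_folder(src_dir, zip_path):
    """Pack everything under `src_dir` into `zip_path` and return the archive path."""
    zip_path = Path(zip_path)
    # make_archive appends ".zip" itself, so pass the name without the suffix.
    base = zip_path.with_suffix("")
    return Path(shutil.make_archive(str(base), "zip", root_dir=str(src_dir)))


def unzip_to(zip_path, dest_dir):
    """Extract an uploaded ZIP into `dest_dir`, creating the directory if needed."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    shutil.unpack_archive(str(zip_path), str(dest), "zip")
    return dest
```

For example, `unzip_to('/content/<filename>.zip', '/content/data')` after an upload, or `zip_folder('/content/path/to/files', '/content/<filename>.zip')` before a download (the `/content/data` folder is a placeholder). Unlike the `!zip` command above, paths inside the archive are stored relative to the source folder.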

perceptionisfun commented 1 year ago

I found a way to solve this issue. Apparently this is caused by a google drive bug. I copied all my files from google drive to google cloud storage and then loaded my data from google cloud storage to colab.

here is how you can copy your files from google drive to google cloud storage (you can also just directly upload your files to google cloud storage):

https://stackoverflow.com/questions/48122091/copy-file-from-google-drive-to-google-cloud-storage-within-google

here is how you can load your data from google cloud storage to colab:

https://stackoverflow.com/questions/66938971/is-there-a-way-to-use-the-data-from-google-cloud-storage-directly-in-colab

hope this helps!!
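For the second step, loading from Cloud Storage into the Colab runtime, a minimal sketch with the `google-cloud-storage` client might look like this. The helper names, bucket name, prefix, project ID, and destination folder are all placeholders, and the linked answers may use other approaches such as `gsutil`.

```python
from pathlib import Path


def download_prefix(client, bucket_name, prefix, dest_dir):
    """Copy every object under `prefix` in a bucket to the runtime's local disk.

    `client` is expected to behave like google.cloud.storage.Client.
    Returns the list of local paths written.
    """
    dest = Path(dest_dir)
    saved = []
    for blob in client.list_blobs(bucket_name, prefix=prefix):
        if blob.name.endswith("/"):
            continue  # skip "folder" placeholder objects
        local_path = dest / blob.name
        local_path.parent.mkdir(parents=True, exist_ok=True)
        blob.download_to_filename(str(local_path))
        saved.append(local_path)
    return saved


def make_colab_client(project_id):
    """Authenticate inside Colab and build a storage client (only works in Colab)."""
    from google.colab import auth
    from google.cloud import storage

    auth.authenticate_user()
    return storage.Client(project=project_id)
```

Usage would be something like `download_prefix(make_colab_client('my-project'), 'my-bucket', 'dataset/', '/content/data')`. Once the files are on local disk, training no longer depends on the Drive mount staying connected.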

simonep1052 commented 1 year ago

> > I have faced the same issue for weeks. My notebooks hit this error after approximately 4 hours of usage, and I'm also a Colab+ user. While I haven't yet discovered a definitive solution to this problem, I did find a workaround that allows me to complete all my training epochs. The method is rather straightforward: I refrain from mounting Google Drive when running my notebooks. To be more specific, I copied all my scripts and files directly to the notebook's local drive, placing them in a directory like /myscripts. This is just a new directory located at the same level as /content; you could use another directory name. Just make sure Google Drive is not mounted while training runs. This approach has proven effective for me. Here are the statistics of my training run:
> >
> >   • GPU used: V100 High-RAM
> >   • Training time (including setting up the notebook environment): 14 hours.
>
> Can you post an image to explain more, please? @simonep1052

It is just as @olaviinha and @perceptionisfun mentioned. Here I summarize the steps I use to complete my training, which should achieve the same result as their solutions.

  1. Open my .ipynb notebook.

  2. Mount Google Drive.

  3. Use the !cp sourceFileName -R targetFolderPath command to copy the necessary files from my Google Drive to the Colab runtime's local disk. In my specific case, I run !cp '/content/drive/MyDrive/Colab Notebooks/project1/' -R '/project1/'. You can observe the copied files within the directory in the image below:

image

It may be safer to copy the files into the /content folder instead, to avoid further problems.

  4. Unmount Google Drive by:
    from google.colab import drive
    drive.flush_and_unmount()
  5. Start running my training code in the notebook.
  6. Once the training is completed and confirmed, if you wish to save the newly generated files (such as the trained weights file) from the runtime's local disk, simply remount Google Drive and use the !cp sourceFileName -R targetFolderPath command again. (For this step, other people's methods might be better, but I have not tried them.)

*Please note that this approach is simply a means to complete my training without encountering the issue and does not resolve the OSError: [Errno 107] problem.
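The steps above could be scripted roughly like this. It is a sketch only: `stage_in`, `stage_out`, `run_without_mount`, and the `train` callback are hypothetical names, the directories are whatever you pass in, and the Drive calls (`drive.mount`, `drive.flush_and_unmount`) only work inside a Colab runtime.

```python
import shutil
from pathlib import Path

DRIVE_MOUNT = "/content/drive"


def stage_in(drive_dir, local_dir):
    """Copy the project from the mounted Drive folder to local disk."""
    return Path(shutil.copytree(drive_dir, local_dir, dirs_exist_ok=True))


def stage_out(local_dir, drive_dir):
    """Copy results (for example trained weights) back to Drive."""
    return Path(shutil.copytree(local_dir, drive_dir, dirs_exist_ok=True))


def run_without_mount(drive_project, local_project, train):
    """Mount, copy in, unmount, train on local disk, remount, copy out."""
    from google.colab import drive  # available only inside Colab

    drive.mount(DRIVE_MOUNT)
    stage_in(drive_project, local_project)
    drive.flush_and_unmount()  # Drive stays unmounted while training runs

    train(local_project)  # user-supplied training function

    drive.mount(DRIVE_MOUNT)
    stage_out(local_project, drive_project)
    drive.flush_and_unmount()
```

With the paths from step 3 this would be `run_without_mount('/content/drive/MyDrive/Colab Notebooks/project1', '/project1', train)`. Note that `stage_out` copies the whole local folder back, so existing files with the same names on Drive are overwritten. Like the manual steps, this only sidesteps the error and does not fix it.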