Open ioritree opened 1 year ago
I had the same issue. I tried several ways to report this issue but have not received a reply.
Every day after the reset, this problem appears after about ten hours of training
looks like I'm getting the same problem
I have the same problem - my units are deducted but the training was too short to be useful. I have saved my checkpoints and started training from there but my units are still gone.. @CathySunshine have you had a reply yet?
No reply yet :(
Small update on my end: it seems to happen to me after only 4.5 hours. Looking into it more today.
Yes. It seems to be a consequence of the mounted drive getting disconnected (#3451). However, they are still investigating the disconnecting mounted drive issue.
At least it sounds like good news, I hope they can fix it as soon as possible
I'm getting the same error in training, but colab runtime is still connected. My epochs take ~4hours, and I get this error when trying to save the data for the 3rd epoch during the run at the 12th hour:
FailedPreconditionError: ...drive_path... Transport endpoint is not connected
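One way to avoid losing an epoch when the save to Drive fails is to fall back to the runtime's local disk. This is only a minimal sketch, not a fix for the disconnect: `save_with_fallback` and the `/content/fallback` directory are hypothetical names, and it only catches `OSError` (which covers `[Errno 107] Transport endpoint is not connected`). A framework-specific exception such as the FailedPreconditionError above may need its own `except` clause.

```python
import errno
import os
import pickle


def save_with_fallback(obj, drive_path, fallback_dir="/content/fallback"):
    """Pickle obj to drive_path; on OSError, write to fallback_dir instead.

    Returns the path that was actually written.
    """
    try:
        with open(drive_path, "wb") as f:
            pickle.dump(obj, f)
        return drive_path
    except OSError as exc:
        # errno.ENOTCONN is the "Transport endpoint is not connected" code.
        if exc.errno == errno.ENOTCONN:
            reason = "Drive mount disconnected"
        else:
            reason = str(exc)
        print(f"Could not write {drive_path}: {reason}; saving locally instead")
        os.makedirs(fallback_dir, exist_ok=True)
        local_path = os.path.join(fallback_dir, os.path.basename(drive_path))
        with open(local_path, "wb") as f:
            pickle.dump(obj, f)
        return local_path
```

Files saved this way still live only on the runtime, so they have to be downloaded or copied elsewhere before the session ends.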
looks like I'm getting the same problem
I'm getting the same problem. Waiting for a solution.
To test this problem more thoroughly, I wrote a very simple script that runs a calculation on the GPU and then writes a 'dummy_file' to my Google Drive every 10 minutes. Though I expected it to run for 24 hours, the dummy files were only written for 9.7 hours. The final error message was 'Transport endpoint is not connected'.
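For anyone who wants to run a similar probe, here is a rough sketch of the file-writing part only. It is not the script described above: the GPU calculation is left out, and `heartbeat`, the file naming, and the default of 144 writes (24 hours at 10-minute intervals) are assumptions for illustration.

```python
import os
import time
from datetime import datetime


def heartbeat(target_dir, interval_s=600, max_writes=144):
    """Write a timestamped dummy file every interval_s seconds.

    Returns (files_written, error); error is None if every write succeeded.
    """
    start = time.monotonic()
    for i in range(max_writes):
        path = os.path.join(target_dir, f"dummy_file_{i:04d}.txt")
        try:
            with open(path, "w") as f:
                f.write(datetime.now().isoformat())
        except OSError as exc:
            # Report how long the mount stayed writable before failing.
            hours = (time.monotonic() - start) / 3600
            print(f"Write {i} failed after {hours:.1f} h: {exc}")
            return i, exc
        time.sleep(interval_s)
    return max_writes, None
```

In Colab, `target_dir` would be a folder under the mounted Drive, for example somewhere below `/content/drive/MyDrive`.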
Getting this same issue - any updates?
Same problem. Waiting for a solution
Same Problem. Any updates?
Same problem. Any updates?
Same problem. Any updates?
Same problem, Any updates?
For me it was 4 hours. Pretty annoying bug for a deep learning environment.
Facing the same problem recently.
So far with only one particular, oldish notebook. So not sure if related to notebook or just coincidentally occurred now with that notebook. Was suspecting from the error messages that after a couple of hours Colab is for some reason unable to continue reading my files from Drive.
Same problem, Any updates?
same issue, this is absurd
I have been facing this problem for a few months now, and it has made me unable to finish my thesis on time! Please fix it ASAP! This problem is really annoying! What a waste of money paying for Google Colab Pro+.
Update: starting 30 June my colab connection is dropping after <4hours, prior to that it was <10hours. I am experiencing this with 2 different google accounts which exhibit the same behaviour.
still facing the cut off at 4 hours, might be time to give up with colab
This issue has been happening in less than 4 hrs now.
Same issue has happened with me. Though the running time was less than 10 hours in my case.
The same issue, the session can never run over 4 hours.
I have figured it out. You can import Google's API to download whatever data you want into your session. Drive can be mounted for only so long, but downloaded data remains as long as the session runs.
Same issue here. Cut off around 3-4 hrs running. Colab Pro subscription.
Same issue here after 4-5 hours I get disconnected in middle of epochs
Colab pro plus!!!
Still facing this issue after a month. Just adding that:
- It also happens with freshly created notebooks (not just "oldish" ones, as was my experience a month ago).
- Since devs are referencing https://research.google.com/colaboratory/faq.html#drive-timeout in other similar reported issues, I want to state that this same old Drive disconnection happened just now, after 4 hours, while I was accessing 2 folders in Drive during the session, each of which had 0-5 files.
I see! If I train my model for more than about 4 to 5 hours, Colab keeps giving the same error like yours.
same problem, please solve this!
Same problem, any solution??
Same issue, waiting for the solution.
I have the same error. I have the Pro version of Colab. After about 6 hours it reported this error: the fine-tuning iterations finished as designed (30 iterations), but the error appeared when the notebook was supposed to carry out a prediction task. That happened yesterday. Today I remounted Drive and reran the cell, and it ended with the same error, so the remount did not solve my problem.
Update on the second day, i.e. today: the error disappeared when I changed the number of iterations from 30 to 2 for the fine-tuning step. The run output is available on our project page, if anyone is interested in checking it: https://github.com/HECTA-UoM/M3/tree/main. I still do not know why the error happened yesterday with iteration=30, since I have the Pro version and the run took only 5 hours.
it happened after running it for only 4 hours !
I don't think this problem is related to which plan you are subscribed to or what exactly you are doing in Colab, as long as you are reading or writing files on Google Drive. I haven't been on the free plan for years and have been experiencing this problem with various notebooks all summer...
I'm having similar issue, not sure how to solve it
I'm having the same issue. I just reset my runtime and lost all variables.
seeing this error every single day
I got the same issue after somewhere between 3h30 and 4h, pretty frustrating for deep learning applications
I have the same problem on running a PyMC script and a separate PyTorch script. Both of them take >4 hrs to finish. Please fix this :( I have to finish my dissertation.
I have faced the same issue for weeks. My notebooks encounter this error after approximately 4 hours of usage, and I'm also a Colab+ user.
While I haven't yet discovered a definitive solution to this problem, I did find a workaround that allows me to complete all my training epochs in the end. The method is rather straightforward: I refrain from mounting Google Drive when running my notebooks. To be more specific, I copied all my scripts and files directly to the notebook's local drive, placing them in a directory like /myscripts. This is just a new directory located at the same level as /content. You could use another directory name. Just make sure not to mount Google Drive while training runs. This approach has proven effective for me.
Here I provide the statistics of my training info:
- GPU used: V100 High-RAM
- Training time (including setting up the notebook environment): 14 hours.
Can you post an image to explain in more detail, please? @simonep1052
You can manually upload files to the Colab runtime instead of using the Drive mount, by right-clicking in the Files tab or using the upload button at the top.
If you upload a bunch of files as a ZIP, you can run unzip /content/<filename>.zip in the terminal or !unzip /content/<filename>.zip in a cell after upload.
Downloading works in a similar fashion, by right-clicking on a file.
Or, if you want to zip a bunch of files first for easier download, run !zip /content/<filename>.zip /content/path/to/files/* and then just download the zip.
Uploading/downloading manually is very slow in my experience though. Painful if working with large files.
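If you'd rather stay in Python than shell out to zip/unzip, the standard library can do the same thing. A small sketch, where `zip_folder` and `unzip_to` are just illustrative helper names:

```python
import shutil


def zip_folder(folder, archive_base):
    """Create <archive_base>.zip from the contents of folder; returns the archive path."""
    return shutil.make_archive(archive_base, "zip", root_dir=folder)


def unzip_to(archive_path, dest_dir):
    """Extract an uploaded archive into dest_dir (created if it does not exist)."""
    shutil.unpack_archive(archive_path, dest_dir)
```

Keep the archive outside the folder being zipped, otherwise the archive may end up trying to include itself.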
I found a way to solve this issue. Apparently this is caused by a Google Drive bug. I copied all my files from Google Drive to Google Cloud Storage and then loaded my data from Google Cloud Storage into Colab.
Here is how you can copy your files from Google Drive to Google Cloud Storage (you can also just upload your files directly to Google Cloud Storage):
Here is how you can load your data from Google Cloud Storage into Colab:
hope this helps!!
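The commenter's own snippets are not shown in this thread. As a rough sketch of the loading step only, assuming the `google-cloud-storage` client library is installed and the runtime is already authenticated (in Colab, for example via `google.colab.auth.authenticate_user()`): the project ID, bucket name, prefix, and both helper names below are hypothetical.

```python
import os


def local_path_for(blob_name, prefix, dest_dir):
    """Map a bucket object name to a path under dest_dir, dropping the prefix."""
    relative = blob_name[len(prefix):].lstrip("/")
    return os.path.join(dest_dir, relative)


def download_prefix(project_id, bucket_name, prefix, dest_dir):
    """Download every object under prefix to the runtime's local disk."""
    # Imported lazily so the helper above works without the library installed.
    from google.cloud import storage

    client = storage.Client(project=project_id)
    for blob in client.list_blobs(bucket_name, prefix=prefix):
        if blob.name.endswith("/"):
            continue  # skip "folder" placeholder objects
        target = local_path_for(blob.name, prefix, dest_dir)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        blob.download_to_filename(target)
```

Something like `download_prefix("my-project", "my-bucket", "data/", "/content/data")` would then be called once at the start of the session, before training reads from `/content/data`.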
Just as @olaviinha and @perceptionisfun mentioned. Here I summarize the steps I use to complete my training, which should achieve the same result as their solutions.
1. Open my .ipynb notebook.
2. Mount Google Drive.
3. Use the !cp sourceFileName -R targetFolderPath command to copy the necessary files from my Google Drive to the runtime's local disk. In my specific case, I use !cp '/content/drive/MyDrive/Colab Notebooks/project1/' -R '/project1/'. The copied files then appear in that directory. At some point, it may be better to copy the files to the content folder to avoid further problems.
4. Unmount Google Drive:
from google.colab import drive
drive.flush_and_unmount()
5. Run the training, then save the results with the !cp sourceFileName -R targetFolderPath command again. (For this step, other people's methods might be better, but I have not tried them.)
*Please note that this approach is simply a means to complete my training without encountering the issue; it does not resolve the OSError: [Errno 107] problem.
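The steps above can be sketched in Python roughly as follows. This is only an illustration under assumptions: `stage_to_local`, `run_without_drive`, and `train_fn` are hypothetical names, and remounting Drive at the end to copy results back is my reading of the last step rather than something the steps spell out.

```python
import shutil


def stage_to_local(drive_dir, local_dir):
    """Copy a project folder to the runtime's local disk (like `cp -R`)."""
    shutil.copytree(drive_dir, local_dir, dirs_exist_ok=True)
    return local_dir


def run_without_drive(drive_project, local_project, train_fn):
    """Mount, copy locally, unmount, train, then remount and copy results back."""
    from google.colab import drive  # only importable inside Colab

    drive.mount("/content/drive")
    stage_to_local(drive_project, local_project)
    drive.flush_and_unmount()       # Drive stays unmounted during training

    train_fn(local_project)         # hypothetical callable that reads/writes locally

    drive.mount("/content/drive")   # assumption: remount only to save results
    shutil.copytree(local_project, drive_project, dirs_exist_ok=True)
```

With the paths from the steps above, `drive_project` would be '/content/drive/MyDrive/Colab Notebooks/project1/' and `local_project` would be '/project1/'.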
Hello, I am a Colab Pro+ user (Chrome). For a while now I have been hitting "OSError: [Errno 107] Transport endpoint is not connected" while training my model. I encounter it almost once a day. How can I solve it?
During this period the training keeps running but no new model progress is saved, and compute units are continuously deducted. If you don't pay attention, the whole day's compute units are gone.
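To limit how many compute units are lost while nothing is being saved, a training loop could check the mount between epochs and stop when it is gone. A hedged sketch: `drive_is_alive` and `stop_if_drive_lost` are hypothetical helpers, listing a directory is only a rough liveness check, and `google.colab.runtime.unassign()` releases the runtime, so anything left on its local disk is lost.

```python
import os


def drive_is_alive(mount_point="/content/drive/MyDrive"):
    """Return True if the mount point can still be listed."""
    try:
        os.listdir(mount_point)
        return True
    except OSError:
        # Includes [Errno 107] Transport endpoint is not connected.
        return False


def stop_if_drive_lost(mount_point="/content/drive/MyDrive"):
    """Release the Colab runtime when the Drive mount is gone.

    Returns True if the mount was found dead, False otherwise.
    """
    if drive_is_alive(mount_point):
        return False
    try:
        from google.colab import runtime  # only importable inside Colab
        runtime.unassign()
    except ImportError:
        pass  # outside Colab there is nothing to release
    return True
```

Calling `stop_if_drive_lost()` at the end of each epoch, after saving a local checkpoint, is one possible place for the check.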
Every day after the reset, this problem appears after about ten hours of training. I'm going crazy; each epoch takes at least 3~4 hours!!
This has been happening every day since 2023/2/17, when Colab updated something. The picture shows 8x% of the training progress, which took 3 hours (at least 3 hours of training time per day has evaporated).
Mount method:
from google.colab import drive
drive.mount('/content/drive')