googlecolab / colabtools

Python libraries for Google Colaboratory
Apache License 2.0

OSError: [Errno 5] Input/output error #510

Closed salmannauman6 closed 5 years ago

salmannauman6 commented 5 years ago

Bug report for Colab: http://colab.research.google.com/.

colaboratory-team commented 5 years ago

Does https://research.google.com/colaboratory/faq.html#drive-timeout help?

salmannauman6 commented 5 years ago

No, it does not. I have just one folder in my root folder which contains this one CSV file I am reading.

colaboratory-team commented 5 years ago

Thanks for confirming. Can you share a minimal self-contained repro notebook, either publicly or just with colaboratory-team@google.com ? (it would be helpful to see precisely how you're reading the data)

Does the problem go away if you first !cp path/to/data.csv local.csv and then read from the local path?
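For anyone trying that suggestion, a minimal Colab cell might look like the following; the Drive path and file name are placeholders, not the reporter's actual data:

```python
# Copy the CSV from the mounted Drive to the VM's local disk, then read the local copy.
# "/content/drive/My Drive/data.csv" is a hypothetical path; adjust it to your own file.
!cp "/content/drive/My Drive/data.csv" /content/data.csv

import pandas as pd
df = pd.read_csv("/content/data.csv")  # reads from local disk, not through the Drive mount
```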

ShHsLin commented 5 years ago

Similar issue here. I get

gzip: stdin: Input/output error
tar: Child returned status 1
tar: Error is not recoverable: exiting now

when running !tar -zxvf /content/gdrive/My\ Drive/data.tgz -C ./ > /dev/null with a large data.tgz file (~10 GB).

Syzygy2048 commented 5 years ago

I've no issue accessing 20 GB files.

What causes this issue for me is having many files in the folder (or parent folders) I'm accessing. Instead of path/to/data/data_x_of_1000files_in_folder.csv, I restructured it to path/to/data/20folders/data_x_of_50files_in_folder.csv.

Try making sure that there are no more than 50 files in the folder the file is in, or in any of the parent folders.

When I was only accessing a single file, or accessing files sequentially, simply retrying the load also worked, presumably because the content had already been loaded. This didn't work for random access.

Works for me, hope this helps you too.
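As a rough illustration of that restructuring, here is a small Colab-style sketch; the source path, the .csv pattern, and the 50-files-per-folder limit are assumptions taken from the comment above:

```python
import glob, os, shutil

# Hypothetical Drive folder that currently holds too many files in one directory.
src = "/content/drive/My Drive/path/to/data"
files = sorted(glob.glob(os.path.join(src, "*.csv")))

# Move the files into numbered subfolders holding at most 50 files each.
for i, path in enumerate(files):
    sub = os.path.join(src, f"part_{i // 50:03d}")
    os.makedirs(sub, exist_ok=True)
    shutil.move(path, os.path.join(sub, os.path.basename(path)))
```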

sgabor1 commented 5 years ago

Similarly, things were working without a problem until today; now the untar won't finish anymore with a large file:

tar: /content/gdrive/My Drive/bigfile.tar: Cannot read: Operation not permitted
tar: /content/gdrive/My Drive/bigfile.tar: Cannot read: Input/output error
tar: Too many errors, quitting
tar: Error is not recoverable: exiting now

It could successfully untar all the files (a 31 GB tar with 10000 files) multiple times even yesterday. The command I'm using: !tar -C features -xf /content/gdrive/My Drive/bigfile.tar

Trying to copy the whole tar into the runtime first also times out: cp: error reading '/content/gdrive/My Drive/bigfile.tar': Input/output error

furkanyildiz commented 5 years ago

I have the same problem. I cannot read my files on Drive. It sometimes works, but mostly it gives an OSError:

OSError: Can't read data (file read failed: time = Mon May 20 00:34:07 2019
, filename = '/content/drive/My Drive/train/trainX_file1', file descriptor = 83, errno = 5, error message = 'Input/output error', buf = 0xc71d3864, total read size = 42145, bytes this sub-read = 42145, bytes actually read = 18446744073709551615, offset = 119840768)

Creating a file also gives an OSError:

OSError: Unable to create file (unable to open file: name = '/content/drive/My Drive/train/model.hdf5', errno = 5, error message = 'Input/output error', flags = 13, o_flags = 242)

"https://research.google.com/colaboratory/faq.html#drive-timeout" does not helped me.

kallianisawesome commented 5 years ago

I have the same problem too. I can't load my data, which is not very large. I can load it with num_workers = 1 (using the PyTorch DataLoader), but I can't get my files. I have about 40000 files. I have tried io.imread and cv2.imread; they both work fine on my own computer, and I am sure my files are in the right place. I couldn't figure it out for days, so I guess it's not my problem. I will try to build the image matrices on my own computer and upload them in CSV format. If this works out, I will report back.

The link below offers a method; my files are already in subfolders, but maybe it can help you. https://stackoverflow.com/questions/54973331/input-output-error-while-using-google-colab-with-google-drive

colaboratory-team commented 5 years ago

Duplicate of #559

yuuSiVo commented 5 years ago

I have the same issue too. I made a voice conversion program in Google Colaboratory. Yesterday it worked, but it has not been working since this morning (Japan time).

abiantorres commented 5 years ago

I have the same issue. I can't access a 42 GB HDF5 file. At some point in my processing pipeline an OSError comes up, like the one @furkanyildiz posted. I access each element sequentially and then immediately store it in another .tfrecords file.

glenn-jocher commented 5 years ago

I have the same problem. This issue should not be closed. When copying a 20GB file from a mounted Google Drive folder:

!cp 'drive/My Drive/cloud/data/coco_colab2.zip' . && unzip -q coco_colab2.zip
cp: error reading 'drive/My Drive/cloud/data/coco_colab2.zip': Input/output error

gilgarad commented 5 years ago

I have the same problem. At first I thought the file was corrupted, but when I downloaded and opened it on my local computer it worked fine. Then I uploaded it to my brother's account and it worked there as well, so it is not a problem with the file. I can load other files, just not that CSV file.

AlanCh3n commented 4 years ago

Same problem. It was working perfectly and then suddenly stopped, with no changes made on my side.

invincible-akshay commented 4 years ago

> Thanks for confirming. Can you share a minimal self-contained repro notebook, either publicly or just with colaboratory-team@google.com ? (it would be helpful to see precisely how you're reading the data)
>
> Does the problem go away if you first !cp path/to/data.csv local.csv and then read from the local path?

I tried this and I'm getting: cp: error reading '/content/drive/My Drive/DSF/file_name.csv': Input/output error

deqncho2 commented 4 years ago

Same problem here as well, reopen the issue.

invincible-akshay commented 4 years ago

I made an observation but haven't tested it: it seems that large files on Google Drive have daily download limits. Could it be that reading from Colab also counts as a download? If so, that would explain why it suddenly stops working.

deqncho2 commented 4 years ago

I have no issue downloading to a local machine.

glenn-jocher commented 4 years ago

Same problem. I can download to a local machine fine. Downloading to Colab from Google Drive is a nightmare, it takes 5 or 6 tries before it completes successfully.

deqncho2 commented 4 years ago

I think it's a quota problem actually. I can't actually download to a local machine.

invincible-akshay commented 4 years ago

@deqncho2 you can test by creating a copy of the file in your Drive and trying to read the new file. That had worked for me, hence I had not investigated further, but I came across this later: https://support.google.com/drive/thread/2035857?hl=en

MittalShruti commented 4 years ago

Same issue. Trying to read a folder with >40k files from gdrive to colab

invincible-akshay commented 4 years ago

@MittalShruti , maybe you could try this- https://github.com/googlecolab/colabtools/issues/510#issuecomment-552294940 ?

Or check this thread for details: https://support.google.com/drive/thread/2035857?hl=en

prashants975 commented 4 years ago

Similar issue. Reading the CSV file through pandas was working fine, and then suddenly later that day I couldn't get it into RAM. First I got this error: ParserError: Error tokenizing data. C error: Calling read(nbytes) on source failed. Try engine='python'. Then, after using engine='python', I got: OSError: [Errno 5] Input/output error

rajubjc commented 4 years ago

> Thanks for confirming. Can you share a minimal self-contained repro notebook, either publicly or just with colaboratory-team@google.com ? (it would be helpful to see precisely how you're reading the data)
>
> Does the problem go away if you first !cp path/to/data.csv local.csv and then read from the local path?

No, it didn't work.

talhaanwarch commented 4 years ago

Has anyone got a solution? I moved the files into subfolders, so now each subfolder has one file, and I am still getting this error.

peterpalos commented 4 years ago

Same problem in Colab, reading a folder with >100k files from gdrive.

talhaanwarch commented 4 years ago

I noticed that there is some sort of limitation by Google. If we access data from Drive many times, this issue occurs. Take a break of 24 hours and the issue goes away.

Nandutu commented 4 years ago

I am facing the same issue here. Any solution?

prbsh9 commented 4 years ago

Same issue. I have 4 .npy files, 2 around 10 GB and 2 around 6 GB, loaded with ss = np.load('train_abnormal.npy'). I can't open any of those 4 files. Smaller files can be opened, though.

JonathanSum commented 4 years ago

I have the same problem too, so I am sure this will affect colab pro too.

cpietsch commented 4 years ago

I have the same issue when copying a folder of images (around 2000 jpgs)

cpietsch commented 4 years ago

Found a fix for the error when copying a lot of files. Use:

%cp -av fromfolder tofolder

Works for me.
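For what it's worth, those flags are plain cp behaviour rather than anything Colab-specific: -a copies recursively while preserving timestamps and permissions, and -v prints each file as it is copied, which makes it easier to see where a transfer stalls.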

kechan commented 4 years ago

It just happened to me, apparently after doing various file operations (cp, tar, etc.) that caused a lot of I/O between my gdrive and the local Colab VM. After this, I got random Python OSError / Input/output errors, sometimes even when importing a Python module. At other times, my Colab notebook crashed entirely (during a read of a >1 GB feather file) and the log showed nothing meaningful.

I hope this is just a case of a gdrive daily quota issue, as someone mentioned. Has anyone confirmed this? I will wait for a day to pass and retry.

Chin0p commented 4 years ago

Today it happened to me as well, when trying to unrar a 30 GB file in Colab. I'm getting an input/output error: Read error in the file.

chyan0411 commented 4 years ago

I also got the same error, 'OSError: [Errno 5] Input/output error', when trying to import a 14 GB file from gdrive. The error occurred very suddenly: just a few minutes earlier I had done the same operation and everything was fine. When I tried importing a much smaller file from the same folder, it worked normally. It seems like Colab has some limit on importing large files? Why is this issue closed? By the way, I have subscribed to Colab Pro!


sowmen commented 4 years ago

I am also getting the Input/Output error. I downloaded a dataset from kaggle into my drive. It has 50 zip files each having 2000 images. I successfully extracted 2 zip files but then the error started. Any solutions to this?

dipam7 commented 4 years ago

Same error, I am trying to load 194082 image files from the drive. It worked once, the first time that I extracted the data and tried to load it. It hasn't worked ever since. Even after I updated to colab pro it doesn't work. Frustrating.

dreamerriver commented 4 years ago

Same here as well. How can I deal with it?

shadabtughlaq commented 4 years ago

Got the same error:

Could not read file
[Errno 5] Input/output error: .........

It was working fine until some time back. I had loaded the files from a mounted drive.

kechan commented 4 years ago

For anyone having this problem with Colab + gdrive, the most likely cause is excessive I/O due to large files, or merely running "ls -l" on a folder with too many files. The latter case is more harmless (as long as you don't do it again; I find using glob much better behaved). In the former case you have most likely violated some Google quota. In my experience, it is either size (one large file, or an extremely large number of small files whose total is big) or throughput (i.e. size/time).
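For illustration, enumerating files with glob from Python instead of shelling out to ls -l might look like this; the folder path is a placeholder:

```python
import glob

# Enumerate files through the Drive mount; unlike "ls -l", this does not
# request metadata for every entry, only the directory listing itself.
paths = glob.glob("/content/drive/My Drive/train/*")
print(len(paths))
```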

The limit for me for a single file seemed to be around 10 GB, but mileage seemed to vary. So don't copy huge files back and forth between gdrive and Colab. Note that it counts as an "upload" if you access a gdrive file in Colab via a mount.

The best solution is to use the Linux "split" command to break your huge file into 500 MB to 1 GB chunks, and then upload them one by one to gdrive. When you need the file in Colab, copy the fragments onto your Colab VM's local disk and then run a "cat ....." to reassemble them. This way, no giant file is ever moved from gdrive. The downside is that you have to repeat this for every new Colab session.

It is a pain, but this whole setup isn't designed for huge datasets. Note that if you violate the quota and hit the I/O error, you have to wait about a day for it to go away. I would try not to do anything at all to your gdrive for at least 24 hours so it can recover.

Hope this helps.

gaceladri commented 4 years ago

@kechan The answer would be perfect if you could provide an example of how to do this split with the command line. Thanks anyway for your answer.

homerdiaz commented 4 years ago

I have the same problem too. OSError: [Errno 5] Input/output error: '/content/drive/My Drive/COVID-Net/rsna-pneumonia-detection-challenge/stage_2_train_images/003d8fa0-6bf1-40ed-b54c-ac657f8495c5.dcm'

Will I get the same error with Colab Pro?

dipam7 commented 4 years ago

I tried using Colab Pro and the error does not go away. I then reorganized my data from all the images in one folder to 50 images per folder. Now it runs, however it does not return all the image files: only 165034 out of 194082. Maybe I need to keep even fewer images per folder. This is really frustrating; I like Paperspace better now.

salmanhiro commented 4 years ago

Same issue here. It worked yesterday, but not now, with the same code. How could that be?

Update: it works after copying the files into a different directory and importing the copies.

Zappytoes commented 4 years ago

This is an unacceptable flaw in Colab and, in my view, completely delegitimizes it as a platform for machine learning. The ability of a computing platform to handle large amounts of data is absolutely essential in this field, and I think it's just plain crooked of Google to tell people they have a product designed for deep learning and ML. I paid for a Pro account and have tried every workaround, and "OSError: [Errno 5] Input/output error" will always show up again eventually and stop you dead in your tracks. This is not just a "bug"; this is the reason you should not use Colab if you have other options.

dipam7 commented 4 years ago

I feel the same. Even after Colab Pro, I had to split my data into various folders and it would only partially work. I was so frustrated because I couldn't focus on the project; all my time went into trying to make Colab work.

gaceladri commented 4 years ago

It's free mate. Take a breath.

kechan commented 4 years ago

@Zappytoes That error is much more likely to do with the Google Drive quota limit than with Colab. I have used Colab for almost 2 years and I have found it an excellent platform for experimenting with DL on smaller datasets (by modern standards). You are right that you shouldn't use Colab if you have other options (i.e. lots of $$). If you routinely work with >10 GB, you should use GCP or AWS and pay the fair price. Pro is only $10? It is the best deal around for the sort of GPU and TPU you get.

kechan commented 4 years ago

> @kechan The answer would be perfect if you could provide an example of how to do this split with the command line. Thanks anyway for your answer.

Using the Linux "split" command to shard a huge file is an old trick; you can google around and find far better explanations than I can give. Shipping around big files has been an issue for as long as the internet has existed; it is only what counts as "big" that has changed.
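To make that concrete, here is a rough sketch of the split-and-reassemble workflow described in the earlier comment; all file names, the chunk size, and the Drive folder are assumptions, not part of the original answer:

```python
# On your local machine (plain shell, before uploading to Drive):
#   split -b 1G bigfile.tar bigfile.part_
# Then upload the resulting bigfile.part_* chunks to a Drive folder, e.g. "My Drive/parts".

# In Colab: copy the chunks onto the VM's local disk, reassemble, and extract locally.
!mkdir -p /content/parts /content/features
!cp "/content/drive/My Drive/parts/bigfile.part_"* /content/parts/
!cat /content/parts/bigfile.part_* > /content/bigfile.tar
!tar -xf /content/bigfile.tar -C /content/features
```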