lithops-cloud / lithops

A multi-cloud framework for big data analytics and embarrassingly parallel jobs that provides a universal API for building parallel applications in the cloud ☁️🚀
http://lithops.cloud

FSTimeoutError #1130

Closed · oanamoc closed 1 year ago

oanamoc commented 1 year ago

I'm trying to run code that counts the words in a file through lithops. My issue is this error:

```
Exception has occurred: FSTimeoutError
Read timeout on endpoint URL: "https://training-with-a-bucket.s3.eu-west-3.amazonaws.com/notsobigtextfile_small.txt"
aiohttp.client_exceptions.ServerTimeoutError: Timeout on reading data from socket

During handling of the above exception, another exception occurred:

aiobotocore.response.AioReadTimeoutError: Read timeout on endpoint URL: "https://training-with-a-bucket.s3.eu-west-3.amazonaws.com/notsobigtextfile_small.txt"

The above exception was the direct cause of the following exception:

  File "/Users/oanamoc/Desktop/Work/lithops/task1 copy.py", line 35, in
    line = fhand.readline()
fsspec.exceptions.FSTimeoutError
```

I will add the code here, but I don't think the code is the problem. It may be related to some settings in AWS S3. In the end I tried making the bucket public and running the code anonymously, to see if maybe the issue was that it didn't see my credentials file, but the error stayed the same. I don't know how to proceed.

code.txt

JosepSampe commented 1 year ago

In this case it doesn't seem to be a lithops issue.

My recommendation, if you want to process files stored in the cloud (s3), is to use the Futures API instead of the Multiprocessing API.

The approach you are currently using is not good because you are downloading a file that is in S3 to your computer, splitting it into chunks locally, and then uploading the data again to the cloud for processing. This is time-consuming and expensive, since you are moving all the data twice.

The Futures API includes an automatic data processing feature: given a txt file in S3, it will automatically spawn the appropriate number of functions to process it, without the need to download the file to your local computer. For splitting the file you can use the obj_chunk_number or obj_chunk_size parameters of the fexec.map() call. The file will be automatically split (according to obj_chunk_number or obj_chunk_size) in real time by the system, so that each function automatically receives a portion of the file.
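For illustration, here is a minimal sketch of that pattern for a word count, assuming the bucket and file names from your error message (the `count_words` function name and `obj_chunk_number=4` are just example choices):

```python
import lithops

def count_words(obj):
    # The parameter must be named 'obj'; each invocation receives
    # only its own chunk of the file via obj.data_stream
    data = obj.data_stream.read().decode('utf-8')
    return len(data.split())

fexec = lithops.FunctionExecutor()
# obj_chunk_number=4 tells Lithops to split the object into 4 chunks
# and spawn one function per chunk; no local download or re-upload
fexec.map(count_words, 's3://training-with-a-bucket/notsobigtextfile_small.txt',
          obj_chunk_number=4)
print(sum(fexec.get_result()))  # total word count across all chunks
```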

You can see a word count example here: https://github.com/lithops-cloud/lithops/blob/master/examples/map_cos_prefix.py
Check the docs about data processing here: https://lithops-cloud.github.io/docs/source/data_processing.html

oanamoc commented 1 year ago

Thank you, now it works.