minio / minio-py

MinIO Client SDK for Python
https://docs.min.io/docs/python-client-quickstart-guide.html
Apache License 2.0

Broken pipe #598

Closed maccradar closed 6 years ago

maccradar commented 6 years ago

Maybe related to https://github.com/requests/requests/issues/2422. Sometimes, but quite frequently, we receive the following response after a Minio.fput_object call:

WARNING Retrying (Retry(total=4, connect=None, read=None, redirect=None)) after connection broken by 'ProtocolError('Connection aborted.', BrokenPipeError(32, 'Broken pipe'))': /<bucket_id>/<file_id>?partNumber=1&uploadId=0fa52676-2bc7-45a1-a9fc-d878132b53e4

It then attempts to open a new HTTP connection but hangs and the only way to proceed is to kill the python process and restart it, after which it usually uploads fine.

Have any of you experienced this and is there a way to capture this error better so we can retry explicitly?

Here is the code snippet which causes this issue:

def upload_to_bucket(self, bucket, file):
    logging.info("Sending file %s.", file)
    path = self.path_constructor.to_file_in_dataset(bucket, file)
    try:
        etag = self.minio_client.fput_object(bucket, file, path)
        logging.info("File %s uploaded to MinIO. Received etag: %s", file, etag)
        return True
    except ResponseError as err:
        logging.error("Response error when uploading file (skipping): %s", err)
        return False
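To answer my own question about retrying explicitly: the except clause above only catches ResponseError, so connection-level failures from urllib3 never reach it. One sketch of an application-level retry, assuming the urllib3 exception types are what actually escape (the with_retries helper name is illustrative, not part of minio-py):

```python
import logging
import time

from urllib3.exceptions import MaxRetryError, ProtocolError


def with_retries(fn, attempts=3, backoff=1.0,
                 exceptions=(ProtocolError, MaxRetryError)):
    """Call fn(), retrying on the given exceptions with linear backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except exceptions as err:
            logging.warning("Attempt %d/%d failed: %s", attempt, attempts, err)
            if attempt == attempts:
                raise
            time.sleep(backoff * attempt)

# Usage (hypothetical client and names):
# etag = with_retries(lambda: minio_client.fput_object(bucket, file, path))
```

This does not fix the underlying hang, but it at least keeps the retry decision in our own code instead of deep inside urllib3.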

The warning message I get has to come from within Minio or the underlying urllib3, because our logger does not register a successful or a failed upload; the process hangs somewhere inside the fput_object call.

maccradar commented 6 years ago

From minio/api.py line 163

self._http = urllib3.PoolManager(
    timeout=urllib3.Timeout.DEFAULT_TIMEOUT,
    cert_reqs='CERT_REQUIRED',
    ca_certs=certifi.where(),
    retries=urllib3.Retry(
        total=5,
        backoff_factor=0.2,
        status_forcelist=[500, 502, 503, 504]
    )
)

And from urllib3's ConnectionPool line 725

if not conn:
    # Try again
    log.warning("Retrying (%r) after connection "
                "broken by '%r': %s", retries, err, url)
    return self.urlopen(method, url, body, headers, retries,
                        redirect, assert_same_host,
                        timeout=timeout, pool_timeout=pool_timeout,
                        release_conn=release_conn, body_pos=body_pos,
                        **response_kw)

It seems the timeout causes the error, but the retry fails (hangs).
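Since those timeout and retry values are hard-coded in the PoolManager above, one workaround is to build the urllib3 pool ourselves with an explicit read timeout and a smaller retry budget, so a dead connection fails fast instead of hanging. This assumes a minio-py version whose constructor accepts an http_client argument; the concrete values below are illustrative:

```python
import certifi
import urllib3

# Explicit timeouts and a small retry budget (illustrative values).
http_client = urllib3.PoolManager(
    timeout=urllib3.Timeout(connect=10.0, read=300.0),
    cert_reqs='CERT_REQUIRED',
    ca_certs=certifi.where(),
    retries=urllib3.Retry(
        total=2,
        backoff_factor=0.5,
        status_forcelist=[500, 502, 503, 504],
    ),
)

# Then, assuming the constructor takes http_client (hypothetical endpoint):
# client = Minio('minio.example.com:9000', access_key='...',
#                secret_key='...', secure=True, http_client=http_client)
```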

kannappanr commented 6 years ago

It could be caused by urllib3. Can you post more information on the process that is causing this issue? Is there a pattern to this? Is it happening only with big files? What version of Python are you using? Can you post more information about the s3 server you are connecting to? We will try to reproduce the issue locally.

maccradar commented 6 years ago

Sorry for the late reply. We are running an upload service for our lab that checks for new, fairly large datasets (up to 20 GB) and uploads them to a MinIO server running in a Docker Swarm at our data center. It seems to happen with smaller datasets (a few hundred MB) as well. The uploader uses Python 3.5, and it seems the MinIO Python driver we installed with pip3 is not the latest version I find here, as the Minio client's arguments include only a timeout and not the full HttpClient object. Maybe upgrading the driver would already help?

harshavardhana commented 6 years ago

Can you let us know how it behaves with the latest version?

maccradar commented 6 years ago

Upgraded with pip3 install --upgrade minio, still running Python 3.5. Now the Python process silently dies after trying to send a 29 GB file. First it almost blocks the server by consuming 90+% of resources, and after a while the Python process, which was started via crontab, dies without any notification...

I assume there is no 4 GB file-size limit in MinIO or its Docker image that I am not aware of?

maccradar commented 6 years ago

Server side log from the time of the Broken Pipe:

bme-fiber-minio.2.9fk65oiglp6q@node4 | time="2017-12-06T11:30:02Z" level=error msg="Unable to create object part." cause="read tcp 176.20.0.22:9000->176.20.0.31:48786: i/o timeout" source="[object-handlers.go:883:objectAPIHandlers.PutObjectPartHandler()]" stack="fs-v1-helpers.go:303:fsCreateFile fs-v1-multipart.go:543:fsObjects.PutObjectPart <autogenerated>:1:(*fsObjects).PutObjectPart object-handlers.go:881:objectAPIHandlers.PutObjectPartHandler api-router.go:48:(objectAPIHandlers).PutObjectPartHandler-fm"

maccradar commented 6 years ago

My bad: the 29 GB transfer attempt exhausts our client's swap...:

cat /var/log/kern.log | grep python3
Dec 15 17:01:41 lbtf-mediator kernel: [4397030.116840] [ 1232]  1000  1232  2824531  1832646    7537      21   919631             0 python3
Dec 15 17:01:41 lbtf-mediator kernel: [4397030.116863] Out of memory: Kill process 1232 (python3) score 678 or sacrifice child
Dec 15 17:01:41 lbtf-mediator kernel: [4397030.116874] Killed process 1232 (python3) total-vm:11298124kB, anon-rss:7330584kB, file-rss:0kB, shmem-rss:0kB
Dec 15 17:01:41 lbtf-mediator kernel: [4397030.649052] oom_reaper: reaped process 1232 (python3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

After increasing the swap, the process doesn't die but can't finish the 29GB upload. In my log files I see thousands of WARNING Connection pool is full, discarding connection: <minio hostname>

Could this be due to too many worker threads trying to upload the parts?
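For reference, the number of parts scales directly with file size, and with it the connection churn and the memory held by in-flight part buffers. A quick sketch of the arithmetic (the multipart_plan helper is illustrative; newer minio-py releases also expose part_size and num_parallel_uploads arguments on fput_object to tune this directly, though I have not confirmed they exist in the version we run):

```python
import math


def multipart_plan(total_size, part_size=5 * 1024 * 1024):
    """Number of parts an S3-style multipart upload issues at a given part size."""
    return math.ceil(total_size / part_size)

# A 29 GiB file at the 5 MiB minimum part size means thousands of parts,
# while 64 MiB parts cut that to a few hundred:
small_parts = multipart_plan(29 * 1024**3)                # 5940 parts
large_parts = multipart_plan(29 * 1024**3, 64 * 1024**2)  # 464 parts
```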

maccradar commented 6 years ago

After almost 7 hours the upload completed, leaving around 60 GB of orphaned data in /.minio.sys/tmp. The next upload in line caused quite a few BrokenPipeErrors but succeeded in 8 minutes for 4 GB. There is quite a discrepancy between the 29 GB transfer taking 7 hours and the 4 GB transfer taking only 8 minutes...

deekoder commented 6 years ago

This is not reproducible for us, so closing this as working as intended. Please reopen if you think this is still occurring.