singularityhub / sregistry

server for storage and management of singularity images
https://singularityhub.github.io/sregistry
Mozilla Public License 2.0

Unable to push images > 545MB with https #287

Closed n8wood-brown closed 4 years ago

n8wood-brown commented 4 years ago

Hi, I'm having trouble pushing larger images through the GUI, sregistry-cli, and singularity when https is enabled. Similar issues have been resolved in the past, and I've gone through those threads multiple times to try the different suggestions, but without success. I'm not sure if this is a bug or a misconfiguration on my part. When I change SREGISTRY_REGISTRY_BASE to use http, everything works.

singularity version: 3.5.3
sregistry version: 1.1.22
sregistry-cli version: 0.2.35

Success with an image < 545MB

$ sregistry pull n8/hello:latest
[client|registry] [database|sqlite:////home/n/.singularity/sregistry.db]
Progress |===================================| 100.0% 
[container][new] n8/hello:latest@36e003395e3cef288da37ee1826ec2b89e15b3ac7fa9b70be8a807f69562881b
Success! /home/n/.singularity/shub/n8/hello/latest@36e003395e3cef288da37ee1826ec2b89e15b3ac7fa9b70be8a807f69562881b.sif

$ sregistry push --name n8/hello:200319 .singularity/shub/n8/hello/latest@36e003395e3cef288da37ee1826ec2b89e15b3ac7fa9b70be8a807f69562881b.sif 
[client|registry] [database|sqlite:////home/n/.singularity/sregistry.db]
[1. Collection return status 200 OK]
[================================] 59/59 MB - 00:00:00
[Return status 200 Upload Complete]

Failure with a large image

The upload dies when it reaches around 545MB

@ganymede:~$ sregistry pull n8/fmri4:latest
[client|registry] [database|sqlite:////home/n/.singularity/sregistry.db]
Progress |===================================| 100.0% 
[container][new] n8/fmri4:latest@92278b7c046c0acf0952b3e1663b8abb819c260e8a96705bad90833d87ca0874
Success! /home/n/.singularity/shub/n8/fmri4/latest@92278b7c046c0acf0952b3e1663b8abb819c260e8a96705bad90833d87ca0874.sif

$ sregistry push --name n8/fmri:200319 /home/n/.singularity/shub/n8/fmri4/latest@92278b7c046c0acf0952b3e1663b8abb819c260e8a96705bad90833d87ca0874.sif
[client|registry] [database|sqlite:////home/n/.singularity/sregistry.db]
[1. Collection return status 200 OK]
Traceback (most recent call last): 545/4639 MB - 00:21:52
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 665, in urlopen
    httplib_response = self._make_request(
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 387, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/lib/python3.8/http/client.py", line 1230, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib/python3.8/http/client.py", line 1276, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.8/http/client.py", line 1225, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.8/http/client.py", line 1043, in _send_output
    self.send(chunk)
  File "/usr/lib/python3.8/http/client.py", line 965, in send
    self.sock.sendall(data)
  File "/usr/lib/python3.8/ssl.py", line 1204, in sendall
    v = self.send(byte_view[count:])
  File "/usr/lib/python3.8/ssl.py", line 1173, in send
    return self._sslobj.write(data)
ConnectionResetError: [Errno 104] Connection reset by peer
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/requests/adapters.py", line 439, in send
    resp = conn.urlopen(
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 719, in urlopen
    retries = retries.increment(
  File "/usr/lib/python3/dist-packages/urllib3/util/retry.py", line 400, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/lib/python3/dist-packages/six.py", line 702, in reraise
    raise value.with_traceback(tb)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 665, in urlopen
    httplib_response = self._make_request(
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 387, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/lib/python3.8/http/client.py", line 1230, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib/python3.8/http/client.py", line 1276, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.8/http/client.py", line 1225, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.8/http/client.py", line 1043, in _send_output
    self.send(chunk)
  File "/usr/lib/python3.8/http/client.py", line 965, in send
    self.sock.sendall(data)
  File "/usr/lib/python3.8/ssl.py", line 1204, in sendall
    v = self.send(byte_view[count:])
  File "/usr/lib/python3.8/ssl.py", line 1173, in send
    return self._sslobj.write(data)
urllib3.exceptions.ProtocolError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/home/n/.local/bin/sregistry", line 10, in <module>
    sys.exit(main())
  File "/home/n/.local/lib/python3.8/site-packages/sregistry/client/__init__.py", line 391, in main
    main(args=args, parser=parser, extra=extra)
  File "/home/n/.local/lib/python3.8/site-packages/sregistry/client/push.py", line 33, in main
    cli.push(path=image, name=args.name, tag=args.tag)
  File "/home/n/.local/lib/python3.8/site-packages/sregistry/main/registry/push.py", line 96, in push
    r = requests.post(url, data=monitor, headers=headers)
  File "/usr/lib/python3/dist-packages/requests/api.py", line 116, in post
    return request('post', url, data=data, json=json, **kwargs)
  File "/usr/lib/python3/dist-packages/requests/api.py", line 60, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/lib/python3/dist-packages/requests/sessions.py", line 533, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/lib/python3/dist-packages/requests/sessions.py", line 646, in send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/python3/dist-packages/requests/adapters.py", line 498, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
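
The traceback shows the image being sent as the body of a single `requests.post` call (`data=monitor`), so the reset arrives partway through that body. A minimal sketch of how a client could record how much of the file had been consumed when the connection dropped, assuming a plain file-like wrapper (the `CountingReader` class is hypothetical and is not part of sregistry-cli):

```python
import io


class CountingReader:
    """Hypothetical file-like wrapper that counts bytes handed to the caller."""

    def __init__(self, fileobj):
        self.fileobj = fileobj
        self.bytes_read = 0

    def read(self, size=-1):
        chunk = self.fileobj.read(size)
        self.bytes_read += len(chunk)
        return chunk


# Stand-in for an image file; a real push would wrap open(path, "rb").
reader = CountingReader(io.BytesIO(b"x" * 1000))
reader.read(300)
reader.read(300)
print(reader.bytes_read)
```

The count at the moment the `ConnectionError` is raised only approximates what the server received, since it includes bytes that were read and buffered but not yet delivered.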

nginx logs

- - [19/Mar/2020:20:17:27 +0000] "POST /api/upload/chunked_upload HTTP/1.1" 200 10 "-" "python-requests/2.22.0" "-"
- - [19/Mar/2020:20:20:22 +0000] "POST /upload HTTP/1.1" 400 0 "-" "python-requests/2.22.0" "-"

uwsgi logs

200319 fmri n8 SREGISTRY-HMAC-SHA256 Credential=push/<removed>,Signature=7346e058a5e48b271d2ff33996c244c0b090cdaa225a57513dc01f5ea2205932 {'collection': 'n8', 'tag': '200319', 'name': 'fmri'}
push|n8|20200319T20Z|fmri|200319|
[pid: 40|app: 0|req: 13/35] () {42 vars in 720 bytes} [Thu Mar 19 15:17:26 2020] POST /api/upload/chunked_upload => generated 10 bytes in 45 msecs (HTTP/1.1 200) 3 headers in 100 bytes (1 switches on core 3)

config.py

DOMAIN_NAME = 'https://mydomain.org'
DOMAIN_NAME_HTTP = 'https://mydomain.org'
DOMAIN_NAKED = DOMAIN_NAME_HTTP.replace("http://", "")
DATA_UPLOAD_MAX_MEMORY_SIZE = None
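
One detail in the settings above can be checked in plain Python: `str.replace("http://", "")` leaves an `https://` URL untouched, so with these values `DOMAIN_NAKED` still carries the scheme. Whether this has any bearing on the upload failure is not established in this thread. A small sketch, where the `split`-based variant is only a hypothetical alternative and not taken from the sregistry defaults:

```python
DOMAIN_NAME_HTTP = "https://mydomain.org"

# str.replace only removes the exact substring "http://", which an
# https:// URL does not contain, so the value comes back unchanged.
naked_as_configured = DOMAIN_NAME_HTTP.replace("http://", "")

# Hypothetical alternative: drop whichever scheme is present.
naked_without_scheme = DOMAIN_NAME_HTTP.split("://", 1)[-1]

print(naked_as_configured)
print(naked_without_scheme)
```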

nginx.conf

server {
  listen                *:80;
  listen              443 ssl;
  server_name         mydomain.org;
  ssl_certificate     /etc/ssl/cert.pem;
  ssl_certificate_key /etc/ssl/cert.key;
  client_max_body_size 10024M;
  client_body_buffer_size 10024M;
  client_body_timeout 120;
  add_header X-Clacks-Overhead "GNU Terry Pratchett";
  add_header X-Clacks-Overhead "GNU Terry Pratchet";
  add_header Access-Control-Allow-Origin *;
  add_header 'Access-Control-Allow-Credentials' 'true';
  add_header 'Access-Control-Allow-Methods' 'GET, POST, OPTIONS';
  add_header 'Access-Control-Allow-Headers' 'Authorization,DNT,X-CustomHeader,Keep-Alive,User-Agent,X-Requested-With,If-Modified-Since,Cache-Control,Content-Type';
  location /images {
    alias /var/www/images;
  }
  location ~* \.(php|aspx|myadmin|asp)$ {
    deny all;
  }
  location / {
    include /etc/nginx/uwsgi_params.par;
    uwsgi_pass uwsgi:3031;
    uwsgi_max_temp_file_size 10024m;
    # troubleshooting large upload issue
    uwsgi_read_timeout 36000;
    client_max_body_size 5g;
  }
  location /static {
    alias /var/www/static;
  }
  location /upload {
        upload_pass   /api/uploads/complete/;
        upload_store /var/www/images/_upload 1;        
        upload_store_access user:rw group:rw all:rw;
        upload_set_form_field $upload_field_name.name "$upload_file_name";
        upload_set_form_field $upload_field_name.content_type "$upload_content_type";
        upload_set_form_field $upload_field_name.path "$upload_tmp_path";
        upload_aggregate_form_field "$upload_field_name.md5" "$upload_file_md5";
        upload_aggregate_form_field "$upload_field_name.size" "$upload_file_size";
        upload_pass_form_field "^submit$|^description$";
        upload_pass_form_field "^SREGISTRY_EVENT$";
        upload_pass_form_field "^collection$";
        upload_pass_form_field "^name$";
        upload_pass_form_field "^tag$";
        upload_cleanup 400-599;
        # troubleshooting large upload issue
        uwsgi_read_timeout 36000;
        client_max_body_size 5g;
    }
}

vsoch commented 4 years ago

I'm eventually deprecating the sregistry tool in favor of the singularity client for pushing images, so could you please post the full logs for pushing with singularity to the library:// endpoint? Thanks!

n8wood-brown commented 4 years ago

$ singularity push -U fmri.sif library://<user>/n8/fmri:200318b
WARNING: Skipping container verifying
 546.31 MiB / 4.53 GiB [=======>-----------------------------------------------------------]  11.78% 3.24 MiB/s 2m48s
FATAL:   Unable to push image to library: error uploading image: Put "https://mydomain.org/v2/push/imagefile/20/07752669-c23a-4313-b0fb-cbfc69a37ab3": write tcp <ip>:52766-><ip>:443: write: connection reset by peer

nginx logs

<ip> - - [19/Mar/2020:21:10:24 +0000] "GET /assets/config/config.prod.json HTTP/1.1" 200 505 "-" "Singularity/3.5.3 (Linux amd64) Go/1.14" "-"
<ip> - - [19/Mar/2020:21:10:24 +0000] "GET /assets/config/config.prod.json HTTP/1.1" 200 505 "-" "Singularity/3.5.3 (Linux amd64) Go/1.14" "-"
<ip> - - [19/Mar/2020:21:10:43 +0000] "GET /v1/entities/<user> HTTP/1.1" 200 311 "-" "Go-http-client/1.1" "-"
<ip> - - [19/Mar/2020:21:10:43 +0000] "GET /v1/collections/<user>/n8 HTTP/1.1" 200 396 "-" "Go-http-client/1.1" "-"
<ip> - - [19/Mar/2020:21:10:43 +0000] "GET /v1/containers/<user>/n8/fmri HTTP/1.1" 200 667 "-" "Go-http-client/1.1" "-"
<ip> - - [19/Mar/2020:21:10:43 +0000] "GET /v1/images/<user>/n8/fmri:sha256.92278b7c046c0acf0952b3e1663b8abb819c260e8a96705bad90833d87ca0874?arch=amd64 HTTP/1.1" 200 641 "-" "Go-http-client/1.1" "-"
<ip> - - [19/Mar/2020:21:10:43 +0000] "GET /version HTTP/1.1" 200 58 "-" "Go-http-client/1.1" "-"
<ip>- - [19/Mar/2020:21:10:43 +0000] "POST /v2/imagefile/20/_multipart HTTP/1.1" 404 0 "-" "Go-http-client/1.1" "-"
<ip> - - [19/Mar/2020:21:10:43 +0000] "POST /v2/imagefile/20 HTTP/1.1" 200 109 "-" "Go-http-client/1.1" "-"
<ip>- - [19/Mar/2020:21:13:31 +0000] "PUT /v2/push/imagefile/20/07752669-c23a-4313-b0fb-cbfc69a37ab3 HTTP/1.1" 400 0 "-" "Go-http-client/1.1" "-"
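
The last two nginx entries bracket the failed upload: the `POST /v2/imagefile/20` that precedes it and the `PUT` that ends with a 400. A small sketch for computing the gap between them, assuming the timestamps follow the format string below (inferred from the log lines above):

```python
from datetime import datetime

# Assumed layout of the bracketed nginx timestamp, e.g. 19/Mar/2020:21:10:43 +0000
NGINX_TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"

started = datetime.strptime("19/Mar/2020:21:10:43 +0000", NGINX_TIME_FORMAT)
failed = datetime.strptime("19/Mar/2020:21:13:31 +0000", NGINX_TIME_FORMAT)

elapsed_seconds = int((failed - started).total_seconds())
print(elapsed_seconds)
```

The gap is 168 seconds, which matches the `2m48s` shown in the Singularity progress line.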

uwsgi logs

[pid: 39|app: 0|req: 12/59] 1<ip> () {32 vars in 446 bytes} [Thu Mar 19 16:10:24 2020] GET /assets/config/config.prod.json => generated 505 bytes in 13 msecs (HTTP/1.1 200) 3 headers in 109 bytes (1 switches on core 3)
[pid: 40|app: 0|req: 18/60] <ip> () {32 vars in 446 bytes} [Thu Mar 19 16:10:24 2020] GET /assets/config/config.prod.json => generated 505 bytes in 6 msecs (HTTP/1.1 200) 3 headers in 109 bytes (1 switches on core 0)
GET NamedEntityView
<QueryDict: {}>
<user>
[pid: 38|app: 0|req: 12/61] <ip> () {34 vars in 480 bytes} [Thu Mar 19 16:10:43 2020] GET /v1/entities/<user> => generated 311 bytes in 38 msecs (HTTP/1.1 200) 5 headers in 142 bytes (1 switches on core 2)
GET GetNamedCollectionView
[pid: 38|app: 0|req: 13/62] <ip> () {34 vars in 492 bytes} [Thu Mar 19 16:10:43 2020] GET /v1/collections/<user>/n8 => generated 396 bytes in 76 msecs (HTTP/1.1 200) 5 headers in 142 bytes (1 switches on core 1)
GET GetNamedContainerView
[pid: 40|app: 0|req: 19/63] <ip> () {34 vars in 500 bytes} [Thu Mar 19 16:10:43 2020] GET /v1/containers/<user>/n8/fmri => generated 667 bytes in 90 msecs (HTTP/1.1 200) 5 headers in 142 bytes (1 switches on core 3)
GET PushNamedContainerView
[pid: 40|app: 0|req: 20/64] <ip> () {34 vars in 657 bytes} [Thu Mar 19 16:10:43 2020] GET /v1/images/<user>/n8/fmri:sha256.92278b7c046c0acf0952b3e1663b8abb819c260e8a96705bad90833d87ca0874?arch=amd64 => generated 641 bytes in 68 msecs (HTTP/1.1 200) 5 headers in 142 bytes (1 switches on core 2)
[pid: 40|app: 0|req: 21/65] <ip> () {34 vars in 448 bytes} [Thu Mar 19 16:10:43 2020] GET /version => generated 58 bytes in 3 msecs (HTTP/1.1 200) 5 headers in 141 bytes (1 switches on core 1)
POST RequestMultiPartPushImageFileView
Not Found: /v2/imagefile/20/_multipart
[pid: 37|app: 0|req: 20/66] <ip> () {36 vars in 514 bytes} [Thu Mar 19 16:10:43 2020] POST /v2/imagefile/20/_multipart => generated 0 bytes in 5 msecs (HTTP/1.1 404) 4 headers in 110 bytes (1 switches on core 2)
POST RequestPushImageFileView
[pid: 39|app: 0|req: 13/67] <ip> () {36 vars in 494 bytes} [Thu Mar 19 16:10:43 2020] POST /v2/imagefile/20 => generated 109 bytes in 41 msecs (HTTP/1.1 200) 5 headers in 137 bytes (1 switches on core 0)

Thanks for your help!

vsoch commented 4 years ago

I can only really help if I can reproduce your issue (which I cannot). If it helps, it looks a little like this: https://github.com/sylabs/singularity/pull/4850#issuecomment-563346984.

It also seems that you are getting the same error with both singularity and sregistry, so the client is not the relevant factor. The reset on port 443 does seem relevant, though: it could be permissions, or perhaps it is SSL related, something about the version of SSL that you are using on your host (the bit I can't reproduce). See https://github.com/dirkjanm/PrivExchange/issues/13#issuecomment-461569851. Anyway, not sure I can help. Good luck!

vsoch commented 4 years ago

hey @n8wood-brown could you please provide me a recipe to build / pull an image of this size? I just tested with 1.3 GB and it worked okay - during development I was testing with an 8GB image and that was working too. I'm thinking that this is specific to your internet connection or Singularity, but I want to be absolutely sure by trying the push myself.

n8wood-brown commented 4 years ago

I'm still not getting anywhere with this. The openssl s_client check is successful. Is it possible to test this using https:// on localhost with a self-signed cert? This system is on a private network, could that have anything to do with it?

vsoch commented 4 years ago

I definitely think so - the way that I usually test and develop is on localhost, and I compile Singularity to allow for http:// for the remote endpoint (the link is provided in the docs on this page) https://singularityhub.github.io/sregistry/docs/client#singularity-push. Then you can use 127.0.0.1 as the remote, and see if the error reproduces there. If that works, then it's some issue with your setup.

n8wood-brown commented 4 years ago

@vsoch - I recompiled Singularity and am still seeing the error. I ran a packet capture on the server to confirm that there were no connections over port 443.

Here's the image file I was testing with (~4.4GB): https://brownbox.brown.edu/download.php?hash=9b8db035

Thanks!

vsoch commented 4 years ago

hey @n8wood-brown - I think this might be a memory issue? When I first tried (without much memory free), it started to upload and then I got an EOF error. Once I freed up some memory and tried again, I didn't have any issue (although it took a while for the upload to finalize).

vanessa@vanessa-ThinkPad-T490s:~/Downloads$ singularity push -U fmriprep-20.0.0.simg library://vsoch/test/fmriprep:20.0.0
WARNING: Skipping container verifying
4.5GiB / 4.5GiB [==============================================================================] 100 % 1.3 GiB/s 0s

You can see the image in my collection folder:

$ tree images/test/
images/test/
├── fmriprep-sha256.92278b7c046c0acf0952b3e1663b8abb819c260e8a96705bad90833d87ca0874.sif
└── trax_centos-sha256.72a69eb0b4963f472058412477cb6b28d074627c409a0f1de8a144f276804c6f.sif

0 directories, 2 files

And my collection interface: image

I'm sorry that I'm not able to reproduce your error, but I suspect it has something to do with the size of the image and what Singularity is doing to send it to the registry, and possibly the memory or setup of your host. (Note that I'm not privy to how this works, but since they added multipart I suspect it's done in one streamed chunk.) If there is something else you'd like me to test to try and reproduce it, I'd be happy to, but I am afraid I cannot help if I am unable to reproduce.

vsoch commented 4 years ago

Let me know if you'd like any information about my host, in case that helps.

n8wood-brown commented 4 years ago

@vsoch - OK I started over from scratch and didn't use the Ansible set up this time around. I'm at least back to where I started. Pushing with http works again but https still does not. How do you test locally with a self-signed cert? I'm getting this:

$ singularity push -U /root/fmri.sif library://<user>/n8/fmri:200318b
FATAL:   Unable to get library service URI: error making request to server: Get https://localhost/assets/config/config.prod.json: x509: certificate signed by unknown authority

vsoch commented 4 years ago

For allowing http, please follow the instructions that I linked here: https://singularityhub.github.io/sregistry/docs/client#singularity-push. There is a link to the exact line in Singularity that you'll need to change before recompiling to allow for http. You shouldn't need to use a certificate. I'm not actually privy to the checks that Singularity does for them, but it looks like it's not happy with self-signed ones.

vsoch commented 4 years ago

https://github.com/sylabs/singularity/blob/5e483be4af2e120e646d33f0757e855c8d3be2da/internal/pkg/remote/remote.go#L237

vsoch commented 4 years ago

I'm not sure that it makes a difference, but I also used 127.0.0.1 instead of localhost.

n8wood-brown commented 4 years ago

I actually did compile a version of singularity with that change and it does work with http. But I'm hoping to test locally with https and a self-signed cert so I can see if there's anything wrong with my real cert. I guess that's not an option.

vsoch commented 4 years ago

Ah okay. I haven't done anything with self signed certs (I usually go from http to certbot signed ones) so I'm afraid I don't have insight there. But I do think it would make sense (for a sanity check) to try the setup and push with just http.

n8wood-brown commented 4 years ago

I'm planning to test this with Certbot once I have access to an Internet-accessible system that can be verified by Let's Encrypt.

vsoch commented 4 years ago

ah, good luck! I guess I'm lucky, I generally deploy to a cloud instance so the hardest thing is navigating the slippery white interfaces to find networking / DNS configuration. Keep me updated!

n8wood-brown commented 4 years ago

No luck on a new server set up with Certbot, uploads still die around 545MB. I'm still not sure if this is a problem on the client side or the server side. If you still have that large container I posted, would you mind seeing if you have any success? It's configured with only Google oauth at the moment. https://pshubcit.services.brown.edu

vsoch commented 4 years ago

@n8wood-brown please see #298, which should greatly help with your issue! I wanted to get it done asap because I remembered you were having trouble.

I just created an account, added the host as a remote, verified my token, and I'm pushing a large container (4.5GB):

$ singularity push -U big.sif library://vsochat/bigcontainers/chonker:latest
WARNING: Skipping container verifying
544.5MiB / 4.5GiB [========>---------------------------------------------------------------------] 12 % 1.9 MiB/s 36m21s
FATAL:   Unable to push image to library: error uploading image: Put https://pshubcit.services.brown.edu/v2/push/imagefile/3/d10a7ad8-8cb6-482d-9e3c-7a1fb771861c: write tcp 10.0.0.196:39952->128.148.254.126:443: write: connection reset by peer

I reproduced what you are seeing, and I would suspect this is an issue due to the size of your server - if the memory limit cannot hold the entire file, it would be logical that it would fail. I've also tested sregistry many times recently on my current host, and I don't run into this issue (so it must be an issue of memory where it is hosted).

Do you want to try #298 with multipart uploads to see if it helps? It would mean signed URLs, the uploads being done in parts, and if absolutely needed, you could host or scale the Minio for storage separately (and not burden the uwsgi container). I am hopeful that this PR will fix this issue for you, sorry it's been such a trouble!
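
For illustration only, uploading "in parts" means the client sends a sequence of bounded byte ranges rather than one multi-gigabyte request body. The sketch below is not the code from #298; the `part_ranges` helper and the 64 MiB part size are hypothetical, and the image size is the 4639 MB figure from the sregistry push output earlier in this thread, treated as MiB:

```python
def part_ranges(total_bytes, part_bytes):
    """Yield (part_number, start, end) byte ranges; end is exclusive."""
    part_number = 1
    for start in range(0, total_bytes, part_bytes):
        end = min(start + part_bytes, total_bytes)
        yield part_number, start, end
        part_number += 1


MIB = 1024 * 1024
image_bytes = 4639 * MIB  # size from the push output above, treated as MiB
part_bytes = 64 * MIB     # hypothetical part size

parts = list(part_ranges(image_bytes, part_bytes))
print(len(parts), parts[0], parts[-1])
```

With these assumed numbers each request carries at most 64 MiB, so no single request comes near the size at which the pushes in this thread were reset.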

n8wood-brown commented 4 years ago

@vsoch - I just doubled the memory to 32GB and that didn't help. I'll look into the multipart uploads link you posted.

Thank you so much for the attention you've put into this!

n8wood-brown commented 4 years ago

I wasn't able to figure this out so we ended up allowing http instead. Thank you @vsoch for your help!

vsoch commented 4 years ago

Alrighty, but please try the new Minio backend if you want https, it works remarkably better than my original implementation.