google / gvisor

Application Kernel for Containers
https://gvisor.dev
Apache License 2.0
15.63k stars 1.29k forks source link

gvisor prevents AMQP sockets from opening (TCP_SYNCNT) #1441

Closed mcowger closed 4 years ago

mcowger commented 4 years ago

When running a simple application to submit jobs via AMQP to a Celery server, gvisor prevents the sockets from being opened to send the data. Debug logs are attached.

Base simple application that listens via HTTP and makes request, along with python requirements and Dockerfile: https://gist.github.com/mcowger/7d4ab07a75dc1ddddd1f1fb20dc5d8fc

When running this compiled docker image under regular containerd, it runs fine. When run under the runsc runtime, it fails. When run under Google Cloud Run (where I actually first encountered the issue), the error is easier to see (though gvisor tries to disclaim responsibility):

Container Sandbox: Unsupported syscall getsockopt(0x5,0x6,0xa,0x3ee2d2dfb3c0,0x3ee2d2dfb3c4,0x0). It is very likely that you can safely ignore this message and that this is not the cause of any error you might be troubleshooting. Please, refer to https://gvisor.dev/c/linux/amd64/getsockopt for more information.

Other useful information to include is:

runsc version release-20191213.0 spec: 1.0.1-dev

Server: Docker Engine - Community Engine: Version: 19.03.5 API version: 1.40 (minimum version 1.12) Go version: go1.12.12 Git commit: 633a0ea838 Built: Wed Nov 13 07:24:29 2019 OS/Arch: linux/amd64 Experimental: false containerd: Version: 1.2.10 GitCommit: b34a5c8af56e510852c35414db4c1f4fa6172339 runc: Version: 1.0.0-rc8+dev GitCommit: 3e425f80a8c931f88e6d94a8c831b9d5aa481657 docker-init: Version: 0.18.0 GitCommit: fec3683

Linux devbox 4.9.0-11-amd64 #1 SMP Debian 4.9.189-3+deb9u2 (2019-11-11) x86_64 GNU/Linux

docker run -p 8080:8080 celery  # works fine
docker run --runtime=runsc -p 8080:8080 celery

runsc.bug.zip

hbhasker commented 4 years ago

I believe the real reason this is failing is that we don't support SO_DEFER_ACCEPT.

I1219 22:53:14.348108 2791 x:0] [ 3] python E setsockopt(0x5 socket:[3], 0x1, 0x9, 0x7faf9cffb600, 0x4) I1219 22:53:14.348127 2791 x:0] [ 3] python X setsockopt(0x5 socket:[3], 0x1, 0x9, 0x7faf9cffb600, 0x4) = 0x0 (7.473µs) I1219 22:53:14.348185 2791 x:0] [ 3] python E getsockopt(0x5 socket:[3], 0x6, 0x9, 0x7faf9cffb3c0, 0x7faf9cffb3c4) I1219 22:53:14.348265 2791 x:0] Unsupported syscall: getsockopt, regs: amd64: I1219 22:53:14.348592 2791 x:0] [ 3] python X getsockopt(0x5 socket:[3], 0x6, 0x9, 0x7faf9cffb3c0, 0x7faf9cffb3c4) = 0x0 errno=92 (protocol not available) (356.828µs) <---- This is SO_DEFER_ACCEPT

I1219 22:53:14.348725 2791 x:0] [ 3] python E close(0x5 socket:[3]) I1219 22:53:14.348834 2791 x:0] [ 3] python X close(0x5 socket:[3]) = 0x0 (

That said its rather strange that its trying to set SO_DEFER_ACCEPT on a connecting socket and not a listening one.

dinvlad commented 4 years ago

Just stumbled on it as well with Celery. Do I understand it right that this syscall is simply not implemented, i.e. it's not a matter of a missing capability that could be granted to the container? Thanks!

hbhasker commented 4 years ago

SO_DEFER_ACCEPT is now supported. But thats likely not the reason why it's failing. Could you try again and if it fails provide a full strace log and repro steps.

hbhasker commented 4 years ago

Actually nm I see the steps above. Let me try then and see .

xixa commented 4 years ago

Not sure if that gives any insight on the issue, but I'm getting the exact same warning with my Elixir application. The application only serves a GraphQL endpoint and I get several lines of "Unsupported syscall getsockopt" for every API request.

vinycelebryts commented 4 years ago

Hey guys, I'm just fronting the same problem. Any light on it ?

hbhasker commented 4 years ago

Apologies this fell of my radar. Let me spend sometime to look at this right now.

hbhasker commented 4 years ago

It doesn't look like an issue with running the image. It fails to connect and from the error it seems like its failing to resolve the amqp:// address. Let me take a look

docker run --runtime=runsc-d-p 8080:8080 celery Unable to find image 'celery:latest' locally latest: Pulling from library/celery ef0380f84d05: Pull complete ada810c79ed7: Pull complete 4608a1c4fe47: Pull complete 58086cbb21fb: Pull complete a7bccb4a3faa: Pull complete 9de06a08ec25: Pull complete ad6feb8c6a6b: Pull complete 7568ca85d492: Pull complete 2d6f458f7411: Pull complete Digest: sha256:5c236059192a0389a2be21fc42d8db59411d953b7af5457faf501d4eec32dc31 Status: Downloaded newer image for celery:latest [2020-05-06 15:39:34,841: WARNING/MainProcess] /usr/local/lib/python3.5/site-packages/celery/apps/worker.py:161: CDeprecationWarning: Starting from version 3.2 Celery will refuse to accept pickle by default.

The pickle serializer is a security concern as it may give attackers the ability to execute any command. It's important to secure your broker from unauthorized access when using pickle, so we think that enabling pickle should require a deliberate action and not be the default choice.

If you depend on pickle then you should set a setting to disable this warning and to be sure that everything will continue working when you upgrade to Celery 3.2::

CELERY_ACCEPT_CONTENT = ['pickle', 'json', 'msgpack', 'yaml']

You must only enable the serializers that you will actually use.

warnings.warn(CDeprecationWarning(W_PICKLE_DEPRECATED)) [2020-05-06 15:39:36,647: ERROR/MainProcess] consumer: Cannot connect to amqp://guest:**@rabbit:5672//: [Errno -2] Name or service not known. Trying again in 2.00 seconds...

[2020-05-06 15:39:38,857: ERROR/MainProcess] consumer: Cannot connect to amqp://guest:**@rabbit:5672//: [Errno -2] Name or service not known. Trying again in 4.00 seconds...

[2020-05-06 15:39:43,046: ERROR/MainProcess] consumer: Cannot connect to amqp://guest:**@rabbit:5672//: [Errno -2] Name or service not known. Trying again in 6.00 seconds...

mcowger commented 4 years ago

Please note that the issue you are seeing with name resolution is not the one I reported. Name resolution within the container is not the challenge we are seeing with the failing syscall.

On Wed, May 6, 2020 at 8:41 AM Bhasker Hariharan notifications@github.com wrote:

It doesn't look like an issue with running the image. It fails to connect and from the error it seems like its failing to resolve the amqp:// address. Let me take a look

docker run --runtime=runsc-d-p 8080:8080 celery Unable to find image 'celery:latest' locally latest: Pulling from library/celery ef0380f84d05: Pull complete ada810c79ed7: Pull complete 4608a1c4fe47: Pull complete 58086cbb21fb: Pull complete a7bccb4a3faa: Pull complete 9de06a08ec25: Pull complete ad6feb8c6a6b: Pull complete 7568ca85d492: Pull complete 2d6f458f7411: Pull complete Digest: sha256:5c236059192a0389a2be21fc42d8db59411d953b7af5457faf501d4eec32dc31 Status: Downloaded newer image for celery:latest [2020-05-06 15:39:34,841: WARNING/MainProcess] /usr/local/lib/python3.5/site-packages/celery/apps/worker.py:161: CDeprecationWarning: Starting from version 3.2 Celery will refuse to accept pickle by default.

The pickle serializer is a security concern as it may give attackers the ability to execute any command. It's important to secure your broker from unauthorized access when using pickle, so we think that enabling pickle should require a deliberate action and not be the default choice.

If you depend on pickle then you should set a setting to disable this warning and to be sure that everything will continue working when you upgrade to Celery 3.2::

CELERY_ACCEPT_CONTENT = ['pickle', 'json', 'msgpack', 'yaml']

You must only enable the serializers that you will actually use.

warnings.warn(CDeprecationWarning(W_PICKLE_DEPRECATED)) [2020-05-06 15:39:36,647: ERROR/MainProcess] consumer: Cannot connect to amqp://guest:**@rabbit https://github.com/rabbit:5672//: [Errno -2] Name or service not known. Trying again in 2.00 seconds...

[2020-05-06 15:39:38,857: ERROR/MainProcess] consumer: Cannot connect to amqp://guest:**@rabbit https://github.com/rabbit:5672//: [Errno -2] Name or service not known. Trying again in 4.00 seconds...

[2020-05-06 15:39:43,046: ERROR/MainProcess] consumer: Cannot connect to amqp://guest:**@rabbit https://github.com/rabbit:5672//: [Errno -2] Name or service not known. Trying again in 6.00 seconds...

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/google/gvisor/issues/1441#issuecomment-624725552, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAOXCTHK2VKZXKDDJU2SCMTRQGAKDANCNFSM4J5SIQCQ .

-- -- Matt

hbhasker commented 4 years ago

I started up a rabbitmq server locally and it works. Here are the commands

docker run -d --hostname localhost --name some-rabbit rabbitmq:3

docker run --runtime=runsc --link some-rabbit:rabbit -p 8080:8080 celery [2020-05-06 15:50:19,235: WARNING/MainProcess] /usr/local/lib/python3.5/site-packages/celery/apps/worker.py:161: CDeprecationWarning: Starting from version 3.2 Celery will refuse to accept pickle by default.

The pickle serializer is a security concern as it may give attackers the ability to execute any command. It's important to secure your broker from unauthorized access when using pickle, so we think that enabling pickle should require a deliberate action and not be the default choice.

If you depend on pickle then you should set a setting to disable this warning and to be sure that everything will continue working when you upgrade to Celery 3.2::

CELERY_ACCEPT_CONTENT = ['pickle', 'json', 'msgpack', 'yaml']

You must only enable the serializers that you will actually use.

warnings.warn(CDeprecationWarning(W_PICKLE_DEPRECATED)) [2020-05-06 15:50:21,965: WARNING/MainProcess] celery@1e91a5dba68f ready.

hbhasker commented 4 years ago

@mcowger hmm I scanned the logs for the repro I tried and I don't see celery making that call at all. Could you provide a bit more details on your rabbitMQ setup and how you are connecting to it.

mcowger commented 4 years ago

If you aren’t actually submitting jobs to the AMQP server via celery (like the reproduction instructions above) then the test is not complete. The issue is not running celery under docker, the issue to issuing AMQP message to celery from python running under runc. There is a clear sample code to reproduce in the gist linked above.

On Wed, May 6, 2020 at 8:52 AM Bhasker Hariharan notifications@github.com wrote:

I started up a rabbitmq server locally and it works. Here are the commands

docker run -d --hostname localhost --name some-rabbit rabbitmq:3

docker run --runtime=runsc --link some-rabbit:rabbit -p 8080:8080 celery [2020-05-06 15:50:19,235: WARNING/MainProcess] /usr/local/lib/python3.5/site-packages/celery/apps/worker.py:161: CDeprecationWarning: Starting from version 3.2 Celery will refuse to accept pickle by default.

The pickle serializer is a security concern as it may give attackers the ability to execute any command. It's important to secure your broker from unauthorized access when using pickle, so we think that enabling pickle should require a deliberate action and not be the default choice.

If you depend on pickle then you should set a setting to disable this warning and to be sure that everything will continue working when you upgrade to Celery 3.2::

CELERY_ACCEPT_CONTENT = ['pickle', 'json', 'msgpack', 'yaml']

You must only enable the serializers that you will actually use.

warnings.warn(CDeprecationWarning(W_PICKLE_DEPRECATED)) [2020-05-06 15:50:21,965: WARNING/MainProcess] celery@1e91a5dba68f ready.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/google/gvisor/issues/1441#issuecomment-624731880, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAOXCTEIZCQWTO3WZ4VR5RTRQGBSDANCNFSM4J5SIQCQ .

-- -- Matt

hbhasker commented 4 years ago

Ah apologies I read the report again so its not celery that's failing but your python application that is connecting to celery?

hbhasker commented 4 years ago

Let me try that sorry about that.

hbhasker commented 4 years ago

I0506 16:21:29.306868 175117 strace.go:628] [ 3] python X getsockopt(0x5 socket:[19], SOL_TCP, TCP_DEFER_ACCEPT, 0x7fc1f83fb3c0 {value=0}, 0x7fc1f83fb3c4 {length=4}) = 0x0 (227.514µs) I0506 16:21:29.307040 175117 strace.go:590] [ 3] python E getsockopt(0x5 socket:[19], SOL_TCP, TCP_SYNCNT, 0x7fc1f83fb3c0, 0x7fc1f83fb3c4 {length=4}) I0506 16:21:29.307097 175117 strace.go:628] [ 3] python X getsockopt(0x5 socket:[19], SOL_TCP, TCP_SYNCNT, 0x7fc1f83fb3c0, 0x7fc1f83fb3c4 {length=4}) = 0x0 errno=92 (protocol not available) (20.743µs) I0506 16:21:29.307268 175117 strace.go:578] [ 3] python E close(0x5 socket:[19])

Looks like its now not failing on TCP_DEFER_ACCEPT but on another socket option that we don't support yet. TCP_SYNCNT.

hbhasker commented 4 years ago

TCP_SYNCNT (since Linux 2.4) Set the number of SYN retransmits that TCP should send before aborting the attempt to connect. It cannot exceed 255. This option should not be used in code intended to be portable.

Netstack doesn't control retries using a count of SYN's but its just bounded by time. Let me stub out this option for now so that it doesn't return an error and see if it makes progress.

hbhasker commented 4 years ago

Strange now I stubbed that out and it flunks in another getsockopt TCP_WINDOW_CLAMP but now I don't see it making the TCP_SYNCNT call.

I0506 16:41:55.278268  190228 compat.go:122] Unsupported syscall: getsockopt, regs: amd64:{rax:18446744073709551578 rbx:139660612445664 rcx:139660652557770 rdx:10 rsi:6 rdi:5 rsp:139660623852456 rbp:139660623852476 r8:139660623852484 r9:39 r10:139660623852480 r11:659 r12:139660637795321 r13:139660659080256 r14:139660658854592 r15:139660525944904 rip:139660652557770 rflags:659 orig_rax:55 cs:51 ss:43 fs_base:139660623873792}
I0506 16:41:55.279419  190228 strace.go:628] [   3] python X getsockopt(0x5 socket:[3], SOL_TCP, TCP_WINDOW_CLAMP, 0x7f0545dfb3c0, 0x7f0545dfb3c4 {length=4}) = 0x0 errno=92 (protocol not available) (1.16589ms)
hbhasker commented 4 years ago

Okay stubbing those two out seems to have fixed it. I now see packets going back and forth w/ AMQP (that said the curl request does not return.) But that seems unrelated to a gvisor issue and probably just that something about my local setup of rabbitmq+ celery is probably not correct.

I see packets like this going back and forth.

 5345 3394.254142701  192.168.9.2 → 192.168.9.3  TCP 66 5672 → 38964 [ACK] Seq=3702026781 Ack=3323880087 Win=64128 Len=0 TSval=1949979946 TSecr=3707238452
 5346 3394.254496590  192.168.9.3 → 192.168.9.2  AMQP 150 Content-Header type=application/json 
 5347 3394.254532797  192.168.9.2 → 192.168.9.3  TCP 66 5672 → 38964 [ACK] Seq=3702026781 Ack=3323880171 Win=64128 Len=0 TSval=1949979946 TSecr=3707238452
 5348 3394.254842587  192.168.9.3 → 192.168.9.2  AMQP 344 Content-Body  (application/json)
 5349 3394.254872180  192.168.9.2 → 192.168.9.3  TCP 66 5672 → 38964 [ACK] Seq=3702026781 Ack=3323880449 Win=64128 Len=0 TSval=1949979947 TSecr=3707238453
 5350 3394.255804983  192.168.9.2 → 192.168.9.3  AMQP 481 Basic.Deliver x=celeryev rk=worker.heartbeat Content-Header type=application/json Content-Body  (application/json)
 5351 3394.256146501  192.168.9.3 → 192.168.9.2  TCP 66 38963 → 5672 [ACK] Seq=586704009 Ack=669080124 Win=1048064 Len=0 TSval=865816751 TSecr=1949979948
^[[5~ 5352 3396.259715829  192.168.9.3 → 192.168.9.2  AMQP 107 Basic.Publish x=celeryev rk=worker.heartbeat 
 5353 3396.259788466  192.168.9.2 → 192.168.9.3  TCP 66 5672 → 38964 [ACK] Seq=3702026781 Ack=3323880490 Win=64128 Len=0 TSval=1949981952 TSecr=3707240458
 5354 3396.260151838  192.168.9.3 → 192.168.9.2  AMQP 150 Content-Header type=application/json 
 5355 3396.260184177  192.168.9.2 → 192.168.9.3  TCP 66 5672 → 38964 [ACK] Seq=3702026781 Ack=3323880574 Win=64128 Len=0 TSval=1949981952 TSecr=3707240458
 5356 3396.260505283  192.168.9.3 → 192.168.9.2  AMQP 344 Content-Body  (application/json)
 5357 3396.260533305  192.168.9.2 → 192.168.9.3  TCP 66 5672 → 38964 [ACK] Seq=3702026781 Ack=3323880852 Win=64128 Len=0 TSval=1949981952 TSecr=3707240458
 5358 3396.261005551  192.168.9.2 → 192.168.9.3  AMQP 481 Basic.Deliver x=celeryev rk=worker.heartbeat Content-Header type=application/json Content-Body  (application/json)
 5359 3396.261265557  192.168.9.3 → 192.168.9.2  TCP 66 38963 → 5672 [ACK] Seq=586704009 Ack=669080539 Win=1048064 Len=0 TSval=865818756 TSecr=1949981953
 5360 3397.354087208 02:42:bd:d6:cd:ca → 02:42:c0:a8:09:03 ARP 42 Who has 192.168.9.3? Tell 192.168.9.1
 5361 3397.354499185 02:42:c0:a8:09:03 → 02:42:bd:d6:cd:ca ARP 42 192.168.9.3 is at 02:42:c0:a8:09:03
 5362 3398.265989317  192.168.9.3 → 192.168.9.2  AMQP 107 Basic.Publish x=celeryev rk=worker.heartbeat 
 5363 3398.266076361  192.168.9.2 → 192.168.9.3  TCP 66 5672 → 38964 [ACK] Seq=3702026781 Ack=3323880893 Win=64128 Len=0 TSval=1949983958 TSecr=3707242464
 5364 3398.266507185  192.168.9.3 → 192.168.9.2  AMQP 150 Content-Header type=application/json 
 5365 3398.266549194  192.168.9.2 → 192.168.9.3  TCP 66 5672 → 38964 [ACK] Seq=3702026781 Ack=3323880977 Win=64128 Len=0 TSval=1949983958 TSecr=3707242464
 5366 3398.266895787  192.168.9.3 → 192.168.9.2  AMQP 344 Content-Body  (application/json)
 5367 3398.266927073  192.168.9.2 → 192.168.9.3  TCP 66 5672 → 38964 [ACK] Seq=3702026781 Ack=3323881255 Win=64128 Len=0 TSval=1949983959 TSecr=3707242465
 5368 3398.267581418  192.168.9.2 → 192.168.9.3  AMQP 481 Basic.Deliver x=celeryev rk=worker.heartbeat Content-Header type=application/json Content-Body  (application/json)
 5369 3398.267906279  192.168.9.3 → 192.168.9.2  TCP 66 38963 → 5672 [ACK] Seq=586704009 Ack=669080954 Win=1048064 Len=0 TSval=865820763 TSecr=1949983959
 5370 3400.273521509  192.168.9.3 → 192.168.9.2  AMQP 107 Basic.Publish x=celeryev rk=worker.heartbeat 
 5371 3400.273586881  192.168.9.2 → 192.168.9.3  TCP 66 5672 → 38964 [ACK] Seq=3702026781 Ack=3323881296 Win=64128 Len=0 TSval=1949985965 TSecr=3707244471
 5372 3400.274033527  192.168.9.3 → 192.168.9.2  AMQP 150 Content-Header type=application/json 
 5373 3400.274110673  192.168.9.2 → 192.168.9.3  TCP 66 5672 → 38964 [ACK] Seq=3702026781 Ack=3323881380 Win=64128 Len=0 TSval=1949985966 TSecr=3707244472
 5374 3400.274504739  192.168.9.3 → 192.168.9.2  AMQP 344 Content-Body  (application/json)
 5375 3400.274553376  192.168.9.2 → 192.168.9.3  TCP 66 5672 → 38964 [ACK] Seq=3702026781 Ack=3323881658 Win=64128 Len=0 TSval=1949985966 TSecr=3707244472
 5376 3400.275106224  192.168.9.2 → 192.168.9.3  AMQP 481 Basic.Deliver x=celeryev rk=worker.heartbeat Content-Header type=application/json Content-Body  (application/json)
 5377 3400.275385423  192.168.9.3 → 192.168.9.2  TCP 66 38963 → 5672 [ACK] Seq=586704009 Ack=669081369 Win=1048064 Len=0 TSval=865822770 TSecr=1949985967
 5378 3402.279475689  192.168.9.3 → 192.168.9.2  AMQP 107 Basic.Publish x=celeryev rk=worker.heartbeat 
 5379 3402.279529683  192.168.9.2 → 192.168.9.3  TCP 66 5672 → 38964 [ACK] Seq=3702026781 Ack=3323881699 Win=64128 Len=0 TSval=1949987971 TSecr=3707246477
 5380 3402.279822370  192.168.9.3 → 192.168.9.2  AMQP 150 Content-Header type=application/json 
 5381 3402.279856663  192.168.9.2 → 192.168.9.3  TCP 66 5672 → 38964 [ACK] Seq=3702026781 Ack=3323881783 Win=64128 Len=0 TSval=1949987972 TSecr=3707246478
 5382 3402.280145915  192.168.9.3 → 192.168.9.2  AMQP 342 Content-Body  (application/json)
 5383 3402.280185891  192.168.9.2 → 192.168.9.3  TCP 66 5672 → 38964 [ACK] Seq=3702026781 Ack=3323882059 Win=64128 Len=0 TSval=1949987972 TSecr=3707246478
 5384 3402.280600908  192.168.9.2 → 192.168.9.3  AMQP 479 Basic.Deliver x=celeryev rk=worker.heartbeat Content-Header type=application/json Content-Body  (application/json)
 5385 3402.280888065  192.168.9.3 → 192.168.9.2  TCP 66 38963 → 5672 [ACK] Seq=586704009 Ack=669081782 Win=1048064 Len=0 TSval=865824776 TSecr=1949987972
 5386 3404.285404506  192.168.9.3 → 192.168.9.2  AMQP 107 Basic.Publish x=celeryev rk=worker.heartbeat 
 5387 3404.285473204  192.168.9.2 → 192.168.9.3  TCP 66 5672 → 38964 [ACK] Seq=3702026781 Ack=3323882100 Win=64128 Len=0 TSval=1949989977 TSecr=3707248483
 5388 3404.285878425  192.168.9.3 → 192.168.9.2  AMQP 150 Content-Header type=application/json 
 5389 3404.285916112  192.168.9.2 → 192.168.9.3  TCP 66 5672 → 38964 [ACK] Seq=3702026781 Ack=3323882184 Win=64128 Len=0 TSval=1949989978 TSecr=3707248484
 5390 3404.286373040  192.168.9.3 → 192.168.9.2  AMQP 344 Content-Body  (application/json)
 5391 3404.286409072  192.168.9.2 → 192.168.9.3  TCP 66 5672 → 38964 [ACK] Seq=3702026781 Ack=3323882462 Win=64128 Len=0 TSval=1949989978 TSecr=3707248484
 5392 3404.287043351  192.168.9.2 → 192.168.9.3  AMQP 481 Basic.Deliver x=celeryev rk=worker.heartbeat Content-Header type=application/json Content-Body  (application/json)
 5393 3404.287336718  192.168.9.3 → 192.168.9.2  TCP 66 38963 → 5672 [ACK] Seq=586704009 Ack=669082197 Win=1048064 Len=0 TSval=865826782 TSecr=1949989979
 5394 3406.291692821  192.168.9.3 → 192.168.9.2  AMQP 107 Basic.Publish x=celeryev rk=worker.heartbeat 
 5395 3406.291758482  192.168.9.2 → 192.168.9.3  TCP 66 5672 → 38964 [ACK] Seq=3702026781 Ack=3323882503 Win=64128 Len=0 TSval=1949991984 TSecr=3707250490
 5396 3406.292168290  192.168.9.3 → 192.168.9.2  AMQP 150 Content-Header type=application/json 
 5397 3406.292199229  192.168.9.2 → 192.168.9.3  TCP 66 5672 → 38964 [ACK] Seq=3702026781 Ack=3323882587 Win=64128 Len=0 TSval=1949991984 TSecr=3707250490
 5398 3406.292491836  192.168.9.3 → 192.168.9.2  AMQP 344 Content-Body  (application/json)
 5399 3406.292518760  192.168.9.2 → 192.168.9.3  TCP 66 5672 → 38964 [ACK] Seq=3702026781 Ack=3323882865 Win=64128 Len=0 TSval=1949991984 TSecr=3707250490
 5400 3406.293005423  192.168.9.2 → 192.168.9.3  AMQP 481 Basic.Deliver x=celeryev rk=worker.heartbeat Content-Header type=application/json Content-Body  (application/json)
 5401 3406.293379361  192.168.9.3 → 192.168.9.2  TCP 66 38963 → 5672 [ACK] Seq=586704009 Ack=669082612 Win=1048064 Len=0 TSval=865828788 TSecr=1949991985
hbhasker commented 4 years ago

I will open two bugs to add TCP_SYNCNT and TCP_WINDOW_CLAMP. I think we can probably start by just stubbing them out to unblock this and then look at implementing it properly.

hbhasker commented 4 years ago
I0506 17:04:01.417799  211953 strace.go:584] [   3] python E socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC, IPPROTO_TCP)
I0506 17:04:01.417898  211953 strace.go:622] [   3] python X socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC, IPPROTO_TCP) = 0x9 (47.271µs)
I0506 17:04:01.418085  211953 strace.go:584] [   3] python E fcntl(0x9 socket:[7], 0x1, 0x0)
I0506 17:04:01.418135  211953 strace.go:622] [   3] python X fcntl(0x9 socket:[7], 0x1, 0x0) = 0x1 (3.13µs)
I0506 17:04:01.418280  211953 strace.go:584] [   3] python E fcntl(0x9 socket:[7], 0x2, 0x1)
I0506 17:04:01.418314  211953 strace.go:622] [   3] python X fcntl(0x9 socket:[7], 0x2, 0x1) = 0x0 (3.584µs)
I0506 17:04:01.418429  211953 strace.go:584] [   3] python E ioctl(0x9 socket:[7], 0x5421, 0x7fa6e69fb64c)
I0506 17:04:01.418468  211953 strace.go:622] [   3] python X ioctl(0x9 socket:[7], 0x5421, 0x7fa6e69fb64c) = 0x0 (8.852µs)
I0506 17:04:01.418609  211953 strace.go:584] [   3] python E connect(0x9 socket:[7], 0x7fa6e69fb5f0 {Family: AF_INET, Addr: 192.168.9.1, Port: 8080}, 0x10)
I0506 17:04:01.418696  211953 strace.go:622] [   3] python X connect(0x9 socket:[7], 0x7fa6e69fb5f0 {Family: AF_INET, Addr: 192.168.9.1, Port: 8080}, 0x10) = 0x0 errno=115 (operation now in progress) (52
.179µs)
I0506 17:04:01.418810  211953 sniffer.go:448] send tcp 192.168.9.4:0 -> 192.168.9.1:0 len:40 id:0000 invalid packet: tcp data offset 40 larger than packet buffer length 20
I0506 17:04:01.418932  211953 strace.go:584] [   3] python E poll(0x7fa6e69fb4f8 [{FD: 0x9 socket:[7], Events: POLLOUT|POLLERR, REvents: ...}], 0x1, 0xfa0)
I0506 17:04:01.419036  211953 sniffer.go:448] recv tcp 192.168.9.1:0 -> 192.168.9.4:0 len:40 id:0000 invalid packet: tcp data offset 40 larger than packet buffer length 20
I0506 17:04:01.419271  211953 sniffer.go:448] send tcp 192.168.9.4:0 -> 192.168.9.1:0 len:32 id:0000 invalid packet: tcp data offset 32 larger than packet buffer length 20 gso: &{Type:1 NeedsCsum:true Cs
umOffset:16 MSS:1420 L3HdrLen:20 MaxSize:65536}
I0506 17:04:01.419371  211953 strace.go:622] [   3] python X poll(0x7fa6e69fb4f8 [{FD: 0x9 socket:[7], Events: POLLOUT|POLLERR|POLLHUP, REvents: POLLOUT}], 0x1, 0xfa0) = 0x1 (389.53µs)
I0506 17:04:01.419469  211953 strace.go:590] [   3] python E getsockopt(0x9 socket:[7], SOL_SOCKET, SO_ERROR, 0x7fa6e69fb4f8, 0x7fa6e69fb4fc {length=4})
I0506 17:04:01.419515  211953 strace.go:628] [   3] python X getsockopt(0x9 socket:[7], SOL_SOCKET, SO_ERROR, 0x7fa6e69fb4f8 {value=0}, 0x7fa6e69fb4fc {length=4}) = 0x0 (10.349µs)
I0506 17:04:01.419626  211953 strace.go:584] [   3] python E ioctl(0x9 socket:[7], 0x5421, 0x7fa6e69fb64c)
I0506 17:04:01.419667  211953 strace.go:622] [   3] python X ioctl(0x9 socket:[7], 0x5421, 0x7fa6e69fb64c) = 0x0 (5.978µs)
I0506 17:04:01.419778  211953 strace.go:590] [   3] python E setsockopt(0x9 socket:[7], SOL_SOCKET, SO_KEEPALIVE, 0x7fa6e69fb600 {value=1}, 0x4)
I0506 17:04:01.419844  211953 strace.go:628] [   3] python X setsockopt(0x9 socket:[7], SOL_SOCKET, SO_KEEPALIVE, 0x7fa6e69fb600, 0x4) = 0x0 (18.73µs)
I0506 17:04:01.419962  211953 strace.go:590] [   3] python E getsockopt(0x9 socket:[7], SOL_TCP, TCP_CORK, 0x7fa6e69fb3c0, 0x7fa6e69fb3c4 {length=4})
I0506 17:04:01.420003  211953 strace.go:628] [   3] python X getsockopt(0x9 socket:[7], SOL_TCP, TCP_CORK, 0x7fa6e69fb3c0 {value=0}, 0x7fa6e69fb3c4 {length=4}) = 0x0 (6.128µs)
I0506 17:04:01.420121  211953 strace.go:590] [   3] python E getsockopt(0x9 socket:[7], SOL_TCP, TCP_LINGER2, 0x7fa6e69fb3c0, 0x7fa6e69fb3c4 {length=4})
I0506 17:04:01.420160  211953 strace.go:628] [   3] python X getsockopt(0x9 socket:[7], SOL_TCP, TCP_LINGER2, 0x7fa6e69fb3c0 {value=60}, 0x7fa6e69fb3c4 {length=4}) = 0x0 (5.798µs)
I0506 17:04:01.420252  211953 strace.go:590] [   3] python E getsockopt(0x9 socket:[7], SOL_TCP, TCP_SYNCNT, 0x7fa6e69fb3c0, 0x7fa6e69fb3c4 {length=4})
I0506 17:04:01.420260  211953 sniffer.go:448] recv tcp 192.168.9.1:0 -> 192.168.9.4:0 len:32 id:ab60 invalid packet: tcp data offset 32 larger than packet buffer length 20
I0506 17:04:01.420290  211953 strace.go:628] [   3] python X getsockopt(0x9 socket:[7], SOL_TCP, TCP_SYNCNT, 0x7fa6e69fb3c0 {value=15}, 0x7fa6e69fb3c4 {length=4}) = 0x0 (4.663µs)
I0506 17:04:01.420419  211953 sniffer.go:448] send tcp 192.168.9.4:0 -> 192.168.9.1:0 len:32 id:0000 invalid packet: tcp data offset 32 larger than packet buffer length 20 gso: &{Type:1 NeedsCsum:true CsumOffset:16 MSS:1420 L3HdrLen:20 MaxSize:65536}
I0506 17:04:01.420492  211953 strace.go:590] [   3] python E getsockopt(0x9 socket:[7], SOL_TCP, TCP_QUICKACK, 0x7fa6e69fb3c0, 0x7fa6e69fb3c4 {length=4})
I0506 17:04:01.420562  211953 strace.go:628] [   3] python X getsockopt(0x9 socket:[7], SOL_TCP, TCP_QUICKACK, 0x7fa6e69fb3c0 {value=1}, 0x7fa6e69fb3c4 {length=4}) = 0x0 (32.187µs)
I0506 17:04:01.420655  211953 strace.go:590] [   3] python E getsockopt(0x9 socket:[7], SOL_TCP, TCP_DEFER_ACCEPT, 0x7fa6e69fb3c0, 0x7fa6e69fb3c4 {length=4})
I0506 17:04:01.420710  211953 strace.go:628] [   3] python X getsockopt(0x9 socket:[7], SOL_TCP, TCP_DEFER_ACCEPT, 0x7fa6e69fb3c0 {value=0}, 0x7fa6e69fb3c4 {length=4}) = 0x0 (5.901µs)
I0506 17:04:01.420799  211953 strace.go:590] [   3] python E getsockopt(0x9 socket:[7], SOL_TCP, TCP_WINDOW_CLAMP, 0x7fa6e69fb3c0, 0x7fa6e69fb3c4 {length=4})
I0506 17:04:01.420837  211953 strace.go:628] [   3] python X getsockopt(0x9 socket:[7], SOL_TCP, TCP_WINDOW_CLAMP, 0x7fa6e69fb3c0 {value=1048576}, 0x7fa6e69fb3c4 {length=4}) = 0x0 (4.57µs)
I0506 17:04:01.420928  211953 strace.go:590] [   3] python E getsockopt(0x9 socket:[7], SOL_TCP, TCP_MAXSEG, 0x7fa6e69fb3c0, 0x7fa6e69fb3c4 {length=4})
I0506 17:04:01.420965  211953 strace.go:628] [   3] python X getsockopt(0x9 socket:[7], SOL_TCP, TCP_MAXSEG, 0x7fa6e69fb3c0 {value=536}, 0x7fa6e69fb3c4 {length=4}) = 0x0 (4.75µs)
I0506 17:04:01.421058  211953 strace.go:590] [   3] python E setsockopt(0x9 socket:[7], SOL_TCP, TCP_KEEPINTVL, 0x7fa6e69fb4b0 {value=10}, 0x4)
I0506 17:04:01.421106  211953 strace.go:628] [   3] python X setsockopt(0x9 socket:[7], SOL_TCP, TCP_KEEPINTVL, 0x7fa6e69fb4b0, 0x4) = 0x0 (18.864µs)
I0506 17:04:01.421189  211953 strace.go:590] [   3] python E setsockopt(0x9 socket:[7], SOL_TCP, TCP_CORK, 0x7fa6e69fb4b0 {value=0}, 0x4)
I0506 17:04:01.421236  211953 strace.go:628] [   3] python X setsockopt(0x9 socket:[7], SOL_TCP, TCP_CORK, 0x7fa6e69fb4b0, 0x4) = 0x0 (17.548µs)
:I0506 17:04:01.421321  211953 strace.go:590] [   3] python E setsockopt(0x9 socket:[7], SOL_TCP, TCP_USER_TIMEOUT, 0x7fa6e69fb4b0 {value=1000}, 0x4)
I0506 17:04:01.421354  211953 strace.go:628] [   3] python X setsockopt(0x9 socket:[7], SOL_TCP, TCP_USER_TIMEOUT, 0x7fa6e69fb4b0, 0x4) = 0x0 (4.277µs)
I0506 17:04:01.421437  211953 strace.go:590] [   3] python E setsockopt(0x9 socket:[7], SOL_TCP, TCP_NODELAY, 0x7fa6e69fb4b0 {value=1}, 0x4)
I0506 17:04:01.421481  211953 strace.go:628] [   3] python X setsockopt(0x9 socket:[7], SOL_TCP, TCP_NODELAY, 0x7fa6e69fb4b0, 0x4) = 0x0 (15.492µs)
I0506 17:04:01.421579  211953 strace.go:590] [   3] python E setsockopt(0x9 socket:[7], SOL_TCP, TCP_KEEPCNT, 0x7fa6e69fb4b0 {value=9}, 0x4)
I0506 17:04:01.421625  211953 strace.go:628] [   3] python X setsockopt(0x9 socket:[7], SOL_TCP, TCP_KEEPCNT, 0x7fa6e69fb4b0, 0x4) = 0x0 (17.197µs)
I0506 17:04:01.421709  211953 strace.go:590] [   3] python E setsockopt(0x9 socket:[7], SOL_TCP, TCP_LINGER2, 0x7fa6e69fb4b0 {value=60}, 0x4)
I0506 17:04:01.421744  211953 strace.go:628] [   3] python X setsockopt(0x9 socket:[7], SOL_TCP, TCP_LINGER2, 0x7fa6e69fb4b0, 0x4) = 0x0 (6.304µs)
I0506 17:04:01.421828  211953 strace.go:590] [   3] python E setsockopt(0x9 socket:[7], SOL_TCP, TCP_SYNCNT, 0x7fa6e69fb4b0 {value=15}, 0x4)
I0506 17:04:01.421865  211953 strace.go:628] [   3] python X setsockopt(0x9 socket:[7], SOL_TCP, TCP_SYNCNT, 0x7fa6e69fb4b0, 0x4) = 0x0 (9.2µs)
I0506 17:04:01.421948  211953 strace.go:590] [   3] python E setsockopt(0x9 socket:[7], SOL_TCP, TCP_KEEPIDLE, 0x7fa6e69fb4b0 {value=60}, 0x4)
I0506 17:04:01.421992  211953 strace.go:628] [   3] python X setsockopt(0x9 socket:[7], SOL_TCP, TCP_KEEPIDLE, 0x7fa6e69fb4b0, 0x4) = 0x0 (15.499µs)
I0506 17:04:01.422076  211953 strace.go:590] [   3] python E setsockopt(0x9 socket:[7], SOL_TCP, TCP_QUICKACK, 0x7fa6e69fb4b0 {value=1}, 0x4)
I0506 17:04:01.422107  211953 strace.go:628] [   3] python X setsockopt(0x9 socket:[7], SOL_TCP, TCP_QUICKACK, 0x7fa6e69fb4b0, 0x4) = 0x0 (3.412µs)
I0506 17:04:01.422190  211953 strace.go:590] [   3] python E setsockopt(0x9 socket:[7], SOL_TCP, TCP_DEFER_ACCEPT, 0x7fa6e69fb4b0 {value=0}, 0x4)
I0506 17:04:01.422222  211953 strace.go:628] [   3] python X setsockopt(0x9 socket:[7], SOL_TCP, TCP_DEFER_ACCEPT, 0x7fa6e69fb4b0, 0x4) = 0x0 (3.713µs)
I0506 17:04:01.422304  211953 strace.go:590] [   3] python E setsockopt(0x9 socket:[7], SOL_TCP, TCP_WINDOW_CLAMP, 0x7fa6e69fb4b0 {value=1048576}, 0x4)
I0506 17:04:01.422341  211953 strace.go:628] [   3] python X setsockopt(0x9 socket:[7], SOL_TCP, TCP_WINDOW_CLAMP, 0x7fa6e69fb4b0, 0x4) = 0x0 (9.229µs)
I0506 17:04:01.422424  211953 strace.go:590] [   3] python E setsockopt(0x9 socket:[7], SOL_TCP, TCP_MAXSEG, 0x7fa6e69fb4b0 {value=536}, 0x4)
I0506 17:04:01.422467  211953 strace.go:628] [   3] python X setsockopt(0x9 socket:[7], SOL_TCP, TCP_MAXSEG, 0x7fa6e69fb4b0, 0x4) = 0x0 (14.656µs)
mcowger commented 4 years ago

that said the curl request does not return.

Unless celery is setup with workers (which it isn't by default) thats expected behaviorm, as there's no workers to complete the task request. I agree that the curl request isn't relevant to this issue, and that if you've get messages going in and gettng accepted then its probably good to go.

hbhasker commented 4 years ago

@prattmic pointed the following out, Looks like AMQP queries TCP options to figure out which ones are supported on what platform so it just sort of skips the setsockopt if a particular one is not supported.

https://github.com/celery/py-amqp/blob/ccbe683cfd30aef75118223088984f7c9ace43e4/amqp/transport.py#L215 https://github.com/celery/py-amqp/blob/ccbe683cfd30aef75118223088984f7c9ace43e4/amqp/transport.py#L195 https://github.com/celery/py-amqp/blob/79833d3e261ad89638ad41c997b31f29a61eb6bf/amqp/platform.py#L37

hbhasker commented 4 years ago

I believe this should now work. @mcowger Could you try and update the bug?

mcowger commented 4 years ago

Unfortunately I dont have the toolset setup/required to build gvisor from source, so I'll just accept your results. thanks!

dinvlad commented 4 years ago

May we know approximately when this fix could be available in GKE Sandbox? Thanks!

hbhasker commented 4 years ago

Typically GKE sandbox should get the update in the next 2-3 weeks.

mauronr commented 4 years ago

Typically GKE sandbox should get the update in the next 2-3 weeks.

Does this timeline apply to Cloud Run as well?

hbhasker commented 4 years ago

Typically yes but please note these are estimates, actual rollout times do vary a bit. We gVisor team have no real control over the rollout processes and CloudRun and GKE teams decide when they pick up a particular gVisor version based on the stability of their corresponding systems.

dinvlad commented 4 years ago

Thanks, a much appreciated insight!