lusche closed this issue 3 years ago
I'm not sure I understand the results. Can you eliminate some things, e.g. Java 11 and curl?
reactor-netty has some issues with curl. @smaldini or @violetagg?
I made a quick check with Java 8 this morning; no difference. We also tested this in different environments with very similar results.
Env I: macOS Mojave, all services listening on localhost
Env II: Oracle Enterprise Linux 7.x, services running on different servers
I tried ab (Apache Benchmark) and I can recreate the result.
Result first run:
ab -c 1 -n 25 http://127.0.0.1:8080/baa/hello
This is ApacheBench, Version 2.3 <$Revision: 1826891 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking 127.0.0.1 (be patient).....done
Server Software:
Server Hostname: 127.0.0.1
Server Port: 8080
Document Path: /baa/hello
Document Length: 5 bytes
Concurrency Level: 1
Time taken for tests: 0.258 seconds
Complete requests: 25
Failed requests: 0
Total transferred: 9650 bytes
HTML transferred: 125 bytes
Requests per second: 97.03 [#/sec] (mean)
Time per request: 10.306 [ms] (mean)
Time per request: 10.306 [ms] (mean, across all concurrent requests)
Transfer rate: 36.58 [Kbytes/sec] received
Result second run:
ab -c 1 -n 25 http://127.0.0.1:8080/baa/hello
This is ApacheBench, Version 2.3 <$Revision: 1826891 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking 127.0.0.1 (be patient)...apr_pollset_poll: The timeout specified has expired (70007)
Total of 10 requests completed
I tried Gatling to reproduce this problem. I can't reproduce it locally, but in my second environment (OEL and two servers) it looks like I run into the same issue.
import scala.concurrent.duration._
import io.gatling.core.Predef._
import io.gatling.http.Predef._
import io.gatling.jdbc.Predef._
class RecordedSimulation extends Simulation {
  val httpProtocol = http
    .baseURL("http://127.0.0.1:8080")
    .inferHtmlResources()
    .acceptHeader("text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8")
    .acceptEncodingHeader("gzip, deflate")
    .acceptLanguageHeader("de,en-US;q=0.7,en;q=0.3")
    .userAgentHeader("Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:55.0) Gecko/20100101 Firefox/55.0")

  val headers_0 = Map("Upgrade-Insecure-Requests" -> "1")

  val scn = scenario("RecordedSimulation")
    .exec(http("request_0")
      .get("/baa/hello")
      .headers(headers_0))

  setUp(scn.inject(
    constantUsersPerSec(1) during(10 seconds),
    nothingFor(1 seconds),
    constantUsersPerSec(1) during(10 seconds),
    nothingFor(3 seconds),
    constantUsersPerSec(1) during(10 seconds),
    nothingFor(5 seconds),
    constantUsersPerSec(1) during(10 seconds),
    nothingFor(7 seconds),
    constantUsersPerSec(1) during(10 seconds),
    nothingFor(9 seconds),
    constantUsersPerSec(1) during(10 seconds),
    nothingFor(11 seconds),
    constantUsersPerSec(1) during(10 seconds)
  )).protocols(httpProtocol)
}
@violetagg does this ring a bell?
@lusche I'm not able to reproduce the described behaviour. I always receive all responses.
@violetagg What can I do to enable you to reproduce this?
@lusche Can you provide a tcpdump?
@violetagg Sure, attached you'll find two tcpdumps. The "short" one I stopped after a missing response; for the long one I left tcpdump and the requests running. You can see that in this one the server runs into a timeout, closes the connection and starts a new one. We removed some certificate information from the dump and changed the IPs. Hope this helps. If we can provide more information, details or support, let me know.
@violetagg I can confirm we are seeing the same thing in a production environment. For the most part the gateway works just fine, but sometimes it either leaks a connection or stops processing events entirely.
We usually find that this occurs when the gateway is under moderate load (a few requests with very large payloads or many requests with small payloads). Our first instinct was to double-check that we were not running into out-of-memory issues on the server side, but even when throwing half a terabyte of RAM at it (with a 200 MB payload), no dice.
The strange part is that this affects the entire Netty reactor. The whole web server hangs until the issue with the gateway has timed out, then springs back to life like nothing happened. We then took a look at our connection counts using:
sudo netstat -tn | grep CLOSE_WAIT
but there are no such connections alive. What we do see, however, is that connections are left open permanently, even when the web browser has been closed.
sudo netstat -tnp | grep 10.40.1.35
tcp 0 0 172.16.164.87:22 10.40.1.35:34940 ESTABLISHED 14068/sshd:
tcp6 0 0 172.16.164.87:443 10.40.1.35:37210 ESTABLISHED 13593/java
tcp6 0 0 172.16.164.87:443 10.40.1.35:37220 ESTABLISHED 13593/java
These connections disappear roughly 1 minute after the entire PC has been shut down. We tested it this way because we were unsure if it was an OS-level issue holding sockets open.
Note that this is not a Netty issue in and of itself (though it may involve Netty). We can confirm this by doing the same set of tests but without Spring Gateway: client connections are closed the moment we leave the page.
We can try to provide a tcpdump, but because this is a production server in use by a customer, it may be difficult to convince them.
@heanssgen-troy we used https://scapy.net to remove "confidential" information from our dumps. Maybe this is an option for you if the dump is required.
Any update on this issue? I seem to be running into same issue. The gateway just hangs, and doesn't do anything. No information in logs either.
I think we're seeing the same problem. Spring Cloud Gateway keeps sockets open in ESTABLISHED state. This can't be reproduced for every request. There must be something strange going on. Anything I can provide to help you track this down?
From Gitter: @JWThewes is using Boot 2.3.3 and Cloud Hoxton.SR8
ping @violetagg
I've investigated this a bit more this morning. There must be something strange going on, triggered by some scheduling or external event.
I looked at /proc/.
Here's the fd listing:
lrwx------. 1 <user> <group> 64 Sep 9 23:28 99 -> socket:[49404989]
lrwx------. 1 <user> <group> 64 Sep 9 23:28 98 -> socket:[49125290]
lrwx------. 1 <user> <group> 64 Sep 9 23:28 97 -> socket:[49127009]
lrwx------. 1 <user> <group> 64 Sep 9 23:28 96 -> socket:[48677058]
lrwx------. 1 <user> <group> 64 Sep 9 23:28 83 -> socket:[48676284]
lrwx------. 1 <user> <group> 64 Sep 9 23:28 82 -> socket:[49916204]
lrwx------. 1 <user> <group> 64 Sep 9 23:28 118 -> socket:[49258823]
lrwx------. 1 <user> <group> 64 Sep 9 23:28 115 -> socket:[48846247]
lrwx------. 1 <user> <group> 64 Sep 9 23:28 113 -> socket:[48846033]
lrwx------. 1 <user> <group> 64 Sep 9 23:28 112 -> socket:[48676313]
lrwx------. 1 <user> <group> 64 Sep 9 23:28 111 -> socket:[49405280]
lrwx------. 1 <user> <group> 64 Sep 9 23:28 110 -> socket:[48718889]
lrwx------. 1 <user> <group> 64 Sep 9 23:28 107 -> socket:[49125391]
lrwx------. 1 <user> <group> 64 Sep 9 23:28 106 -> socket:[49404640]
lrwx------. 1 <user> <group> 64 Sep 9 23:28 105 -> socket:[49404633]
lrwx------. 1 <user> <group> 64 Sep 9 23:28 103 -> socket:[48909561]
lrwx------. 1 <user> <group> 64 Sep 9 23:28 101 -> socket:[49403756]
lrwx------. 1 <user> <group> 64 Sep 9 23:28 100 -> socket:[48844613]
lrwx------. 1 <user> <group> 64 Sep 10 23:28 116 -> socket:[49126065]
lrwx------. 1 <user> <group> 64 Sep 10 23:28 108 -> socket:[49829618]
lrwx------. 1 <user> <group> 64 Sep 10 23:28 104 -> socket:[49404993]
lrwx------. 1 <user> <group> 64 Sep 10 23:28 102 -> socket:[49683344]
lrwx------. 1 <user> <group> 64 Sep 11 23:28 130 -> socket:[49724513]
lrwx------. 1 <user> <group> 64 Sep 11 23:28 123 -> socket:[49302332]
lrwx------. 1 <user> <group> 64 Sep 11 23:28 122 -> socket:[49303623]
lrwx------. 1 <user> <group> 64 Sep 11 23:28 120 -> socket:[49220198]
lrwx------. 1 <user> <group> 64 Sep 11 23:28 119 -> socket:[49683164]
lrwx------. 1 <user> <group> 64 Sep 11 23:28 114 -> socket:[49257332]
lrwx------. 1 <user> <group> 64 Sep 11 23:28 109 -> socket:[49220194]
lrwx------. 1 <user> <group> 64 Sep 12 23:28 131 -> socket:[49652091]
lrwx------. 1 <user> <group> 64 Sep 12 23:28 129 -> socket:[49683171]
lrwx------. 1 <user> <group> 64 Sep 12 23:28 128 -> socket:[49701174]
lrwx------. 1 <user> <group> 64 Sep 12 23:28 127 -> socket:[49651139]
lrwx------. 1 <user> <group> 64 Sep 12 23:28 126 -> socket:[49717647]
lrwx------. 1 <user> <group> 64 Sep 12 23:28 125 -> socket:[49591854]
lrwx------. 1 <user> <group> 64 Sep 12 23:28 124 -> socket:[49683338]
lrwx------. 1 <user> <group> 64 Sep 12 23:28 121 -> socket:[49603347]
lrwx------. 1 <user> <group> 64 Sep 12 23:28 117 -> socket:[49580377]
lrwx------. 1 <user> <group> 64 Sep 13 23:28 180 -> socket:[50033918]
lrwx------. 1 <user> <group> 64 Sep 13 23:28 179 -> socket:[50048067]
lrwx------. 1 <user> <group> 64 Sep 13 23:28 178 -> socket:[49931997]
lrwx------. 1 <user> <group> 64 Sep 13 23:28 177 -> socket:[50046129]
lrwx------. 1 <user> <group> 64 Sep 13 23:28 176 -> socket:[49970964]
lrwx------. 1 <user> <group> 64 Sep 13 23:28 175 -> socket:[50046013]
lrwx------. 1 <user> <group> 64 Sep 13 23:28 174 -> socket:[49972486]
lrwx------. 1 <user> <group> 64 Sep 13 23:28 173 -> socket:[50019974]
lrwx------. 1 <user> <group> 64 Sep 13 23:28 172 -> socket:[49979536]
lrwx------. 1 <user> <group> 64 Sep 13 23:28 171 -> socket:[50066274]
lrwx------. 1 <user> <group> 64 Sep 13 23:28 170 -> socket:[50074144]
lrwx------. 1 <user> <group> 64 Sep 13 23:28 169 -> socket:[49975727]
lrwx------. 1 <user> <group> 64 Sep 13 23:28 168 -> socket:[50016708]
lrwx------. 1 <user> <group> 64 Sep 13 23:28 167 -> socket:[50046191]
lrwx------. 1 <user> <group> 64 Sep 13 23:28 165 -> socket:[50074074]
lrwx------. 1 <user> <group> 64 Sep 13 23:28 164 -> socket:[49961726]
lrwx------. 1 <user> <group> 64 Sep 13 23:28 163 -> socket:[49962441]
lrwx------. 1 <user> <group> 64 Sep 13 23:28 162 -> socket:[49878500]
lrwx------. 1 <user> <group> 64 Sep 13 23:28 161 -> socket:[49878412]
lrwx------. 1 <user> <group> 64 Sep 13 23:28 160 -> socket:[49856328]
lrwx------. 1 <user> <group> 64 Sep 13 23:28 159 -> socket:[49889807]
lrwx------. 1 <user> <group> 64 Sep 13 23:28 158 -> socket:[49876274]
lrwx------. 1 <user> <group> 64 Sep 13 23:28 157 -> socket:[49856324]
lrwx------. 1 <user> <group> 64 Sep 13 23:28 156 -> socket:[49892074]
lrwx------. 1 <user> <group> 64 Sep 13 23:28 155 -> socket:[49857014]
lrwx------. 1 <user> <group> 64 Sep 13 23:28 154 -> socket:[49962585]
lrwx------. 1 <user> <group> 64 Sep 13 23:28 153 -> socket:[49853596]
lrwx------. 1 <user> <group> 64 Sep 13 23:28 152 -> socket:[49962651]
lrwx------. 1 <user> <group> 64 Sep 13 23:28 151 -> socket:[49876354]
lrwx------. 1 <user> <group> 64 Sep 13 23:28 150 -> socket:[49824735]
lrwx------. 1 <user> <group> 64 Sep 13 23:28 149 -> socket:[49762955]
lrwx------. 1 <user> <group> 64 Sep 13 23:28 148 -> socket:[49844786]
lrwx------. 1 <user> <group> 64 Sep 13 23:28 147 -> socket:[49874843]
lrwx------. 1 <user> <group> 64 Sep 13 23:28 146 -> socket:[49853988]
lrwx------. 1 <user> <group> 64 Sep 13 23:28 145 -> socket:[49825617]
lrwx------. 1 <user> <group> 64 Sep 13 23:28 144 -> socket:[49786037]
lrwx------. 1 <user> <group> 64 Sep 13 23:28 143 -> socket:[49826096]
lrwx------. 1 <user> <group> 64 Sep 13 23:28 142 -> socket:[49686044]
lrwx------. 1 <user> <group> 64 Sep 13 23:28 141 -> socket:[49774893]
lrwx------. 1 <user> <group> 64 Sep 13 23:28 140 -> socket:[49735518]
lrwx------. 1 <user> <group> 64 Sep 13 23:28 139 -> socket:[49758272]
lrwx------. 1 <user> <group> 64 Sep 13 23:28 138 -> socket:[49686011]
lrwx------. 1 <user> <group> 64 Sep 13 23:28 137 -> socket:[49755226]
lrwx------. 1 <user> <group> 64 Sep 13 23:28 136 -> socket:[49743994]
lrwx------. 1 <user> <group> 64 Sep 13 23:28 135 -> socket:[49683386]
lrwx------. 1 <user> <group> 64 Sep 13 23:28 134 -> socket:[49685992]
lrwx------. 1 <user> <group> 64 Sep 13 23:28 133 -> socket:[49727272]
lrwx------. 1 <user> <group> 64 Sep 13 23:28 132 -> socket:[49755862]
Is there anything I can provide to track this down?
This seems to be related to the downstream service returning a non-200 status code. I'll try to create a reproducer for this. This is getting really urgent for us...
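A rough sketch of the downstream side of such a reproducer, assuming a plain Spring Boot controller that always answers with a non-200 status behind the gateway route; the class name, path, and status code are illustrative placeholders:

```java
import org.springframework.http.HttpStatus;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class FailingDownstreamController {

    // Always respond with 503 so the gateway sees a non-200 answer on every call.
    @GetMapping("/baa/hello")
    public ResponseEntity<String> hello() {
        return ResponseEntity.status(HttpStatus.SERVICE_UNAVAILABLE).body("hello");
    }
}
```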
@JWThewes If you are able to create a reproducible example, it will help a lot.
If you would like us to look at this issue, please provide the requested information. If the information is not provided within the next 7 days this issue will be closed.
I think it's way too easy to close this one. This is a serious problem. I'm working on a reproducer, but it takes some time.
@JWThewes don't worry, take your time. Let us know when you have something.
Any update on this issue? I seem to be running into same issue. The gateway just hangs, and doesn't do anything. No information in logs either.
Closing due to lack of requested feedback. If you would like us to look at this issue, please provide the requested information and we will re-open the issue.
Hello,
We are trying to integrate Spring Cloud Gateway and encounter a strange behavior. A more or less bare/fresh project with a simple route configuration sometimes stops working (the particular pipeline) when Spring Cloud Gateway pipes requests to a downstream service (we could reproduce this behavior with https requests only).
Spring version: 2.1.2.RELEASE, Spring Cloud: Greenwich.RELEASE
The Service "pom.xml":
The configuration:
The configured service is a Spring Boot service (https) with a simple hello-world endpoint. But it doesn't matter; I tried different https-secured pages with the same result.
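As a sketch of what such a simple route configuration looks like in Java (the route id, path, and downstream URI here are illustrative placeholders, not our actual values):

```java
import org.springframework.cloud.gateway.route.RouteLocator;
import org.springframework.cloud.gateway.route.builder.RouteLocatorBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class GatewayRoutes {

    // Single route that forwards /baa/** to an https downstream service.
    @Bean
    public RouteLocator routes(RouteLocatorBuilder builder) {
        return builder.routes()
                .route("baa", r -> r.path("/baa/**")
                        .uri("https://downstream.example.com:8443"))
                .build();
    }
}
```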
When I use the following cmd to get a page from the downstream service, it sometimes stops.
The cmd line looks like:
We tried to trace this behavior with a tcpdump. Since this is encrypted traffic we can't decrypt it, but it looks like the service sends its answer while the gateway thinks there is a chunk missing.
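One thing that might help see the exchange from the gateway's side in clear text, instead of relying on the encrypted tcpdump, is reactor-netty's wiretap on the outbound HttpClient. This is only a sketch and assumes a gateway version that exposes the HttpClientCustomizer hook; the wiretap output shows up in the application logs when the reactor.netty logger is set to DEBUG:

```java
import org.springframework.cloud.gateway.config.HttpClientCustomizer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class WiretapConfig {

    // Turn on reactor-netty's wiretap for the HttpClient the gateway uses to
    // call the downstream service, so request/response frames are logged in
    // clear text instead of only being visible as encrypted packets on the wire.
    @Bean
    public HttpClientCustomizer wiretapCustomizer() {
        return httpClient -> httpClient.wiretap(true);
    }
}
```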