spring-cloud / spring-cloud-gateway

An API Gateway built on Spring Framework and Spring Boot providing routing and more.
http://cloud.spring.io
Apache License 2.0
4.55k stars 3.33k forks source link

HTTP Pipelining failed #815

Closed lusche closed 3 years ago

lusche commented 5 years ago

Hello,

we try to integrate spring cloud gateway and encounter a strange behavior. A more or less bare / fresh project with a simple route configuration stops sometimes working (the particular Pipeline) wenn Spring cloud gateway pipelines requests to a downstream service (we could reproduce this behavior with https req. only).

Spring Version: 2.1.2.RELEASE Spring cloud: Greenwich.RELEASE

The Service "pom.xml":

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <parent>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-parent</artifactId>
        <version>2.1.2.RELEASE</version>
        <relativePath/> <!-- lookup parent from repository -->
    </parent>
    <groupId>com.example</groupId>
    <artifactId>charon</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <name>charon</name>
    <description>Demo project for Spring Boot</description>

    <properties>
        <java.version>11</java.version>
        <spring-cloud.version>Greenwich.RELEASE</spring-cloud.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-actuator</artifactId>
        </dependency>
        <dependency>
            <groupId>org.springframework.cloud</groupId>
            <artifactId>spring-cloud-starter-gateway</artifactId>
        </dependency>

        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-devtools</artifactId>
            <scope>runtime</scope>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-test</artifactId>
            <scope>test</scope>
        </dependency>
    </dependencies>

    <dependencyManagement>
        <dependencies>
            <dependency>
                <groupId>org.springframework.cloud</groupId>
                <artifactId>spring-cloud-dependencies</artifactId>
                <version>${spring-cloud.version}</version>
                <type>pom</type>
                <scope>import</scope>
            </dependency>
        </dependencies>
    </dependencyManagement>

    <build>
        <plugins>
            <plugin>
                <groupId>org.springframework.boot</groupId>
                <artifactId>spring-boot-maven-plugin</artifactId>
            </plugin>
        </plugins>
    </build>

    <repositories>
        <repository>
            <id>spring-milestones</id>
            <name>Spring Milestones</name>
            <url>https://repo.spring.io/milestone</url>
        </repository>
    </repositories>

</project>

The configuration:

spring:
  cloud:
    gateway:
      httpclient:
        ssl:
          useInsecureTrustManager: true
      routes:
        - id: example
          uri: https://127.0.0.1:31008
          predicates:
            - Path=/baa/**
          filters:
            - RewritePath=/baa/(?<segment>.*), /$\{segment}

The configured service is a spring boot service (https) with a simple hello world interface. But it dosen't matter i tried different https secured pages with the same result.

When i use the following cmd to get a page from the downstream service sometimes it stops.

 curl \
 http://127.0.0.1:8080/baa/hello \
 http://127.0.0.1:8080/baa/hello \
 http://127.0.0.1:8080/baa/hello \
 http://127.0.0.1:8080/baa/hello \
 http://127.0.0.1:8080/baa/hello \
 http://127.0.0.1:8080/baa/hello \
 http://127.0.0.1:8080/baa/hello \
 http://127.0.0.1:8080/baa/hello \
 http://127.0.0.1:8080/baa/hello \
 http://127.0.0.1:8080/baa/hello \
 http://127.0.0.1:8080/baa/hello \
 http://127.0.0.1:8080/baa/hello \
 http://127.0.0.1:8080/baa/hello \
 http://127.0.0.1:8080/baa/hello \
 http://127.0.0.1:8080/baa/hello \
 http://127.0.0.1:8080/baa/hello \
 http://127.0.0.1:8080/baa/hello \
 http://127.0.0.1:8080/baa/hello \
 http://127.0.0.1:8080/baa/hello \
 http://127.0.0.1:8080/baa/hello \
 http://127.0.0.1:8080/baa/hello \
 http://127.0.0.1:8080/baa/hello \
 http://127.0.0.1:8080/baa/hello \
 http://127.0.0.1:8080/baa/hello \
 http://127.0.0.1:8080/baa/hello \ 

cmd line looks like:

bash-3.2$ bash X
HelloHelloHelloHelloHelloHelloHelloHelloHelloHelloHelloHelloHelloHelloHelloHelloHelloHelloHelloHelloHelloHelloHelloHelloHello
bash-3.2$ bash X
HelloHelloHelloHelloHelloHelloHelloHello

we try to trace this behavior with a tcpdump. since this ist encrypted traffic we can't decompile it. But it looks like the service sends it's answer but the getway thinks there is a chunk missing.

spencergibb commented 5 years ago

I'm not sure I understand the results. Can you eliminate somethings? java 11 and curl?

reactor-netty has some issues with curl. @smaldini or @violetagg?

lusche commented 5 years ago

I made a quick check with java 8 this morning. No difference. We also test this in different environments with very similar results.

Env I: Mac OS Mojave all service listen on localhost

Env II: Oracle Enterprise Linux 7.X services running on different Servers.

I try to use ab (apache benchmark) and i can recreate the result.

Result first run:

ab -c 1 -n 25 http://127.0.0.1:8080/baa/hello                                                        
This is ApacheBench, Version 2.3 <$Revision: 1826891 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 127.0.0.1 (be patient).....done

Server Software:
Server Hostname:        127.0.0.1
Server Port:            8080

Document Path:          /baa/hello
Document Length:        5 bytes

Concurrency Level:      1
Time taken for tests:   0.258 seconds
Complete requests:      25
Failed requests:        0
Total transferred:      9650 bytes
HTML transferred:       125 bytes
Requests per second:    97.03 [#/sec] (mean)
Time per request:       10.306 [ms] (mean)
Time per request:       10.306 [ms] (mean, across all concurrent requests)
Transfer rate:          36.58 [Kbytes/sec] received

Result second run:

 ab -c 1 -n 25 http://127.0.0.1:8080/baa/hello                                                      
This is ApacheBench, Version 2.3 <$Revision: 1826891 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 127.0.0.1 (be patient)...apr_pollset_poll: The timeout specified has expired (70007)
Total of 10 requests completed
lusche commented 5 years ago

i tried gatling to reproduce this problem. i can't reproduce it locally but in my second Environment (OEL and 2 servers) it looks like i run in the same issue.

import scala.concurrent.duration._

import io.gatling.core.Predef._
import io.gatling.http.Predef._
import io.gatling.jdbc.Predef._

class RecordedSimulation extends Simulation {

    val httpProtocol = http
        .baseURL("http://127.0.0.1:8080")
        .inferHtmlResources()
        .acceptHeader("text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8")
        .acceptEncodingHeader("gzip, deflate")
        .acceptLanguageHeader("de,en-US;q=0.7,en;q=0.3")
        .userAgentHeader("Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:55.0) Gecko/20100101 Firefox/55.0")

    val headers_0 = Map("Upgrade-Insecure-Requests" -> "1")

    val scn = scenario("RecordedSimulation")
        .exec(http("request_0")
            .get("/baa/hello")
            .headers(headers_0))

    setUp(scn.inject(
        constantUsersPerSec(1) during(10 seconds),
     nothingFor(1 seconds),
        constantUsersPerSec(1) during(10 seconds),
     nothingFor(3 seconds),
        constantUsersPerSec(1) during(10 seconds),
     nothingFor(5 seconds),
        constantUsersPerSec(1) during(10 seconds),
     nothingFor(7 seconds),
        constantUsersPerSec(1) during(10 seconds),
     nothingFor(9 seconds),
        constantUsersPerSec(1) during(10 seconds),
     nothingFor(11 seconds),
        constantUsersPerSec(1) during(10 seconds)
    )).protocols(httpProtocol)
}
spencergibb commented 5 years ago

@violetagg does this ring a bell?

violetagg commented 5 years ago

@lusche I'm not able to reproduce the described behaviour. I receive always all responses.

lusche commented 5 years ago

@violetagg what can i do to enable you to reproduce this?

violetagg commented 5 years ago

@lusche if you can provide tcpdump?

lusche commented 5 years ago

@violetagg sure, attached you find two tcpdumps. the "short" one i stopped after a missing response. the long one i let tcpdum an the req. running. You can see that in this one the server runs in an timeout. closes the connection and statrs a new one. We removes some certificate information from the dump and changed the IP's. Hope this helps. If we can provide more information, details or support let me know.

long.pcap.zip short.pcap.zip

heanssgen-troy commented 5 years ago

@violetagg I can confirm we are seeing the same thing in a production environment. For the most part the gateway works just fine, but sometimes it either leaks a connection or stops processing events entirely.

We usually find that this occurs when gateway is under moderate load (a few requests with very large payloads or many requests with small payloads). Our first instinct was to double check that we were not running into out-of-memory issues on the server side, but even when throwing half a terabyte of RAM at it (with a 200mb payload) no dice.

The strange part is that this affects the entire netty reactor. The whole webserver holds up until the issue with gateway has timed out, springing back to life like nothing happened. We then took a look at our connection counts using:

sudo netstat -tn | grep CLOSE_WAIT

but there are no connections alive. What we do see however, is that connections are left open permanently, even when the web browser has been closed.

sudo netstat -tnp | grep 10.40.1.35
tcp        0      0 172.16.164.87:22        10.40.1.35:34940        ESTABLISHED 14068/sshd: 
tcp6       0      0 172.16.164.87:443       10.40.1.35:37210        ESTABLISHED 13593/java      
tcp6       0      0 172.16.164.87:443       10.40.1.35:37220        ESTABLISHED 13593/java

These connections disappear roughly 1 minute after the entire PC has been shut down. We tested it this way because we were unsure if it was an OS-level issue holding sockets open.

Note that this is not a netty issue in and of itself (though it may involve netty). We can confirm this by doing the same set of tests but without spring gateway. Client connections are closed the moment we leave the page.

We can try and provide a TCPDUMP, but due to the nature of this being a production server in use by a customer, this may prove difficult to convince them of.

lusche commented 5 years ago

@heanssgen-troy we used https://scapy.net to remove "confidential" information from our dumps. Maybe this is an option for you if the dump is required.

mubinahmed commented 5 years ago

Any update on this issue? I seem to be running into same issue. The gateway just hangs, and doesn't do anything. No information in logs either.

JWThewes commented 4 years ago

I think we‘re seeing the same problem. Spring Cloud Gateway keeps Sockets open in ESTABLISHED status. This can‘t be reproduced for every request. There must be something strange going on. Anything I can provide to help you track this down?

spencergibb commented 4 years ago

From gitter @JWThewes is using boot 2.3.3 and cloud Hoxton.SR8

ping @violetagg

JWThewes commented 4 years ago

I‘ve investigated this a bit more this morning. There muss be something strange going on based on some scheduling or external event.

I looked at /prod//fd and their creation date. New sockets are always created at 23:28 on this machine. On 2020-09-13 47 new sockets have been created at 23:28. On other machines it looks the same BUT doesn‘t occure on the same point in time.

Here‘s the fd listing: lrwx------. 1 <user> <group> 64 Sep 9 23:28 99 -> socket:[49404989] lrwx------. 1 <user> <group> 64 Sep 9 23:28 98 -> socket:[49125290] lrwx------. 1 <user> <group> 64 Sep 9 23:28 97 -> socket:[49127009] lrwx------. 1 <user> <group> 64 Sep 9 23:28 96 -> socket:[48677058] lrwx------. 1 <user> <group> 64 Sep 9 23:28 83 -> socket:[48676284] lrwx------. 1 <user> <group> 64 Sep 9 23:28 82 -> socket:[49916204] lrwx------. 1 <user> <group> 64 Sep 9 23:28 118 -> socket:[49258823] lrwx------. 1 <user> <group> 64 Sep 9 23:28 115 -> socket:[48846247] lrwx------. 1 <user> <group> 64 Sep 9 23:28 113 -> socket:[48846033] lrwx------. 1 <user> <group> 64 Sep 9 23:28 112 -> socket:[48676313] lrwx------. 1 <user> <group> 64 Sep 9 23:28 111 -> socket:[49405280] lrwx------. 1 <user> <group> 64 Sep 9 23:28 110 -> socket:[48718889] lrwx------. 1 <user> <group> 64 Sep 9 23:28 107 -> socket:[49125391] lrwx------. 1 <user> <group> 64 Sep 9 23:28 106 -> socket:[49404640] lrwx------. 1 <user> <group> 64 Sep 9 23:28 105 -> socket:[49404633] lrwx------. 1 <user> <group> 64 Sep 9 23:28 103 -> socket:[48909561] lrwx------. 1 <user> <group> 64 Sep 9 23:28 101 -> socket:[49403756] lrwx------. 1 <user> <group> 64 Sep 9 23:28 100 -> socket:[48844613] lrwx------. 1 <user> <group> 64 Sep 10 23:28 116 -> socket:[49126065] lrwx------. 1 <user> <group> 64 Sep 10 23:28 108 -> socket:[49829618] lrwx------. 1 <user> <group> 64 Sep 10 23:28 104 -> socket:[49404993] lrwx------. 1 <user> <group> 64 Sep 10 23:28 102 -> socket:[49683344] lrwx------. 1 <user> <group> 64 Sep 11 23:28 130 -> socket:[49724513] lrwx------. 1 <user> <group> 64 Sep 11 23:28 123 -> socket:[49302332] lrwx------. 1 <user> <group> 64 Sep 11 23:28 122 -> socket:[49303623] lrwx------. 1 <user> <group> 64 Sep 11 23:28 120 -> socket:[49220198] lrwx------. 1 <user> <group> 64 Sep 11 23:28 119 -> socket:[49683164] lrwx------. 1 <user> <group> 64 Sep 11 23:28 114 -> socket:[49257332] lrwx------. 1 <user> <group> 64 Sep 11 23:28 109 -> socket:[49220194] lrwx------. 1 <user> <group> 64 Sep 12 23:28 131 -> socket:[49652091] lrwx------. 1 <user> <group> 64 Sep 12 23:28 129 -> socket:[49683171] lrwx------. 1 <user> <group> 64 Sep 12 23:28 128 -> socket:[49701174] lrwx------. 1 <user> <group> 64 Sep 12 23:28 127 -> socket:[49651139] lrwx------. 1 <user> <group> 64 Sep 12 23:28 126 -> socket:[49717647] lrwx------. 1 <user> <group> 64 Sep 12 23:28 125 -> socket:[49591854] lrwx------. 1 <user> <group> 64 Sep 12 23:28 124 -> socket:[49683338] lrwx------. 1 <user> <group> 64 Sep 12 23:28 121 -> socket:[49603347] lrwx------. 1 <user> <group> 64 Sep 12 23:28 117 -> socket:[49580377] lrwx------. 1 <user> <group> 64 Sep 13 23:28 180 -> socket:[50033918] lrwx------. 1 <user> <group> 64 Sep 13 23:28 179 -> socket:[50048067] lrwx------. 1 <user> <group> 64 Sep 13 23:28 178 -> socket:[49931997] lrwx------. 1 <user> <group> 64 Sep 13 23:28 177 -> socket:[50046129] lrwx------. 1 <user> <group> 64 Sep 13 23:28 176 -> socket:[49970964] lrwx------. 1 <user> <group> 64 Sep 13 23:28 175 -> socket:[50046013] lrwx------. 1 <user> <group> 64 Sep 13 23:28 174 -> socket:[49972486] lrwx------. 1 <user> <group> 64 Sep 13 23:28 173 -> socket:[50019974] lrwx------. 1 <user> <group> 64 Sep 13 23:28 172 -> socket:[49979536] lrwx------. 1 <user> <group> 64 Sep 13 23:28 171 -> socket:[50066274] lrwx------. 1 <user> <group> 64 Sep 13 23:28 170 -> socket:[50074144] lrwx------. 1 <user> <group> 64 Sep 13 23:28 169 -> socket:[49975727] lrwx------. 1 <user> <group> 64 Sep 13 23:28 168 -> socket:[50016708] lrwx------. 1 <user> <group> 64 Sep 13 23:28 167 -> socket:[50046191] lrwx------. 1 <user> <group> 64 Sep 13 23:28 165 -> socket:[50074074] lrwx------. 1 <user> <group> 64 Sep 13 23:28 164 -> socket:[49961726] lrwx------. 1 <user> <group> 64 Sep 13 23:28 163 -> socket:[49962441] lrwx------. 1 <user> <group> 64 Sep 13 23:28 162 -> socket:[49878500] lrwx------. 1 <user> <group> 64 Sep 13 23:28 161 -> socket:[49878412] lrwx------. 1 <user> <group> 64 Sep 13 23:28 160 -> socket:[49856328] lrwx------. 1 <user> <group> 64 Sep 13 23:28 159 -> socket:[49889807] lrwx------. 1 <user> <group> 64 Sep 13 23:28 158 -> socket:[49876274] lrwx------. 1 <user> <group> 64 Sep 13 23:28 157 -> socket:[49856324] lrwx------. 1 <user> <group> 64 Sep 13 23:28 156 -> socket:[49892074] lrwx------. 1 <user> <group> 64 Sep 13 23:28 155 -> socket:[49857014] lrwx------. 1 <user> <group> 64 Sep 13 23:28 154 -> socket:[49962585] lrwx------. 1 <user> <group> 64 Sep 13 23:28 153 -> socket:[49853596] lrwx------. 1 <user> <group> 64 Sep 13 23:28 152 -> socket:[49962651] lrwx------. 1 <user> <group> 64 Sep 13 23:28 151 -> socket:[49876354] lrwx------. 1 <user> <group> 64 Sep 13 23:28 150 -> socket:[49824735] lrwx------. 1 <user> <group> 64 Sep 13 23:28 149 -> socket:[49762955] lrwx------. 1 <user> <group> 64 Sep 13 23:28 148 -> socket:[49844786] lrwx------. 1 <user> <group> 64 Sep 13 23:28 147 -> socket:[49874843] lrwx------. 1 <user> <group> 64 Sep 13 23:28 146 -> socket:[49853988] lrwx------. 1 <user> <group> 64 Sep 13 23:28 145 -> socket:[49825617] lrwx------. 1 <user> <group> 64 Sep 13 23:28 144 -> socket:[49786037] lrwx------. 1 <user> <group> 64 Sep 13 23:28 143 -> socket:[49826096] lrwx------. 1 <user> <group> 64 Sep 13 23:28 142 -> socket:[49686044] lrwx------. 1 <user> <group> 64 Sep 13 23:28 141 -> socket:[49774893] lrwx------. 1 <user> <group> 64 Sep 13 23:28 140 -> socket:[49735518] lrwx------. 1 <user> <group> 64 Sep 13 23:28 139 -> socket:[49758272] lrwx------. 1 <user> <group> 64 Sep 13 23:28 138 -> socket:[49686011] lrwx------. 1 <user> <group> 64 Sep 13 23:28 137 -> socket:[49755226] lrwx------. 1 <user> <group> 64 Sep 13 23:28 136 -> socket:[49743994] lrwx------. 1 <user> <group> 64 Sep 13 23:28 135 -> socket:[49683386] lrwx------. 1 <user> <group> 64 Sep 13 23:28 134 -> socket:[49685992] lrwx------. 1 <user> <group> 64 Sep 13 23:28 133 -> socket:[49727272] lrwx------. 1 <user> <group> 64 Sep 13 23:28 132 -> socket:[49755862]

JWThewes commented 4 years ago

Is there anything I can provide to track this down?

JWThewes commented 4 years ago

This seems to be related to the Downstream service returning a non 200 statuscode. I‘ll try to create a reproducer for this. This is getting really urgent for us...

violetagg commented 4 years ago

This seems to be related to the Downstream service returning a non 200 statuscode. I‘ll try to create a reproducer for this. This is getting really urgent for us...

@JWThewes If you are able to create a reproducible example it will help a lot

spring-projects-issues commented 4 years ago

If you would like us to look at this issue, please provide the requested information. If the information is not provided within the next 7 days this issue will be closed.

JWThewes commented 4 years ago

I think it's way too easy to close this one. This is a serious problem I'm working on a reproducer but it takes some time.

spencergibb commented 4 years ago

@JWThewes don't worry, take your time. Let us know when you have something.

zhaowei12345 commented 3 years ago

Any update on this issue? I seem to be running into same issue. The gateway just hangs, and doesn't do anything. No information in logs either.

spencergibb commented 3 years ago

Closing due to lack of requested feedback. If you would like us to look at this issue, please provide the requested information and we will re-open the issue.