motiv-labs / janus

An API Gateway written in Go
https://hellofresh.gitbooks.io/janus
MIT License

Is there a way to increase the request TTL for an API definition? Got http: proxy error: context canceled #320

Closed jzyinq closed 6 years ago

jzyinq commented 6 years ago

Is there a way to increase the request TTL?

Something like upstream_[connect|send|read]_timeout?

Reproduction Steps:

  1. Make a request through Janus that will last longer than 50 seconds.

Expected behavior:

The request is handled properly in its own time.

Observed behavior:

The request is terminated with a Janus error:

http: proxy error: context canceled

Janus version: 3.7.0-rc.14 Docker image

jzyinq commented 6 years ago

@italolelis could you take a look at this?

italolelis commented 6 years ago

Hi @jzyinq,

Thanks for reporting this. At the moment we don't have this configuration per proxy, but we have a task to add a plugin that lets you configure the timeout for each endpoint. We should have this functionality soon.

jzyinq commented 6 years ago

Thanks for the reply @italolelis,

Sorry to bother you, but how soon is soon? ;)

Could you guide me on how to set this globally for now?

rafaeljesus commented 6 years ago

@jzyinq In fact, we are going to add this plugin to decrease the current timeout from 60s to at most 30s. Only in very exceptional cases should a request take that long; if the downstream service does not respond quickly, the proxy should just kill the connection. What's your use case?

jzyinq commented 6 years ago

@rafaeljesus We have some requests that stream very large payloads of data, which currently take about 60-120 seconds. So in our case the response starts quickly, but reading it takes more time, and it looks like Janus waits for the whole payload before letting it out.

Do you have any thoughts on how we can tackle this problem?

italolelis commented 6 years ago

@jzyinq I see. Janus should have the capability to let you specify a timeout per endpoint. I'll try to implement this asap.

jzyinq commented 6 years ago

Great! Looking forward to hearing about this feature soon. Thanks for your insight @italolelis.

quba commented 6 years ago

@italolelis is there any ETA? We have a pending feature that is blocked by this issue, and before the next product release we have to decide whether we go with Janus or Kong. That's currently the only blocker for us. Thanks! :)

italolelis commented 6 years ago

@quba I will open a PR tonight with this feature. There is one failing integration test that I hope to fix today or tomorrow.

quba commented 6 years ago

Wow, that's fast. 👍

italolelis commented 6 years ago

@quba The new version with the timeouts is out, please check if it meets your needs.

quba commented 6 years ago

Great! We should be able to provide some feedback within the next 2 weeks.

glothriel commented 6 years ago

Hi @italolelis, it looks like the unmarshaler has some problems with the way the settings are formatted. After setting {"forwarding_timeouts": {"dial_timeout": "30s", "response_header_timeout": "30s"}}, I receive the following error in the console:

[RPC] --> Couldn't unmarshal api configuration  error="json: cannot unmarshal string into Go struct field ForwardingTimeouts.dial_timeout of type time.Duration"

After changing those values to integers, Janus starts but responds to all requests with 504, and the logs show

http: proxy error: dial tcp: i/o timeout

vgarvardt commented 6 years ago

@glothriel thx for reporting! I'll take a look at this issue

vgarvardt commented 6 years ago

@glothriel The issue should be fixed in https://github.com/hellofresh/janus/pull/327, please check. Integer values (actually int64) are interpreted as nanoseconds, so 30 means 30ns, which is far too short to process any request. I fixed unmarshalling of time.Duration from strings.
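
For context on both symptoms, here is a minimal sketch (illustrative only, not the actual janus code) of why a bare integer behaves as nanoseconds and how a wrapper type can accept a string like "30s":

    package main

    import (
        "encoding/json"
        "fmt"
        "time"
    )

    // Duration wraps time.Duration so it can be decoded from a JSON string
    // such as "30s". Illustrative only; not the exact type used in the PR.
    type Duration time.Duration

    func (d *Duration) UnmarshalJSON(b []byte) error {
        // Try a quoted string first, e.g. "30s".
        var s string
        if err := json.Unmarshal(b, &s); err == nil {
            parsed, err := time.ParseDuration(s)
            if err != nil {
                return err
            }
            *d = Duration(parsed)
            return nil
        }
        // Fall back to a bare number, interpreted as nanoseconds,
        // exactly like time.Duration's underlying int64.
        var n int64
        if err := json.Unmarshal(b, &n); err != nil {
            return err
        }
        *d = Duration(n)
        return nil
    }

    func main() {
        // A bare 30 is 30 nanoseconds -- far too short for any request.
        fmt.Println(time.Duration(30)) // 30ns

        // The wrapper type accepts the human-readable form.
        var d Duration
        _ = json.Unmarshal([]byte(`"30s"`), &d)
        fmt.Println(time.Duration(d)) // 30s
    }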

glothriel commented 6 years ago

@vgarvardt Thanks for the quick fix, and sorry for the delay.

My configuration looks like this:

    ...
    "proxy": {
        "forwarding_timeouts": {
            "dial_timeout": "30s",
            "response_header_timeout": "30s"
        }
    }
    ...

Unless I made a configuration error, the feature does not work. Janus starts, but the setting above has no effect on the application. Despite the 30s timeout being set, it allows much longer requests (the longest I made was 3 minutes; I did not test longer ones). It also looks like some global timeout was removed, because even after removing the whole forwarding_timeouts block from the configuration, I'm still able to perform those long 3-minute requests. The version we previously used was 3.7.0 (there we got timeouts after 50s), so maybe something changed with the global timeout in the meantime.

glothriel commented 6 years ago

@vgarvardt Can you respond to my previous comment, please?

vgarvardt commented 6 years ago

@glothriel I'll try to check and fix this during the next week.

vgarvardt commented 6 years ago

@glothriel do you use file-based or Mongo-based proxy storage? During the investigation I found an issue with Duration unmarshalling from BSON, fixed in https://github.com/hellofresh/janus/pull/334.

glothriel commented 6 years ago

We use file storage.

vgarvardt commented 6 years ago

I created a simple endpoint:

    r.GET("/sleep", func(w http.ResponseWriter, r *http.Request) {
        seconds, _ := strconv.Atoi(r.URL.Query().Get("seconds"))
        log.WithField("seconds", seconds).Info("Starting sleep handler")
        time.Sleep(time.Duration(seconds) * time.Second)
        w.Write([]byte("Hello sleep\n"))
    })
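
For anyone reproducing this locally, a self-contained version of the sleep upstream above, using only the standard library instead of the router in the snippet (the port is an arbitrary choice), might look like this:

    package main

    import (
        "log"
        "net/http"
        "strconv"
        "time"
    )

    func main() {
        // Test upstream: sleeps for ?seconds= before answering, so the
        // gateway's forwarding timeouts can be exercised against it.
        http.HandleFunc("/sleep", func(w http.ResponseWriter, r *http.Request) {
            seconds, _ := strconv.Atoi(r.URL.Query().Get("seconds"))
            log.Printf("Starting sleep handler, seconds=%d", seconds)
            time.Sleep(time.Duration(seconds) * time.Second)
            w.Write([]byte("Hello sleep\n"))
        })

        log.Fatal(http.ListenAndServe(":8080", nil))
    }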

I then set up a proxy to it with the following parameters:

{
    "proxy": {
        "listen_path": "/sleep",
        "forwarding_timeouts": {
            "dial_timeout": "5s",
            "response_header_timeout": "17s"
        }
    }
}

I then called the endpoint with curl (Janus runs on port 3000 in my local dev environment):

$ time curl -v localhost:3000/sleep?seconds=30
*   Trying ::1...
* TCP_NODELAY set
* Connected to localhost (::1) port 3000 (#0)
> GET /sleep?seconds=30 HTTP/1.1
> Host: localhost:3000
> User-Agent: curl/7.54.0
> Accept: */*
>
< HTTP/1.1 502 Bad Gateway
< Vary: Origin
< X-Request-Id: 68679848-4399-492d-84e4-67205ccb11af
< Date: Fri, 13 Jul 2018 13:31:36 GMT
< Content-Length: 0
<
* Connection #0 to host localhost left intact

real    0m17.028s
user    0m0.008s
sys 0m0.006s

In the Janus logs I got the following:

{"level":"debug","method":"GET","msg":"Started request","path":"/sleep","time":"2018-07-13T15:30:41+02:00"}
2018/07/13 15:30:58 http: proxy error: net/http: timeout awaiting response headers
{"code":502,"duration":17007,"duration-fmt":"17.007905357s","host":"localhost:3000","level":"info","method":"GET","msg":"Completed handling request","referer":"","remote-addr":"[::1]:64679","request":"/sleep?seconds=30","request-id":"7a5c8dee-931d-4d99-8f9d-dd71769c4d0c","time":"2018-07-13T15:30:58+02:00","user-agent":"curl/7.54.0"}

That means the transport broke the connection, so the client request finished in 17s with a 502 response code.
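
For reference, these two settings presumably map onto the standard net/http transport knobs (an assumption about the implementation, not something confirmed in this thread): dial_timeout would correspond to the dialer's connect timeout, and response_header_timeout to Transport.ResponseHeaderTimeout, which is the field that produces the "net/http: timeout awaiting response headers" error seen above. A minimal sketch of a reverse proxy wired up that way:

    package main

    import (
        "log"
        "net"
        "net/http"
        "net/http/httputil"
        "net/url"
        "time"
    )

    func main() {
        // Hypothetical upstream address; adjust to your test endpoint.
        target, _ := url.Parse("http://localhost:8080")

        proxy := httputil.NewSingleHostReverseProxy(target)
        proxy.Transport = &http.Transport{
            // dial_timeout: how long to wait for the TCP connection.
            DialContext: (&net.Dialer{Timeout: 5 * time.Second}).DialContext,
            // response_header_timeout: how long to wait for the upstream's
            // response headers before failing with
            // "net/http: timeout awaiting response headers".
            ResponseHeaderTimeout: 17 * time.Second,
        }

        log.Fatal(http.ListenAndServe(":3000", proxy))
    }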

Then I increased the timeouts to the following values:

        "forwarding_timeouts" : {
            "dial_timeout" : "5s",
            "response_header_timeout" : "180s"
        }

curl:

$ time curl -v localhost:3000/sleep?seconds=300
*   Trying ::1...
* TCP_NODELAY set
* Connected to localhost (::1) port 3000 (#0)
> GET /sleep?seconds=300 HTTP/1.1
> Host: localhost:3000
> User-Agent: curl/7.54.0
> Accept: */*
>
< HTTP/1.1 502 Bad Gateway
< Vary: Origin
< X-Request-Id: 2296f1ea-6103-43ff-bb1f-43c14be42dfd
< Date: Fri, 13 Jul 2018 13:47:53 GMT
< Content-Length: 0
<
* Connection #0 to host localhost left intact

real    3m0.028s
user    0m0.011s
sys 0m0.015s

logs:

{"level":"debug","method":"GET","msg":"Started request","path":"/sleep","time":"2018-07-13T15:44:53+02:00"}
2018/07/13 15:47:53 http: proxy error: net/http: timeout awaiting response headers
{"code":502,"duration":180006,"duration-fmt":"3m0.006664055s","host":"localhost:3000","level":"info","method":"GET","msg":"Completed handling request","referer":"","remote-addr":"[::1]:64743","request":"/sleep?seconds=300","request-id":"2296f1ea-6103-43ff-bb1f-43c14be42dfd","time":"2018-07-13T15:47:53+02:00","user-agent":"curl/7.54.0"}

So the timeout feature works for these test cases. Can you provide more details about your test case so I can reproduce it, find the root cause of the issue you're having, and fix it?

glothriel commented 6 years ago

Thanks @vgarvardt, it appears that the behavior I was facing was related to my testing endpoint. We still have an issue with nil pointer dereference panics randomly occurring on startup in cobra, but @jzyinq will follow up on that.

vgarvardt commented 6 years ago

I'm closing the issue then. Feel free to open a new one if you find anything else. Thank you!