motiv-labs / janus

An API Gateway written in Go
https://hellofresh.gitbooks.io/janus
MIT License

Problem with Janus admin port sporadically not responding on clean start #474

Open brentgriffin opened 3 years ago

brentgriffin commented 3 years ago

Running Janus with basic auth and Cassandra as the persistence mechanism, our automated, scripted deployment sporadically comes up in a bad state: connections to the admin port are accepted, but they block until the client times out (no response is ever sent to the client). Requests through the API gateway port appear to work properly.

When the system comes up in this state, it never recovers. The only way I can get it working is to undeploy Janus and redeploy it.

Not having the admin port available prevents the loading of basic user credentials.

Frequency: No hard numbers, but I estimate it fails roughly once every five to six deployments.

Possible cause: The logs show a timeout when accessing Cassandra, and Janus does not appear to ever retry the Cassandra request.

Janus log when in bad state:

➜ kubectl logs janus-deployment-6bfccd676-v7qgd -c janus
time="2021-04-08T14:09:44Z" level=info msg="Janus starting..." version=dev-9fa15f6
[StatsGo] 2021/04/08 14:09:44 Stats counter incremented metric=app.init.janus-deployment-6bfccd676-v7qgd.janus
[StatsGo] 2021/04/08 14:09:44 Stats counter incremented metric=total.app
[StatsGo] 2021/04/08 14:09:52 Stats counter incremented metric=error-log.error.-.-
[StatsGo] 2021/04/08 14:09:52 Stats counter incremented metric=total.error-log
{"level":"error","msg":"error getting all definitions: gocql: no response received from cassandra within timeout period","time":"2021-04-08T14:09:52Z"}

Janus log when the admin port works correctly:

➜ kubectl logs janus-deployment-6bfccd676-qssw9 -c janus
time="2021-04-08T14:57:24Z" level=info msg="Janus starting..." version=dev-9fa15f6
[StatsGo] 2021/04/08 14:57:24 Stats counter incremented metric=app.init.janus-deployment-6bfccd676-qssw9.janus
[StatsGo] 2021/04/08 14:57:24 Stats counter incremented metric=total.app

jtesser commented 3 years ago

Yeah, I was thinking retry logic on Cassandra, @tuxranger.

brentgriffin commented 3 years ago

for whatever reason, this has happened to me three times already today :-(

tuxranger commented 3 years ago

In janus.toml, could you change the logging level from info to debug? I don't believe the info level shows warning messages, which is what the retry logic messages are. I would like to double-check whether the retry logic is even running.