spring-cloud / spring-cloud-netflix

Integration with Netflix OSS components
http://cloud.spring.io/spring-cloud-netflix/
Apache License 2.0

Turbine not aggregating as expected #1734

Closed jwconway closed 7 years ago

jwconway commented 7 years ago

I have a relatively simple Turbine setup: a single app running in a cluster, and a Turbine node polling Eureka for the streams that it needs to aggregate.

I'm running a constant load through my app. When I scale it back to a single instance, Turbine and my load testing tool agree, more or less (screenshots omitted).

When I scale my app to two instances, Turbine reports 2 hosts but seems to halve the throughput metrics (screenshots omitted).

My turbine configuration is as follows:

server.port=8081
management.port=8082

spring.application.name=turbine

eureka.client.serviceUrl.defaultZone=http://[XXXXX]:8761/discovery/eureka/
eureka.client.registerWithEureka=false
eureka.client.fetchRegistry=true

turbine.aggregator.clusterConfig=SERVICE1
turbine.appConfig=service1

endpoints.enabled=false
endpoints.health.enabled=true
endpoints.refresh.enabled=true

Maybe I'm missing some important config, but I've read the documentation and can't see anything I'm missing.

jwconway commented 7 years ago

It seems the Hystrix dashboard calculates the rate per second using the propertyValue_metricsRollingStatisticalWindowInMilliseconds field as the time window (https://github.com/Netflix/Hystrix/blob/master/hystrix-dashboard/src/main/webapp/components/hystrixCommand/hystrixCommand.js#L136). For our service, Turbine is summing this field across instances. Does anyone have any idea why?

spencergibb commented 7 years ago

There was just a conversation about odd turbine results here https://gitter.im/spring-cloud/spring-cloud?at=58b6fb68f1a33b6275676ed9

jwconway commented 7 years ago

@spencergibb thanks. But I'm not using turbine stream. I'm just using plain turbine with eureka to discover the hystrix streams.

My problem was caused by a couple of things. Firstly, Zuul was deployed in containers on ECS. The containers were being registered in Eureka with the same IP address when prefer-ip was set to true. I set this setting to false and gave my services a host name of {containerId}@{hostipaddress}. This way Turbine recognises them as separate instances and is still able to request the stream on the same IP address.
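One way to picture the registration problem is a registry keyed by host name. The sketch below is an illustration only, not Turbine's or Eureka's actual code: `distinctHosts`, the container ids, and the IP address are all hypothetical, and it simply assumes that two registrations with the same key are treated as one host.

```javascript
// Illustration only: a toy registry keyed by host name.
// This is NOT Turbine's or Eureka's implementation; all names are hypothetical.
function distinctHosts(registrations, keyOf) {
  const hosts = new Map();
  for (const r of registrations) {
    // A later registration with the same key overwrites the earlier one.
    hosts.set(keyOf(r), r);
  }
  return [...hosts.keys()];
}

// Two containers running on the same ECS host (made-up values).
const registrations = [
  { containerId: "c1", hostIp: "10.0.0.5" },
  { containerId: "c2", hostIp: "10.0.0.5" },
];

// Keyed by IP address only: both containers collapse into one entry.
const byIp = distinctHosts(registrations, (r) => r.hostIp);

// Keyed by {containerId}@{hostipaddress}: each container stays distinct.
const byContainerAndIp = distinctHosts(
  registrations,
  (r) => `${r.containerId}@${r.hostIp}`
);

console.log(byIp.length, byContainerAndIp.length);
```

Keying on the container id plus the host IP keeps the entries apart while the IP portion is still available for reaching the stream.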

Secondly, there was the issue mentioned above. I solved this by intercepting the request and serving my own hystrixCommand.js, which doesn't use the aggregated propertyValue_metricsRollingStatisticalWindowInMilliseconds property but a hard-coded value that is known and kept in sync with Zuul.

A bit "hacky" but job done

ryanjbaxter commented 7 years ago

So the solution ended up being a change in hystrixCommand.js?

jwconway commented 7 years ago

@ryanjbaxter So Turbine is summing the propertyValue_metricsRollingStatisticalWindowInMilliseconds property.

This causes the routine that calculates the rate to produce the wrong result - https://github.com/Netflix/Hystrix/blob/master/hystrix-dashboard/src/main/webapp/components/hystrixCommand/hystrixCommand.js#L137

var numberSeconds = data["propertyValue_metricsRollingStatisticalWindowInMilliseconds"] / 1000;

When Turbine is aggregating 2 hystrix.stream endpoints, the value for propertyValue_metricsRollingStatisticalWindowInMilliseconds is 20000, so the number of seconds calculated is 20, which is incorrect; it should be 10.

Our local fix is to change the above line to

var numberSeconds = (data["propertyValue_metricsRollingStatisticalWindowInMilliseconds"] / data["reportingHosts"]) / 1000;
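The effect of the summed window can be reproduced with a small stand-alone sketch. It is not the dashboard's code: the `aggregate` function only mimics the summing described in this thread, the `requestCount` field and the 10 second window are assumptions, and the request numbers are made up.

```javascript
// Stand-alone sketch; not the actual hystrixCommand.js or Turbine code.
const WINDOW_KEY = "propertyValue_metricsRollingStatisticalWindowInMilliseconds";
const WINDOW_MS = 10000; // assumed rolling window per instance

// Mimics the behaviour described in the thread: numeric fields are summed
// across instances, including the window length itself.
function aggregate(streams) {
  const data = { requestCount: 0, reportingHosts: streams.length };
  data[WINDOW_KEY] = 0;
  for (const s of streams) {
    data.requestCount += s.requestCount;
    data[WINDOW_KEY] += s[WINDOW_KEY];
  }
  return data;
}

// Rate using the summed window as-is (the original line).
function ratePerSecondOriginal(data) {
  const numberSeconds = data[WINDOW_KEY] / 1000;
  return data.requestCount / numberSeconds;
}

// Rate using the local fix: divide the summed window by reportingHosts first.
function ratePerSecondPatched(data) {
  const numberSeconds = (data[WINDOW_KEY] / data.reportingHosts) / 1000;
  return data.requestCount / numberSeconds;
}

// Same total load (1000 requests per window) on one instance, then split over two.
const oneInstance = aggregate([{ requestCount: 1000, [WINDOW_KEY]: WINDOW_MS }]);
const twoInstances = aggregate([
  { requestCount: 500, [WINDOW_KEY]: WINDOW_MS },
  { requestCount: 500, [WINDOW_KEY]: WINDOW_MS },
]);

console.log(ratePerSecondOriginal(oneInstance));  // 100
console.log(ratePerSecondOriginal(twoInstances)); // 50, halved although the load is unchanged
console.log(ratePerSecondPatched(twoInstances));  // 100
```

Dividing by reportingHosts only recovers the right window if every instance uses the same window length, which is the case when they all run the same configuration.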

To be able to make this change, we routed Hystrix dashboard requests via Zuul and used a static response filter to serve our amended hystrixCommand.js.

It looks like this issue was fixed in the newer branch of Turbine (https://github.com/Netflix/Turbine/blob/2.x/turbine-core/src/main/java/com/netflix/turbine/aggregator/StreamAggregator.java#L160).

ryanjbaxter commented 7 years ago

Turbine isn't actively being maintained anymore, so the chances of getting that change into the 1.x branch are slim. I am going to close this issue for now since it wasn't a problem with Spring Cloud.

jwconway commented 7 years ago

I'm not sure I agree that it isn't a problem with Spring Cloud. It's unexpected behavior caused by Netflix Turbine, but it manifests itself as unexpected behavior in Spring Cloud Turbine.

Maybe some documentation around the limitations of running Turbine over clustered apps, as in my case, would be useful.

ryanjbaxter commented 7 years ago

Are you referring to registering the services with a unique host name?

jwconway commented 7 years ago

That, and the fact that Turbine sums propertyValue_metricsRollingStatisticalWindowInMilliseconds, which makes the calculated rates too low by a factor of 1/(number of instances).

ryanjbaxter commented 7 years ago

Can you elaborate as to why 2 different instances were using the same IP address?

jwconway commented 7 years ago

We're running our services in containers on an ECS cluster.

ryanjbaxter commented 7 years ago

OK, I am not familiar with ECS clusters, but I suppose we can add a note to the Turbine docs describing the situation. Would you be up for submitting a PR?

jwconway commented 7 years ago

For hystrixCommand.js? Or the documentation?

ryanjbaxter commented 7 years ago

The documentation. I don't think we should change hystrixCommand.js.

jwconway commented 7 years ago

No problem.