mesos / chronos

Fault tolerant job scheduler for Mesos which handles dependencies and ISO8601 based schedules
http://mesos.github.io/chronos/
Apache License 2.0

50th, 75th, 95th and 99th percentile timing issue #664

Open jie-qin opened 8 years ago

jie-qin commented 8 years ago

There's no way the 50th, 75th, 95th and 99th percentile timings of ~1k job executions are exactly the same. The actual run times vary a lot.

I searched around and couldn't find any relevant information. Most likely I am doing something wrong. Can someone point me in the right direction or share some insights?

Much appreciated.

image

mwilbz commented 6 years ago

Old issue, but if anyone looks at this: it's probably due to Dropwizard's MetricRegistry.histogram() method, which is used in the JobMetrics class. By default, Dropwizard Metrics backs a histogram with a reservoir that exponentially decays old data with a "factor of 0.015, which heavily biases the reservoir to the past 5 minutes of measurements." (See ExponentiallyDecayingReservoir.java.) We have a fork of Chronos and we're changing the relevant line to:

registry.register(MetricRegistry.name("jobs", "run", name, jobName), new Histogram(new SlidingWindowReservoir(100)))
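
To illustrate why a strong recency bias could make every percentile report the same number, here is a minimal, self-contained sketch. It is not Dropwizard's implementation: the class `DecaySketch`, its helper methods, and the sample run times are all hypothetical, and it assumes a job that runs once an hour and a weight of exp(0.015 × seconds) per sample. Under those assumptions the newest sample carries almost all of the weight, so any weighted quantile lands on it.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Illustrative sketch only; this is not Dropwizard's implementation. */
final class DecaySketch {

    /** Weight of a sample taken secondsSinceStart after the first one (assumed decay model). */
    static double weight(double alpha, double secondsSinceStart) {
        return Math.exp(alpha * secondsSinceStart);
    }

    /**
     * Smallest value whose cumulative share of the total weight reaches q.
     * Assumes a non-empty input and 0 < q <= 1.
     */
    static double weightedQuantile(double[] values, double[] weights, double q) {
        // Sort sample indices by value so weights stay paired with their values.
        Integer[] order = new Integer[values.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> values[i]));

        double total = 0.0;
        for (double w : weights) {
            total += w;
        }

        double cumulative = 0.0;
        for (int idx : order) {
            cumulative += weights[idx];
            if (cumulative / total >= q) {
                return values[idx];
            }
        }
        // Guard against rounding: fall back to the largest value.
        return values[order[order.length - 1]];
    }

    public static void main(String[] args) {
        double[] durationsMs = {120, 340, 95, 410, 200};       // hypothetical run times
        double[] takenAtSec = {0, 3600, 7200, 10800, 14400};   // one run per hour (assumption)

        double[] decayed = new double[durationsMs.length];
        double[] uniform = new double[durationsMs.length];
        for (int i = 0; i < durationsMs.length; i++) {
            decayed[i] = weight(0.015, takenAtSec[i]);
            uniform[i] = 1.0; // every retained sample counts equally
        }

        for (double q : new double[] {0.50, 0.75, 0.95, 0.99}) {
            System.out.printf("q=%.2f decayed=%.0f uniform=%.0f%n",
                    q,
                    weightedQuantile(durationsMs, decayed, q),
                    weightedQuantile(durationsMs, uniform, q));
        }
    }
}
```

With these made-up numbers, the decayed weights yield 200 (the most recent run) for all four quantiles, while equal weights yield 200, 340, 410 and 410. A fixed-size window such as SlidingWindowReservoir(100) keeps the last 100 measurements regardless of how old they are, which is closer to the equal-weight case.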