perfsonar / esmond

A measurement archive for collecting time series metrics from networks.

change the way min and max are reported for stat_aggregations #32

Open jdugan1024 opened 10 years ago

jdugan1024 commented 10 years ago

Currently min and max return the min or max byte delta of the underlying frequency. This is not particularly useful and is fairly confusing. Instead we should return the rate of the min or max.

Just to clarify, say you have 30 second base rate data that gets rolled up into 5 minute aggregates. The min and max should return the min or max 30 second rate rather than the min or max 30 second byte delta. This is just a simple division but it's hard to know what the divisor should be on the client side.

There are two suggested approaches:

  1. Store the rate directly at the time the min and max are updated.
  2. Look up the underlying rate from the SQL database and use that to do the division at query time (see the sketch below).
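
To make the divisor problem concrete, here is a minimal sketch of the second approach. get_base_frequency() is a hypothetical helper standing in for a lookup against the SQL metadata that records each series' base polling frequency; the real esmond models and endpoints may differ.

def get_base_frequency(device, iface):
    # Hypothetical: in the real system this would come from the relational
    # metadata that knows the series' polling frequency, not a constant.
    return 30  # seconds

def delta_to_rate(byte_delta, base_freq):
    # The stored min/max is a byte delta over one base interval, so the
    # corresponding rate is just the delta divided by the base frequency.
    return byte_delta / float(base_freq)

# Example: a 5 minute aggregate whose max is the largest 30 second byte
# delta in the bin; 150000 bytes over 30 seconds is 5000 bytes/sec.
freq = get_base_frequency("some-device", "some-interface")
print(delta_to_rate(150000, freq))   # 5000.0

Approach 1 avoids the lookup entirely by storing the already-divided rate at the moment the min and max are updated.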

This was an oversight on my part but we need to rectify it.

@mitchell-es we should discuss this at Friday's meeting as it may impact the migration of our old data.

Reported by @antonycourtney.

mitchell-es commented 10 years ago


I don’t consider the data currently on the new systems to be final and will re-migrate the data anyway. The min and max should absolutely be in the same units as the rate itself in the database. Anything else is asking for confusion down the road, I think.

-David


jdugan1024 commented 10 years ago

After trying to use the max values by dividing them by 30 (for the 300 second maxes), I discovered that the max and min computations are wrong. The script below shows this empirically. I believe the problem is that we are using one of the fractional pieces of a bin to update the min and max, which should mean that the stored max is smaller than the actual max, but it appears that isn't always the case.

Anyway, the code that calculates the min and max needs more thought, as does its presentation in the API.

import requests

# Fetch the 300 second "max" aggregates for a two hour window from the API.
agg = requests.get("http://snmp-east1.es.net/v1/device/sunn-cr5/interface/7@2F1@2F1/in/?format=json&begin=1409723260&end=1409730460&agg=300&cf=max")

# Convert the stored aggregate max values by dividing by 30 (the base
# frequency in seconds) so they can be compared against the base data.
agg = [ x['val']/30.0 for x in agg.json()['data'] ]

# Fetch the corresponding 30 second base data for the same window.
base = requests.get("http://snmp-east1.es.net/v1/device/sunn-cr5/interface/7@2F1@2F1/in/?format=json&begin=1409723260&end=1409730460")
base = [ x['val'] for x in base.json()['data'] ]

# Sanity check on the number of points returned.
print len(agg), len(base)

# For each 300 second aggregate, take the max of the ten corresponding
# 30 second base values and compare it with the (converted) stored max.
bmax = []
for i in range(len(agg)):
    j = (i*10)
    k = j+10
    bmax.append(max(base[j:k]))
    print "==", j, k, "=" * 80
    print "%20.2f %20.2f %20.2f %20.2f" % (agg[i], bmax[i], agg[i]-bmax[i],
            agg[i]/bmax[i])
    print "A ", agg[i]
    for x in sorted(base[j:k], reverse=True):
        print "B ", x

jdugan1024 commented 10 years ago

Also, we need to add tests to ensure that min and max are working properly.
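
Something along these lines could be a starting point. It is a minimal sketch using a stand-in aggregate() helper and plain unittest rather than esmond's actual persister and test fixtures, so the names and bin handling are assumptions.

import unittest

BASE_FREQ = 30    # seconds between base data points (assumed)
AGG_FREQ = 300    # aggregation bin size in seconds (assumed)

def aggregate(base_deltas, points_per_bin):
    # Stand-in aggregator: roll base byte deltas up into (sum, min, max) bins.
    bins = []
    for i in range(0, len(base_deltas), points_per_bin):
        chunk = base_deltas[i:i + points_per_bin]
        bins.append({'sum': sum(chunk), 'min': min(chunk), 'max': max(chunk)})
    return bins

class MinMaxAggregationTest(unittest.TestCase):
    def test_min_max_match_base_data(self):
        # Ten 30 second byte deltas -> one 300 second bin.
        base = [100, 250, 90, 400, 120, 80, 300, 150, 60, 200]
        bins = aggregate(base, AGG_FREQ // BASE_FREQ)
        self.assertEqual(len(bins), 1)
        self.assertEqual(bins[0]['max'], 400)
        self.assertEqual(bins[0]['min'], 60)
        self.assertEqual(bins[0]['sum'], sum(base))

if __name__ == '__main__':
    unittest.main()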

mitchell-es commented 10 years ago

CassandraPollPersister.aggregate_base_rate() returns delta_v which is the total delta, not one of the values from fit_to_bins(). It's that returned delta_v value which gets sent to generate_aggregations(). That would explain why the values found in min/max don't match up to the base rate values. I'm still not sure what the best way to fix the issue is.
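
A toy illustration of that mismatch; this fit_to_bins() is a simplified stand-in that just splits a delta proportionally across the bins its measurement interval overlaps, not the real esmond routine.

BIN_SIZE = 30  # seconds

def fit_to_bins(t_start, t_end, delta_v, bin_size=BIN_SIZE):
    # Split delta_v proportionally across the bins [t_start, t_end) overlaps.
    # Returns {bin_start: partial_delta}.
    total = float(t_end - t_start)
    bins = {}
    t = t_start
    while t < t_end:
        bin_start = t - (t % bin_size)
        bin_end = bin_start + bin_size
        overlap = min(t_end, bin_end) - t
        bins[bin_start] = bins.get(bin_start, 0.0) + delta_v * (overlap / total)
        t = bin_end
    return bins

# A poll interval that straddles a bin boundary: t=15 to t=45 with a
# total delta of 3000 bytes.
pieces = fit_to_bins(15, 45, 3000.0)
print(pieces)   # two bins of 1500.0 bytes each

In this toy case the base rate columns receive the two 1500 byte pieces, while the min/max aggregations are updated with the full 3000 byte delta_v, which is why the stored max can end up larger than any value actually present in the base rate data.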

jdugan1024 commented 10 years ago

Huh, OK. I am very surprised by that but glad to have the explanation.


mitchell-es commented 10 years ago

In looking at this code I wonder if it makes sense to unify the way that rates and min/max are processed and stored. They are currently handled more as special cases, but they are all similar in that they are derived values of the raw data. Rather than trying to fill the derived buckets up as we go, the code could just look at the timestamp of the last raw data value and determine whether it is time to generate a derived value. The code would need access to the last N values of raw_data in order to calculate the value (where N is however many data points are needed to calculate the bucket), but I think that's pretty much going to be required to maintain the min and max anyway. Treating all of the derived functions the same way should make the code easier to understand, with fewer special cases.

It would also make it easier to add new derived functions in the future (maybe we want to accumulate/integrate instead of taking a derivative), and a little easier to tune which derived values are kept for which OIDs. I can imagine, for example, tracking min and max for things like CPU load or temperature, where a rate doesn't make sense.
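
A rough sketch of that model. The names (DerivedBucketBuilder, on_raw_value) are invented for illustration and the real persister would differ, but it shows the idea of buffering raw points and emitting all derived values for a bin together once a raw timestamp crosses the bin boundary.

class DerivedBucketBuilder(object):
    # Illustrative only: buffer raw byte deltas and emit one derived bucket
    # per aggregation window when a raw timestamp moves past that window.

    def __init__(self, bin_size):
        self.bin_size = bin_size   # e.g. 300 seconds
        self.current_bin = None    # start timestamp of the bin being filled
        self.deltas = []           # raw deltas seen so far in the current bin

    def on_raw_value(self, ts, delta):
        # Feed one raw point; returns a finished bucket dict or None.
        bin_start = ts - (ts % self.bin_size)
        finished = None
        if self.current_bin is not None and bin_start != self.current_bin:
            # The new point is past the old bin, so that bin is complete and
            # all derived values can be computed together and written once.
            finished = {
                'ts': self.current_bin,
                'sum': sum(self.deltas),
                'min': min(self.deltas),
                'max': max(self.deltas),
            }
            self.deltas = []
        self.current_bin = bin_start
        self.deltas.append(delta)
        return finished

# Example: 30 second deltas feeding 300 second buckets.
builder = DerivedBucketBuilder(300)
for ts, delta in [(0, 100), (30, 250), (60, 90), (300, 400)]:
    bucket = builder.on_raw_value(ts, delta)
    if bucket is not None:
        print(bucket)   # bucket for ts=0: sum=440, min=90, max=250

A real implementation would also have to deal with missing or late polls, but deriving everything in one place is what removes the special cases.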

mitchell-es commented 9 years ago

There's a related issue with the current calculation of rates. They take advantage of Cassandra's 'counter' operation to incrementally update the rate bin, but in production I'm seeing occasional timeouts with these operations. One of the nuances of counters is that they can't be easily retried; Pycassa will not retry them by default. I could have our code catch those exceptions and retry them, but that may or may not result in double counting, depending on how far the data made it into Cassandra before the exception was raised. Problems with counters are discussed here: https://issues.apache.org/jira/browse/CASSANDRA-4775

I think the counter issue argues for the above model of having the derived buckets calculated once all the needed raw_data points are available and then written with a single idempotent insert.
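
To make the contrast concrete, a schematic sketch. FakeColumnFamily is an in-memory stand-in for a pycassa-style column family and the row keys are invented; it only shows why a retried counter add can double count while re-inserting a fully computed bucket cannot.

class FakeColumnFamily(object):
    # In-memory stand-in for a pycassa-style column family, illustration only.
    def __init__(self):
        self.rows = {}

    def add(self, key, column, value):
        # Counter-style increment: replaying a timed-out call applies the
        # delta a second time.
        row = self.rows.setdefault(key, {})
        row[column] = row.get(column, 0) + value

    def insert(self, key, columns):
        # Plain insert: writing the same finished values again has no extra
        # effect, so a retry after a timeout is safe.
        self.rows.setdefault(key, {}).update(columns)

cf = FakeColumnFamily()

# Counter path: an ambiguous timeout plus a retry double counts the delta.
cf.add('agg:300:counter', 'sum', 3000)
cf.add('agg:300:counter', 'sum', 3000)    # accidental retry -> 6000

# Idempotent path: the bucket was computed from all the raw points, so it
# can be re-written as many times as needed.
bucket = {'sum': 3000, 'min': 60, 'max': 400}
cf.insert('agg:300:bucket', bucket)
cf.insert('agg:300:bucket', bucket)       # retry is harmless

print(cf.rows)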