rax-maas / blueflood

A distributed system designed to ingest and process time series data
http://www.blueflood.io
Apache License 2.0

about delaying rollup #762

Open 42701618 opened 7 years ago

42701618 commented 7 years ago

Hi team: If the metrics_5m data is delayed during rollup, how can I accurately find the delayed data? Thanks

ChandraAddala commented 7 years ago

There are a couple of ways to know.

1) Using JMX, we can enable tracking, which will print delayed metrics in the logs. Blueflood supports Jolokia. All you have to do is download the Jolokia JVM agent (e.g. jolokia-jvm-1.3.5-agent.jar) and pass these JVM parameters when starting up Blueflood:

-javaagent:<location of jar file>/jolokia/jolokia-jvm-1.3.5-agent.jar=port=9082,host=localhost 

If you hit the URL below, it will enable tracking. You should see the log statement "[TRACKER] Tracking delayed metrics started".

curl -i 'http://localhost:9082/jolokia/exec/com.rackspacecloud.blueflood.tracker:type=Tracker/setIsTrackingDelayedMetrics'

There are several operations you can invoke; check the Tracker class to see the available tracking options. You can also log requests for specific tenants.

A metric is considered delayed if it arrives more than ROLLUP_DELAY_MILLIS after its collection time, but the tracker uses a different config (TRACKER_DELAYED_METRICS_MILLIS) to identify whether a metric is delayed or not. For regular scenarios, set TRACKER_DELAYED_METRICS_MILLIS to the same value as ROLLUP_DELAY_MILLIS, and the delayed metrics should be printed in the logs.
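For intuition, here is a minimal sketch (plain Java, not Blueflood's actual code) of what the delay check amounts to; the threshold and timestamp values below are arbitrary examples:

// Illustrative sketch only; the real check lives in Blueflood's ingest/tracker code.
public class DelayCheckSketch {
    public static void main(String[] args) {
        long rollupDelayMillis = 300_000L;             // example ROLLUP_DELAY_MILLIS value
        long collectionTime = 1_481_016_600_000L;      // timestamp carried by the metric
        long ingestTime = System.currentTimeMillis();  // when Blueflood receives it

        // A metric is "delayed" when it arrives more than the threshold after its collection time.
        // The tracker applies the same comparison with TRACKER_DELAYED_METRICS_MILLIS instead.
        boolean delayed = (ingestTime - collectionTime) > rollupDelayMillis;
        System.out.println("delayed = " + delayed);
    }
}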

2) We recently implemented a feature where we record delayed metrics in a separate column family (metrics_delayed_locator). If you use the latest code with RECORD_DELAYED_METRICS turned on, you should be able to see delayed metrics in that column family.

42701618 commented 7 years ago

Hi ChandraAddala: For method 1, I can call setIsTrackingDelayedMetrics, but I don't know how to add the tenant to be monitored. On my server, the column1 value of one row in metrics_5m has not changed since this Monday, yet Blueflood is running smoothly. I can't find the root cause of this. Thanks.

ChandraAddala commented 7 years ago

You can add a tenant like this.

curl -i 'http://localhost:9082/jolokia/exec/com.rackspacecloud.blueflood.tracker:type=Tracker/addTenant/$tenant_id'

I don't think I understand the issue very clearly. So you have a metric that is not being rolled up at metrics_5m granularity? Did you check whether there is data in metrics_full for that locator during the 5m time period that you believe was not rolled up?

42701618 commented 7 years ago

I double-checked, and there is data in metrics_full for that locator during the 5m time period. For example, in metrics_full the timestamps are Mon Dec 05 2016 17:04:47 GMT+0800 (CST), Mon Dec 05 2016 16:59:47 GMT+0800 (CST), Mon Dec 05 2016 16:53:32 GMT+0800 (CST), Mon Dec 05 2016 16:47:51 GMT+0800 (CST), Mon Dec 05 2016 16:42:40 GMT+0800 (CST), ...

and in metrics_5m, the latest timestamp is still Sat Dec 03 2016 16:35:00 GMT+0800 (CST) as of now.

ChandraAddala commented 7 years ago

Can you check whether the locator is in the metrics_locator column family?

You can compute the shard using Util.computeShard(locator).

Once you have the shard value, you can check the metrics_locator column family for that shard to see if that locator is present.
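If it helps, here is a minimal sketch of the kind of hash-modulo computation involved, assuming 128 shards (0-127); the exact hashing lives in com.rackspacecloud.blueflood.utils.Util, so treat this as an approximation and verify against your checkout rather than relying on it:

// Sketch: bucket a locator string into one of 128 shards (0-127).
// The authoritative hash is Util.computeShard in the Blueflood source; this is an assumption.
public class ShardSketch {
    private static final int NUMBER_OF_SHARDS = 128;

    static int computeShard(String locator) {
        return Math.abs(locator.hashCode() % NUMBER_OF_SHARDS);
    }

    public static void main(String[] args) {
        String locator = "tenantId.your.metric.name";  // hypothetical locator
        System.out.println("shard = " + computeShard(locator));
    }
}

The simplest option, though, is to call Util.computeShard directly from a small class on the Blueflood classpath.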

42701618 commented 7 years ago

I can find it in the metrics_locator column family:

SELECT * FROM metrics_locator WHERE column1='xxxx' ALLOW FILTERING;

 key | column1 | value
-----+---------+-------
 120 | xxxxx   |

Is its shard 120?

ChandraAddala commented 7 years ago

I am not sure what happened, but if you really want to debug, we can submit a delayed metric for that locator with a collection time corresponding to the time slot you think wasn't rolled up. That delayed metric should trigger a re-roll of that slot.
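For example, here is a sketch of re-submitting a point whose collectionTime falls inside the missing slot, assuming the stock HTTP ingestion endpoint (port 19000) and the documented JSON ingest payload; the tenant id, metric name, value, and timestamp below are placeholders you would replace:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Sketch: re-submit a "delayed" data point whose collectionTime falls inside the slot that
// never got rolled up, so the rollup service re-schedules that slot. Assumes the default
// Blueflood HTTP ingestion port (19000); tenant id, metric name, and value are placeholders.
public class ResubmitDelayedMetric {
    public static void main(String[] args) throws Exception {
        String tenantId = "yourTenantId";
        long collectionTime = 1480755600000L;  // Sat Dec 03 2016 17:00:00 GMT+0800, inside the gap
        String body = "[{"
                + "\"collectionTime\": " + collectionTime + ","
                + "\"ttlInSeconds\": 172800,"
                + "\"metricValue\": 42,"
                + "\"metricName\": \"your.metric.name\""
                + "}]";

        URL url = new URL("http://localhost:19000/v2.0/" + tenantId + "/ingest");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("HTTP " + conn.getResponseCode());
    }
}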

42701618 commented 7 years ago

Hi ChandraAddala: Just now I found that it is rolling up again in metrics_5m, starting from Tue Dec 06 2016 05:55:00 GMT+0800 (CST). So it stopped between Sat Dec 03 2016 16:35:00 GMT+0800 (CST) and Tue Dec 06 2016 05:55:00 GMT+0800 (CST).

ChandraAddala commented 7 years ago

Did you happen to restart ingest nodes on Dec 6th? Just wondering what might have changed on that day.

42701618 commented 7 years ago

No. Could the number of metrics be too large for some metrics to get rolled up? There are 6 million metrics per hour on one rollup Blueflood node, and *.com.rackspacecloud.blueflood.service.RollupService.5m-Re-rolling-up-because-of-delayed-metrics.m1_rate is still 0.

42701618 commented 7 years ago

I queried the metric about 4 hours ago:

SELECT key, column1, WRITETIME(value), value FROM metrics_full WHERE key='xxxxxxxxxxxxxxxxxxxxx' ORDER BY column1 DESC LIMIT 10;

 key                   | column1       | writetime(value)  | value
-----------------------+---------------+-------------------+------------------------
 xxxxxxxxxxxxxxxxxxxxx | 1481162758259 | 1481162758296000  | 0x006e41ef8d2100002e40
 xxxxxxxxxxxxxxxxxxxxx | 1481162412048 | 1481162412074000  | 0x006ed07b630800804c40
 xxxxxxxxxxxxxxxxxxxxx | 1481162006830 | 1481162006904000  | 0x006ea1f7c61000003840
 xxxxxxxxxxxxxxxxxxxxx | 1481161593110 | 1481161593190000  | 0x006ed07b630800804e40
 xxxxxxxxxxxxxxxxxxxxx | 1481161183828 | 1481161183842000  | 0x006e41ef8d2100002a40
 xxxxxxxxxxxxxxxxxxxxx | 1481160854002 | 1481160854022000  | 0x006ed07b630800004340
 xxxxxxxxxxxxxxxxxxxxx | 1481160511273 | 1481160511289000  | 0x006ed07b630800804840
 xxxxxxxxxxxxxxxxxxxxx | 1481160108043 | 1481160108053000  | 0x006ed07b630800004f40
 xxxxxxxxxxxxxxxxxxxxx | 1481159774760 | 1481159774776000  | 0x006ea1f7c61000003340
 xxxxxxxxxxxxxxxxxxxxx | 1481159403677 | 1481159403710000  | 0x006ed07b630800804440

(The latest column1, 1481162758259, corresponds to Thu Dec 08 2016 10:05:58 GMT+0800 (CST).)

(10 rows)

SELECT * FROM "DATA".metrics_5m WHERE key='xxxxxxxxxxxxxxxxxxxxx' ORDER BY column1 DESC LIMIT 10;

 key                   | column1       | value
-----------------------+---------------+------------------------------------------------------------------------------------------
 xxxxxxxxxxxxxxxxxxxxx | 1481016600000 | 0x0001006e41ef8d2100002440016e0000000000000000026e41ef8d2100002440036e41ef8d2100002440
 xxxxxxxxxxxxxxxxxxxxx | 1481016300000 | 0x0001006ee8bd310400005740016e0000000000000000026ee8bd310400005740036ee8bd310400005740
 xxxxxxxxxxxxxxxxxxxxx | 1481016000000 | 0x0001006ed07b630800004940016e0000000000000000026ed07b630800004940036ed07b630800004940
 xxxxxxxxxxxxxxxxxxxxx | 1481015700000 | 0x0001006ea1f7c61000003340016e0000000000000000026ea1f7c61000003340036ea1f7c61000003340
 xxxxxxxxxxxxxxxxxxxxx | 1481015400000 | 0x0001006ed07b630800004e40016e0000000000000000026ed07b630800004e40036ed07b630800004e40
 xxxxxxxxxxxxxxxxxxxxx | 1481015100000 | 0x0001006ed07b630800804540016e0000000000000000026ed07b630800804540036ed07b630800804540
 xxxxxxxxxxxxxxxxxxxxx | 1481014800000 | 0x0001006ea1f7c61000003140016e0000000000000000026ea1f7c61000003140036ea1f7c61000003140
 xxxxxxxxxxxxxxxxxxxxx | 1481014200000 | 0x0001006ed07b630800804140016e0000000000000000026ed07b630800804140036ed07b630800804140
 xxxxxxxxxxxxxxxxxxxxx | 1481013900000 | 0x0001006e41ef8d2100002040016e0000000000000000026e41ef8d2100002040036e41ef8d2100002040
 xxxxxxxxxxxxxxxxxxxxx | 1481013600000 | 0x0001006ee8bd310400c05440016e0000000000000000026ee8bd310400c05440036ee8bd310400c05440

(The latest column1, 1481016600000, corresponds to Tue Dec 06 2016 17:30:00 GMT+0800 (CST).)

42701618 commented 7 years ago

I queried the metric again about 10 minutes ago:

SELECT key, column1, WRITETIME(value), value FROM metrics_full WHERE key='xxxxxxxxxxxxxxxxxxxxx' ORDER BY column1 DESC LIMIT 10;

 key                   | column1       | writetime(value)  | value
-----------------------+---------------+-------------------+------------------------
 xxxxxxxxxxxxxxxxxxxxx | 1481176721528 | 1481176721552000  | 0x006ed07b630800804e40
 xxxxxxxxxxxxxxxxxxxxx | 1481176426158 | 1481176426177000  | 0x006ed07b630800804a40
 xxxxxxxxxxxxxxxxxxxxx | 1481176094344 | 1481176094365000  | 0x006ee8bd310400405140
 xxxxxxxxxxxxxxxxxxxxx | 1481175739238 | 1481175739262000  | 0x006ed07b630800004440
 xxxxxxxxxxxxxxxxxxxxx | 1481175348830 | 1481175348847000  | 0x006ed07b630800804740
 xxxxxxxxxxxxxxxxxxxxx | 1481174900145 | 1481174900207000  | 0x006ed07b630800004440
 xxxxxxxxxxxxxxxxxxxxx | 1481174497546 | 1481174497574000  | 0x006ed07b630800004640
 xxxxxxxxxxxxxxxxxxxxx | 1481174164890 | 1481174165041000  | 0x006ed07b630800004e40
 xxxxxxxxxxxxxxxxxxxxx | 1481173777998 | 1481173778028000  | 0x006ed07b630800004a40
 xxxxxxxxxxxxxxxxxxxxx | 1481173344674 | 1481173344694000  | 0x006ed07b630800004f40

(The latest column1, 1481176721528, corresponds to Thu Dec 08 2016 13:58:41 GMT+0800 (CST).)

(10 rows)

SELECT * FROM "DATA".metrics_5m WHERE key='xxxxxxxxxxxxxxxxxxxxx' ORDER BY column1 DESC LIMIT 10;

 key                   | column1       | value
-----------------------+---------------+------------------------------------------------------------------------------------------
 xxxxxxxxxxxxxxxxxxxxx | 1481016600000 | 0x0001006e41ef8d2100002440016e0000000000000000026e41ef8d2100002440036e41ef8d2100002440
 xxxxxxxxxxxxxxxxxxxxx | 1481016300000 | 0x0001006ee8bd310400005740016e0000000000000000026ee8bd310400005740036ee8bd310400005740
 xxxxxxxxxxxxxxxxxxxxx | 1481016000000 | 0x0001006ed07b630800004940016e0000000000000000026ed07b630800004940036ed07b630800004940
 xxxxxxxxxxxxxxxxxxxxx | 1481015700000 | 0x0001006ea1f7c61000003340016e0000000000000000026ea1f7c61000003340036ea1f7c61000003340
 xxxxxxxxxxxxxxxxxxxxx | 1481015400000 | 0x0001006ed07b630800004e40016e0000000000000000026ed07b630800004e40036ed07b630800004e40
 xxxxxxxxxxxxxxxxxxxxx | 1481015100000 | 0x0001006ed07b630800804540016e0000000000000000026ed07b630800804540036ed07b630800804540
 xxxxxxxxxxxxxxxxxxxxx | 1481014800000 | 0x0001006ea1f7c61000003140016e0000000000000000026ea1f7c61000003140036ea1f7c61000003140
 xxxxxxxxxxxxxxxxxxxxx | 1481014200000 | 0x0001006ed07b630800804140016e0000000000000000026ed07b630800804140036ed07b630800804140
 xxxxxxxxxxxxxxxxxxxxx | 1481013900000 | 0x0001006e41ef8d2100002040016e0000000000000000026e41ef8d2100002040036e41ef8d2100002040
 xxxxxxxxxxxxxxxxxxxxx | 1481013600000 | 0x0001006ee8bd310400c05440016e0000000000000000026ee8bd310400c05440036ee8bd310400c05440

(The latest column1 is still 1481016600000, i.e. Tue Dec 06 2016 17:30:00 GMT+0800 (CST).)

42701618 commented 7 years ago

Comparing the two sets of query results, the metric data in metrics_5m has not been rolled up again.

ChandraAddala commented 7 years ago

Do you see any exceptions during the rollup process? What about other granularities like 20m and 60m? Also, which commit id have you deployed, so we can get an idea of how recent your code is?

ChandraAddala commented 7 years ago

We are using the latest Blueflood and rollups are working fine for us. Unless we can narrow down the definition of the issue, it is hard for us to debug.

So far my understanding is that even though you are ingesting metrics, the data is not getting rolled up. We have to figure out whether it happens all the time for all metrics, or whether it is intermittent and only happens with a few of them, so we can narrow the definition.

Can you also share the blueflood configurations you use for rollup nodes?

42701618 commented 7 years ago

There are no exceptions during the rollup process. The version of the code I use is rax-release-v1.0.1956, but I also tried rax-release-v1.0.2981 to roll up the data and the problem is still the same. The Blueflood configuration I use is the default, except that MAX_LOCATOR_FETCH_THREADS=32, MAX_ROLLUP_READ_THREADS=32, and MAX_ROLLUP_WRITE_THREADS=32, which fit my Cassandra cluster.

42701618 commented 7 years ago

Hi ChandraAddala: Last Friday I cleared all of the data in Cassandra, then ingested the data and rolled it up. Today I found that a metric "xxxx" had been rolled up but with no points; the earliest entry for this metric in metrics_5m is Sun Dec 11 2016 14:25:00 GMT+0800 (CST). The following is the log:

[2016-12-16 16:15:19,845 +0800] [DEBUG] [pool-18-thread-5] [com.rackspacecloud.blueflood.service.RollupRunnable:run 110] No points rollup for locator xxxx
[2016-12-16 16:15:20,449 +0800] [DEBUG] [pool-18-thread-17] [com.rackspacecloud.blueflood.service.RollupRunnable:run 110] No points rollup for locator xxxx
[2016-12-16 16:15:21,007 +0800] [DEBUG] [pool-18-thread-20] [com.rackspacecloud.blueflood.service.RollupRunnable:run 110] No points rollup for locator xxxx
[2016-12-16 16:15:21,702 +0800] [DEBUG] [pool-18-thread-31] [com.rackspacecloud.blueflood.service.RollupRunnable:run 110] No points rollup for locator xxxx
[2016-12-16 16:15:22,306 +0800] [DEBUG] [pool-18-thread-11] [com.rackspacecloud.blueflood.service.RollupRunnable:run 110] No points rollup for locator xxxx
[2016-12-16 16:15:22,855 +0800] [DEBUG] [pool-18-thread-14] [com.rackspacecloud.blueflood.service.RollupRunnable:run 110] No points rollup for locator xxxx
[2016-12-16 16:15:23,415 +0800] [DEBUG] [pool-18-thread-18] [com.rackspacecloud.blueflood.service.RollupRunnable:run 110] No points rollup for locator xxxx
[2016-12-16 16:15:23,916 +0800] [DEBUG] [pool-18-thread-2] [com.rackspacecloud.blueflood.service.RollupRunnable:run 110] No points rollup for locator xxxx
[2016-12-16 16:15:24,445 +0800] [DEBUG] [pool-18-thread-23] [com.rackspacecloud.blueflood.service.RollupRunnable:run 110] No points rollup for locator xxxx
[2016-12-16 16:15:24,936 +0800] [DEBUG] [pool-18-thread-21] [com.rackspacecloud.blueflood.service.RollupRunnable:run 110] No points rollup for locator xxxx
[2016-12-16 16:15:25,507 +0800] [DEBUG] [pool-18-thread-26] [com.rackspacecloud.blueflood.service.RollupRunnable:run 110] No points rollup for locator xxxx
[2016-12-16 16:15:26,036 +0800] [DEBUG] [pool-18-thread-14] [com.rackspacecloud.blueflood.service.RollupRunnable:run 110] No points rollup for locator xxxx
[2016-12-16 16:15:26,512 +0800] [DEBUG] [pool-18-thread-31] [com.rackspacecloud.blueflood.service.RollupRunnable:run 110] No points rollup for locator xxxx
[2016-12-16 16:15:27,137 +0800] [DEBUG] [pool-18-thread-15] [com.rackspacecloud.blueflood.service.RollupRunnable:run 110] No points rollup for locator xxxx
[2016-12-16 16:15:27,642 +0800] [DEBUG] [pool-18-thread-7] [com.rackspacecloud.blueflood.service.RollupRunnable:run 110] No points rollup for locator xxxx
[2016-12-16 16:15:28,182 +0800] [DEBUG] [pool-18-thread-30] [com.rackspacecloud.blueflood.service.RollupRunnable:run 110] No points rollup for locator xxxx
[2016-12-16 16:15:28,689 +0800] [DEBUG] [pool-18-thread-26] [com.rackspacecloud.blueflood.service.RollupRunnable:run 110] No points rollup for locator xxxx
[2016-12-16 16:15:29,177 +0800] [DEBUG] [pool-18-thread-23] [com.rackspacecloud.blueflood.service.RollupRunnable:run 110] No points rollup for locator xxxx

I used rax-release-v2.0 to roll up the data, and the problem is still the same. Why is a metric that has no points being rolled up in such a short time?

ChandraAddala commented 7 years ago

"No points rollup for locator xxxx" This just means that when it tried to rollup a locator, that locator does not have any data points in metrics_full for that slot. For 5m granularity, slot is a 5min interval that corresponds to the slot number which was scheduled for rollup.

Why is it being rolled up when it has no data? The way this works is: we configure a rollup node with the shard numbers (0-127) it is supposed to process, and ScheduleContext schedules slots for rollup for those configured shards.

Let's say it schedules slot=0 for shard=5 at 5m granularity (from the metrics_state table). In LocatorFetchRunnable, it grabs all the locators corresponding to shard=5 from the metrics_locator table, and then for each locator it rolls up the data corresponding to slot=0 (which covers 5 minutes, since we are doing 5m granularity). Some of these locators might not have data points, in which case you see the log statement ("No points rollup for locator xxxx"). If there is no incoming data on a particular locator, the locator gets TTL'd out of the metrics_locator table after 7 days and it will stop being rolled up.
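To make that flow concrete, here is a simplified, illustrative sketch; the class and column family names come from the description above, but the helper methods are placeholders, not Blueflood's real implementation:

import java.util.List;

// Simplified sketch of the rollup flow described above. The real logic lives in
// ScheduleContext, LocatorFetchRunnable and RollupRunnable; the helpers below stand in
// for the actual Cassandra reads/writes.
public class RollupFlowSketch {

    static final long FIVE_MIN_MILLIS = 300_000L;

    void runOnce(List<Integer> myShards) {
        for (int shard : myShards) {
            for (int slot : slotsDueForRollup(shard)) {              // from metrics_state
                for (String locator : locatorsForShard(shard)) {     // from metrics_locator
                    long slotStart = slotStartMillis(slot);
                    List<Double> points = readFull(locator, slotStart, slotStart + FIVE_MIN_MILLIS);
                    if (points.isEmpty()) {
                        // This is where "No points rollup for locator xxxx" gets logged.
                        continue;
                    }
                    writeRollup(locator, slotStart, points);          // into metrics_5m
                }
            }
        }
    }

    // Placeholder helpers -- in Blueflood these are Cassandra queries, not stubs.
    List<Integer> slotsDueForRollup(int shard) { return List.of(); }
    List<String> locatorsForShard(int shard) { return List.of(); }
    long slotStartMillis(int slot) { return slot * FIVE_MIN_MILLIS; }
    List<Double> readFull(String locator, long from, long to) { return List.of(); }
    void writeRollup(String locator, long slotStart, List<Double> points) { }
}

This is also why every column1 value in your metrics_5m output is an exact 5-minute boundary (a multiple of 300000 ms, e.g. 1481016600000), which is consistent with rollups being keyed by the start of the slot.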

42701618 commented 7 years ago

@ChandraAddala
I found one possible reason: the value of *.metrics.com.rackspacecloud.blueflood.service.RollupService.Rejected-Slot-Checks.m1_rate is greater than 0 (about 0.02).

ChandraAddala commented 7 years ago

I am not following you. Can you please elaborate?

42701618 commented 7 years ago

The metrics_5m data is delayed during rollup on my Blueflood server, and I found one possible reason: the value of "*.metrics.com.rackspacecloud.blueflood.service.RollupService.Rejected-Slot-Checks.m1_rate" is greater than 0 (about 0.02).

In the code, when this value is greater than 0, the comment at that spot explains what it means: "puts it back at the top of the list of scheduled slots. When this happens it means that there is too much rollup work to do. If the CPU cores are not tapped out, it means you don't have enough threads allocated to processing rollups or slot checks."

ChandraAddala commented 7 years ago

If there are too many slots to roll up (this usually happens when there are a lot of delayed metrics), the slots will be rejected. But there is nothing to worry about; they will be picked up again in the next run.

How many is too many? It's controlled by MAX_LOCATOR_FETCH_THREADS.
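Conceptually (an illustrative sketch only, not Blueflood's actual executor wiring), the behaviour behind Rejected-Slot-Checks looks like a bounded locator-fetch pool whose rejected work goes back to the front of the schedule:

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Conceptual model of the "rejected slot check" behaviour: a pool sized by
// MAX_LOCATOR_FETCH_THREADS rejects new slot checks when saturated, and the rejected slot
// is put back at the top of the schedule so the next run picks it up again.
public class RejectedSlotSketch {
    public static void main(String[] args) {
        Deque<Runnable> scheduledSlots = new ArrayDeque<>();
        int maxLocatorFetchThreads = 2;  // stands in for MAX_LOCATOR_FETCH_THREADS

        ThreadPoolExecutor locatorFetchPool = new ThreadPoolExecutor(
                maxLocatorFetchThreads, maxLocatorFetchThreads,
                0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(maxLocatorFetchThreads),
                // On rejection, a rejected slot check is counted and the slot is re-scheduled.
                (task, pool) -> scheduledSlots.addFirst(task));

        // ... one LocatorFetchRunnable-like task per scheduled slot would be submitted here ...
        locatorFetchPool.shutdown();
    }
}

So a small, steady Rejected-Slot-Checks rate mostly means the rollup node is periodically saturated; per the comment you quoted, raising MAX_LOCATOR_FETCH_THREADS (and the rollup read/write threads) is the relevant knob, provided your CPUs and Cassandra can absorb it.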