Monitoring System Improvements

linearregression / hypertable

Automatically exported from code.google.com/p/hypertable

GNU General Public License v2.0


Monitoring System Improvements #666

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago


Would like to see status of all services, including ThriftBrokers, Masters, and 
Hyperspace replicas.

Also, would like proactive alerts for each and every Hypertable service. If it 
goes down, we should send out an e-mail to the cluster administrators 
indicating what happened.
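
A minimal sketch of the alerting idea, assuming the monitoring system already produces an up/down status per service on each poll. The function name, service labels, and e-mail addresses below are hypothetical and are not part of Hypertable. The sketch only builds the messages; it does not send them.

```python
from email.message import EmailMessage


def build_down_alerts(previous, current, admins, sender="monitor@example.com"):
    """Return one EmailMessage per service that went from up to down.

    previous/current map a service name to True (up) or False (down).
    A service missing from `previous` is treated as having been up.
    """
    alerts = []
    for service, is_up in sorted(current.items()):
        was_up = previous.get(service, True)
        if was_up and not is_up:
            msg = EmailMessage()
            msg["From"] = sender
            msg["To"] = ", ".join(admins)
            msg["Subject"] = f"[hypertable] {service} is down"
            msg.set_content(
                f"Service {service} stopped responding to status checks."
            )
            alerts.append(msg)
    return alerts


# Hypothetical poll results for two consecutive checks.
previous = {"Master": True, "ThriftBroker@host1": True, "Hyperspace@host2": True}
current = {"Master": True, "ThriftBroker@host1": False, "Hyperspace@host2": True}

alerts = build_down_alerts(previous, current, ["admin@example.com"])
for msg in alerts:
    print(msg["Subject"])
```

Comparing against the previous poll means an alert fires once per outage rather than on every check. Each message could then be handed to `smtplib.SMTP(...).send_message(msg)` from the Python standard library.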

Original issue reported on code.google.com by nuggetwh...@gmail.com on 11 Aug 2011 at 6:18

GoogleCodeExporter commented 9 years ago

We should also add a link to the most recent error messages generated by each 
process.

Original comment by nuggetwh...@gmail.com on 11 Aug 2011 at 7:23

GoogleCodeExporter commented 9 years ago

Add the ability to set alerts.  See Amazon CloudWatch.

Original comment by nuggetwh...@gmail.com on 11 Aug 2011 at 10:06

GoogleCodeExporter commented 9 years ago

We should add a column to the range servers table that indicates how many times 
each server entered "low memory" mode in the last period.
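
One way to back such a column is a sliding-window counter, sketched below. The class name and the one-hour period are assumptions for illustration; the real RangeServer would need to report each transition into low memory mode.

```python
from collections import deque


class LowMemoryCounter:
    """Counts low-memory events seen within the last `period_seconds`."""

    def __init__(self, period_seconds):
        self.period = period_seconds
        self.events = deque()  # event timestamps in arrival order

    def record(self, timestamp):
        """Note that the server entered low memory mode at `timestamp`."""
        self.events.append(timestamp)

    def count(self, now):
        """Drop events older than the period, then return how many remain."""
        cutoff = now - self.period
        while self.events and self.events[0] <= cutoff:
            self.events.popleft()
        return len(self.events)


# Hypothetical timestamps in seconds, with a one-hour period.
counter = LowMemoryCounter(period_seconds=3600)
for t in (100, 2000, 3500, 4000):
    counter.record(t)
recent = counter.count(now=4200)
```

With these sample values the event at 100 falls outside the window ending at 4200, so only the later three are counted.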

Original comment by nuggetwh...@gmail.com on 16 Aug 2011 at 5:55

GoogleCodeExporter commented 9 years ago

More requests from Rediff:
- Add per-table average query latency to monitoring graphs
- QPS
- Response Code
- Latency
- Have alert thresholds
- They would like some measure of system utilization.  They want to know how 
much spare capacity they have on each server so they can proactively provision 
and buy more servers if necessary.
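
A rough sketch of how per-table QPS, average latency, and an alert threshold could be computed from recorded query timings. The class, table names, and threshold value are hypothetical; the sketch does not cover response codes or the utilization measure.

```python
from collections import defaultdict


class TableQueryStats:
    """Collects per-table query latencies for one reporting window."""

    def __init__(self, latency_threshold_ms):
        self.latency_threshold_ms = latency_threshold_ms
        self.latencies = defaultdict(list)  # table name -> latency samples

    def record(self, table, latency_ms):
        self.latencies[table].append(latency_ms)

    def summary(self, window_seconds):
        """Return QPS, average latency, and an alert flag for each table."""
        report = {}
        for table, samples in self.latencies.items():
            avg = sum(samples) / len(samples)
            report[table] = {
                "qps": len(samples) / window_seconds,
                "avg_latency_ms": avg,
                "alert": avg > self.latency_threshold_ms,
            }
        return report


# Hypothetical samples gathered over a two-second window.
stats = TableQueryStats(latency_threshold_ms=50)
for table, ms in [("users", 10), ("users", 30), ("events", 80), ("events", 120)]:
    stats.record(table, ms)
report = stats.summary(window_seconds=2)
```

Here `report["events"]["alert"]` is set because the average latency of that table exceeds the configured threshold, while `users` stays below it.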

Original comment by nuggetwh...@gmail.com on 18 Aug 2011 at 10:41

GoogleCodeExporter commented 9 years ago

Original comment by nuggetwh...@gmail.com on 14 Jan 2012 at 8:33

Added labels: Milestone-ReleaseFuture

GoogleCodeExporter commented 9 years ago

It would be good to know which range servers hold the METADATA ranges, in 
particular the ROOT range.

Original comment by nuggetwh...@gmail.com on 14 May 2014 at 5:41

GoogleCodeExporter commented 9 years ago

We should add the ability to click on a table name and see the CREATE TABLE 
statement.

Original comment by nuggetwh...@gmail.com on 16 Jul 2014 at 4:02

GoogleCodeExporter commented 9 years ago

There should be a query log for each RangeServer.  We've encountered situations 
where there is a huge spike in scans/s on one particular range server that 
caused cluster-wide problems.  It would be good if we were able to figure out 
what queries were causing the problems.

Original comment by nuggetwh...@gmail.com on 17 Jul 2014 at 5:36

GoogleCodeExporter commented 9 years ago

It would also be good to know via the monitoring system if a RangeServer is 
stuck, not cleaning up its logs, due to load_acknowledged=false.

Original comment by nuggetwh...@gmail.com on 24 Jul 2014 at 10:43

GoogleCodeExporter commented 9 years ago

Would be nice to know what DFS they're running on top of and, if Hadoop, which 
distro and version.

Original comment by nuggetwh...@gmail.com on 25 Jul 2014 at 5:21

GoogleCodeExporter commented 9 years ago

Just a thought here.  It seems like it would be useful if there was a button 
that dumped the current state of the monitoring UI to a file that can be pulled 
up and displayed with some reader.  That way if people have an issue, we can 
tell them to hit the button and send us the file and we can see what they see 
(to a certain extent).

Original comment by nuggetwh...@gmail.com on 25 Jul 2014 at 5:29

GoogleCodeExporter commented 9 years ago

After a balance, compactions kick off.  It would be good to know how many 
pending compactions there are so we can get an idea of when they'll finish.

Original comment by nuggetwh...@gmail.com on 6 Aug 2014 at 3:13

GoogleCodeExporter commented 9 years ago

Rediff would like to see the IP address of the client of the ThriftBroker in 
the slow query log.

Original comment by nuggetwh...@gmail.com on 11 Aug 2014 at 5:33

GoogleCodeExporter commented 9 years ago

(from Rediff) Query profiling:

- Number of scans that the query kicked off.  In other words, the number of ROW 
specs in an OR clause.  They want to sort by the maximum offenders.
- Number of bytes in the return payload. (probably should be done in 
ThriftBroker)
- Number of bytes scanned in the query.
- Number of Ranges the query touched.
- Which RangeServers were involved.
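
The fields above could be captured in one record per query, as in this sketch. The record layout, field names, and sample values are assumptions for illustration, not an existing Hypertable structure.

```python
from dataclasses import dataclass, field


@dataclass
class QueryProfile:
    """Hypothetical per-query profile covering the items listed above."""

    query: str
    scans: int            # number of scans the query kicked off
    bytes_returned: int   # size of the return payload
    bytes_scanned: int    # bytes read while executing the query
    ranges_touched: int = 0
    range_servers: list = field(default_factory=list)


def top_offenders(profiles, limit=10):
    """Return the profiles with the most scans, largest first."""
    return sorted(profiles, key=lambda p: p.scans, reverse=True)[:limit]


# Made-up sample data.
profiles = [
    QueryProfile("q1", scans=3, bytes_returned=1200, bytes_scanned=90000,
                 ranges_touched=2, range_servers=["rs1"]),
    QueryProfile("q2", scans=250, bytes_returned=800, bytes_scanned=5000000,
                 ranges_touched=40, range_servers=["rs1", "rs2", "rs3"]),
    QueryProfile("q3", scans=17, bytes_returned=64, bytes_scanned=300000,
                 ranges_touched=5, range_servers=["rs2"]),
]
worst = top_offenders(profiles, limit=2)
print([p.query for p in worst])
```

Sorting on `scans` matches the request to rank by the maximum offenders; the same function could take a different key to rank by bytes scanned instead.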

Original comment by nuggetwh...@gmail.com on 11 Aug 2014 at 3:17