ejabberd monitoring server
MIT License

Grapherl

Real-time scalable monitoring server

This project is deprecated and no longer developed.

Quick start

Clone this repo to your system:

  $ git clone https://github.com/processone/grapherl.git

Prerequisites: Before running make, make sure you have Erlang/OTP 17.x installed.

Create a directory for storing metric objects:

  $ sudo mkdir -p /var/db/grapherl

Compile and run:

  $ cd grapherl
  $ make && sudo make console

The Grapherl client is then available at localhost:9090.

NOTE: If make fails, mail the error to kansi13 at gmail dot com with the subject "Grapherl compile error"; you will get a reply within a couple of minutes.

If you have any questions or issues regarding Grapherl, you can also find me (kansi) on the #erlang IRC channel.

Upgrading Grapherl : appup

NOTE: This feature is under construction; do not use it.

Users who wish to upgrade from an older version of Grapherl to a newer one without restarting the Erlang VM can execute the following:

  $ python upgrade.py VERSION_OF_YOUR_RUNNING_RELEASE
  $ python upgrade.py 0.2.0           # example

The above example shows how a user who is currently running release version 0.2.0 can upgrade to the latest release configured in upgrade.py.

WARNING

For the upgrade to be successful, the following should be kept in mind:

Getting data into Grapherl

By default Grapherl listens for UDP data on port 11111. The format for sending a metric point is as follows:

  client_name/metric_name:metric_type/time_stamp:value

  randomClient1/memory_usage:g/1441005678:1002938389   # example
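
As an illustration of the format above, here is a minimal Python sketch that builds a metric point and sends it over UDP. The function names are made up for this example, and it assumes that each point travels in its own datagram; the data_feed.py module in the tests directory remains the reference for how data is actually fed.

```python
import socket


def format_point(client, metric, metric_type, timestamp, value):
    """Build one point as client_name/metric_name:metric_type/time_stamp:value."""
    return f"{client}/{metric}:{metric_type}/{timestamp}:{value}"


def send_point(point, host="localhost", port=11111):
    """Send a single point as one UDP datagram (assumption: one point per datagram)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(point.encode("ascii"), (host, port))


# Reproduces the example line shown above.
print(format_point("randomClient1", "memory_usage", "g", 1441005678, 1002938389))
# prints: randomClient1/memory_usage:g/1441005678:1002938389
```

Calling send_point(...) with the formatted string delivers it to a Grapherl instance listening on the default port.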

Sample Python (data_feed.py) and Erlang (testing.erl) modules which feed data into Grapherl are located under the grapherl/tests directory. You can also play around with Grapherl by feeding data using these modules.

If you have any queries regarding feeding data into Grapherl, mail them to kansi13 at gmail dot com.

Configurations

Grapherl consists of 2 components: graph_db, which receives UDP data and stores it, and graph_web, which retrieves this data and creates visualizations for the user.

graph_db

Brief description:

Before discussing the individual options, here is an overview of how this sub-app works, so that you can configure them wisely. All incoming data is received by graph_db: multiple processes (known as router_workers) wait on the socket so that a high volume of UDP traffic can be received.

Received packets are forwarded to db_worker, a pool of worker processes that decode each packet and store it in RAM (inside ETS tables). All incoming points are aggregated in RAM and written to disk after a timeout.

Furthermore, Grapherl expects a huge amount of data, and storing that much data as-is for a long time is not feasible. Hence, Grapherl constantly purges data according to a predefined scheme. To understand this scheme, consider a client (e.g. a server which sends the total number of online users each second). The purging scheme works as follows:

Configurations

The following configurations can be found in file graph_db.app.src located at grapherl/apps/graph_db/src/graph_db.app.src

{storage_dir, <<"/var/db/grapherl/">>}

Specifies the directory where graph_db stores data points on disk. Make sure that the directory exists, and start Grapherl with the necessary permissions (i.e. root permissions in this case).

{ports, [11111]}

Specifies the ports on which graph_db listens. Multiple ports can be specified, e.g. {ports, [11111, 11112]}

{num_routers, 3}

Router processes receive incoming UDP traffic. This option specifies the number of processes that monitor each opened socket and receive incoming data. The current configuration can handle around 1 million points per minute. Note that increasing the number of processes monitoring a socket without measuring can degrade performance.

{cache_to_disk_timeout, 60000}

Specifies the timeout (in milliseconds) after which the accumulated points stored in RAM are dumped to disk.

{db_daemon_timeout, 60000}

This option defines the timeout (in milliseconds) after which data points stored on disk are checked for purging.

{cache_mod, db_ets}
{db_mod, db_levelDB}

cache_mod defines the module used for RAM storage and db_mod defines the module used for disk storage. By default graph_db uses ETS for RAM storage and LevelDB for disk storage, but you are not restricted to these defaults: you can write custom db modules and place them in the src directory of the graph_db app. Note that these modules are based on a custom behaviour called gen_db (defined in graph_db). To write a custom module, refer to the existing implementations or submit an issue asking for support for the given db.

How to optimize your graph_db configuration

Configuring graph_db according to the expected load is crucial for achieving the best performance. For example, too many router_worker processes can degrade performance, and having too few db_worker processes, or more than the hardware can support, will also degrade performance. cache_to_disk_timeout should be chosen carefully in accordance with the expected UDP traffic so that you don't run out of RAM. Lastly, keeping db_daemon_timeout very low can lead to unnecessary processing, which degrades performance.
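
To get a rough feel for how cache_to_disk_timeout interacts with traffic, the following Python sketch estimates how many points accumulate in RAM between two flushes. It is back-of-the-envelope arithmetic only: the function names and the bytes_per_point figure are hypothetical, and it assumes every point is kept individually until the flush, which makes the result an upper bound rather than a measurement.

```python
def points_buffered(points_per_minute, cache_to_disk_timeout_ms):
    """Upper bound on points held in RAM between two flushes to disk."""
    return points_per_minute * cache_to_disk_timeout_ms // 60_000


def buffered_bytes(points_per_minute, cache_to_disk_timeout_ms, bytes_per_point):
    """Rough RAM estimate; bytes_per_point is a guess you must supply yourself."""
    return points_buffered(points_per_minute, cache_to_disk_timeout_ms) * bytes_per_point


# 1 million points per minute with the default 60000 ms timeout.
print(points_buffered(1_000_000, 60_000))      # prints: 1000000
# Same traffic, assuming (hypothetically) 100 bytes per stored point.
print(buffered_bytes(1_000_000, 60_000, 100))  # prints: 100000000
```

Halving cache_to_disk_timeout halves the estimate, at the cost of more frequent disk writes.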

This section discusses some performance details of graph_db. NOTE: this testing was done on a second-generation Intel(R) Core(TM) i5-2430M CPU (4 processors). The Grapherl directory contains a module named testing.erl, which was used to test graph_db. Following are some results:

Handling a huge number of data points

If you want to go beyond receiving 1 million points per minute, you don't need to spin up another Grapherl instance: all you need to do is throw more hardware at Grapherl and tweak the configuration. Assuming you have the extra hardware, it is advisable to receive data on multiple ports; for example, with 2 ports you can receive 2 million points per minute. To handle these data points you will need more db_workers (minimum 6). And since you are increasing the number of db_workers, make sure you have sufficient CPU threads (at least 8 if you run 6 db_workers).

NOTE: The configurations suggested in this section are speculation based on the previously discussed testing results. You can test Grapherl using the testing.erl module, and while testing you can monitor the system using the native Erlang app observer, which is included in Grapherl.
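
The port arithmetic in this section can be sketched in Python as follows. It assumes throughput scales linearly at roughly 1 million points per minute per port, which is the same speculative figure used above, and the helper names are hypothetical. The second function shows one way a client could spread its points evenly over the configured ports.

```python
import math
from itertools import cycle

# Per-port figure quoted in the text; treat it as an estimate, not a guarantee.
POINTS_PER_PORT_PER_MINUTE = 1_000_000


def ports_needed(target_points_per_minute):
    """Number of listening ports suggested by the linear-scaling assumption."""
    return max(1, math.ceil(target_points_per_minute / POINTS_PER_PORT_PER_MINUTE))


def assign_ports(points, ports):
    """Pair each point with a port in round-robin order."""
    return list(zip(points, cycle(ports)))


print(ports_needed(2_000_000))                        # prints: 2
print(assign_ports(["p1", "p2", "p3"], [11111, 11112]))
# prints: [('p1', 11111), ('p2', 11112), ('p3', 11111)]
```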

Handling a large number of metrics

If you want to track a lot of metrics, graph_db allows you to bootstrap the RAM and disk db objects for metrics before any data starts coming in. This is helpful because creating RAM and disk db objects is time-consuming, so when heavy traffic is expected it is advisable to bootstrap some of the metrics so that the system doesn't come under sudden load (the app can handle sudden loads; this is just to keep CPU usage steady). To bootstrap metrics you need a file in the following format:

 cpu_usage, g
 user_count, c
 memory_usage, g
 system_load, g

Each line contains the metric name and type separated by a comma. Once you have created this file, execute the following in the Grapherl (Erlang) shell:

 db_manager:pre_process_metric("/absolute/path/to/metric/file").

NOTE: the bootstrapping routine above is purely optional. It is meant for cases where you track a lot of metrics and expect a burst of new data points, none of which has its corresponding metric objects created yet.
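
If the list of metrics already lives in a script or inventory, the bootstrap file can be generated rather than typed by hand. The following Python sketch writes one "name, type" line per metric in the format shown above; the function name is hypothetical, and whether Grapherl tolerates variations such as missing spaces or trailing blank lines is not covered here.

```python
def write_bootstrap_file(path, metrics):
    """Write one 'metric_name, metric_type' line per (name, type) pair."""
    with open(path, "w", encoding="ascii", newline="\n") as f:
        for name, metric_type in metrics:
            f.write(f"{name}, {metric_type}\n")
```

For example, write_bootstrap_file("/tmp/metrics.txt", [("cpu_usage", "g"), ("user_count", "c")]) produces a two-line file whose absolute path can then be passed to db_manager:pre_process_metric.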

When tracking a large number of metrics it is advisable to increase the open-file limit. For example, if you are tracking around 500 different metrics, set it with ulimit -n 10000.
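
To check the limit currently in effect before starting Grapherl, something like the following Python sketch can be run from the same shell. It uses the standard resource module (Unix only) and reports the limit inherited by the Python process, which is an indirect check and not something Grapherl itself provides; the function names are illustrative.

```python
import resource


def open_file_limit():
    """Return the (soft, hard) limits on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def limit_is_sufficient(required):
    """True if the soft limit is unlimited or at least `required`."""
    soft, _hard = open_file_limit()
    return soft == resource.RLIM_INFINITY or soft >= required


if not limit_is_sufficient(10_000):
    print("open-file limit is below 10000; consider running: ulimit -n 10000")
```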

graph_db across multiple boots

graph_db maintains a list of the metric names for which it is receiving data. This state is stored in a file named db_manager.dat located in the storage_dir directory. This ensures that even across multiple VM restarts, or after a VM crash, graph_db knows which metric objects it was receiving data for. So, if you want to restart Grapherl without reloading its previous state, remove this file before restarting. Conversely, if you want to migrate Grapherl to another server in the same state as the currently running instance, copy the db_manager.dat file into storage_dir on the new server.

Note: the storage format of db_manager.dat is the same as that of the bootstrap file discussed in the previous section.
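
Since the format matches the bootstrap file, the tracked metrics can be listed with a few lines of Python before deciding whether to keep or remove db_manager.dat. This is a sketch that assumes the file is plain text with one "name, type" pair per line, as described above; the function name is hypothetical.

```python
def parse_metric_list(text):
    """Parse 'name, type' lines into a list of (name, type) tuples."""
    metrics = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        name, _, metric_type = line.partition(",")
        metrics.append((name.strip(), metric_type.strip()))
    return metrics


print(parse_metric_list("cpu_usage, g\nuser_count, c\n"))
# prints: [('cpu_usage', 'g'), ('user_count', 'c')]
```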

graph_web

Brief description:

This sub-app is responsible for serving the data gathered by Grapherl. There isn't much to configure in graph_web except the port on which the web server listens. The default port is 9090, but you are free to change it according to your needs. Remember to start Grapherl with the necessary permissions (e.g. sudo if the port is below 1024).

Now we discuss various features offered on the client side.

NOTE: Although Grapherl lets users specify the granularity at which they want to see data, graph_web serves data based on availability rather than on the queried granularity. That is, Grapherl will try its best to serve data at the queried granularity; if data is not available at that granularity, it serves data at a higher or lower granularity, whichever is available. If the available data has a higher granularity, it is compressed to the queried granularity.
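
To make the idea of compressing data to a coarser granularity concrete, here is a small Python sketch that groups (timestamp, value) points into fixed-size time buckets and averages each bucket. Averaging is an assumption made for this example; the aggregation Grapherl actually applies may differ, and the function name is hypothetical.

```python
from collections import defaultdict


def compress(points, bucket_seconds):
    """Average (timestamp, value) points into buckets of bucket_seconds."""
    buckets = defaultdict(list)
    for ts, value in points:
        # Align each timestamp to the start of its bucket.
        buckets[ts - ts % bucket_seconds].append(value)
    return [(start, sum(vals) / len(vals)) for start, vals in sorted(buckets.items())]


# Four points at 30-second spacing compressed to per-minute granularity.
print(compress([(0, 10), (30, 20), (60, 30), (90, 50)], 60))
# prints: [(0, 15.0), (60, 40.0)]
```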

Contributors