Real-time scalable monitoring server
This project is deprecated and no longer developed.
Clone this repo to your system:
$ git clone https://github.com/processone/grapherl.git
Prerequisites: Before executing make, please make sure you have Erlang/OTP 17.x installed.
Create a directory for storing metric objects:
$ sudo mkdir -p /var/db/grapherl
Compile and run
$ cd grapherl
$ make && sudo make console
The Grapherl client is located at localhost:9090.
NOTE: If make fails, then mail the error to kansi13 at gmail dot com with the subject Grapherl compile error, and you will get a reply within a couple of minutes.
If you have any questions or issues regarding Grapherl, you can also find me (kansi) on the #erlang IRC channel.
NOTE: This feature is under construction; don't use it.
Users who wish to upgrade from an older version of Grapherl to a newer one without restarting the Erlang VM can execute the following:
$ python upgrade.py VERSION_OF_YOUR_RUNNING_RELEASE
$ python upgrade.py 0.2.0 # example
The above example shows how a user who is currently running release version 0.2.0 can upgrade to the latest release configured in upgrade.py.
For the upgrade to be successful, the following should be kept in mind: upgrade.py is hardcoded to upgrade Grapherl to the latest version in the commit (e.g. 0.2.0 to 0.2.1). Jumping from a lower version (e.g. 0.2.0 to 0.2.3) has not been tested.
Grapherl by default listens on port 11111. The format for sending a metric point is as follows:
client_name/metric_name:metric_type/time_stamp:value
randomClient1/memory_usage:g/1441005678:1002938389 # example
The client_name can be any string, e.g. server01, website01.com, website 101, 1284398 etc. NOTE: Grapherl considers client_name just as a plain string, meaning that there is no special interpretation of website_101.com as compared to website_101; they are both just (different) strings to Grapherl. The metric_type can be g (gauge) or c (counter). Metric types are discussed here. By default, values for a gauge metric are averaged over an interval and values for a counter metric are added over an interval.
Sample Python (data_feed.py) and Erlang (testing.erl) modules which feed data into Grapherl are located under the grapherl/tests directory.
You can also play around with Grapherl by feeding data using these modules.
If you have any queries regarding feeding data into Grapherl, mail them to kansi13 at gmail dot com.
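For illustration, here is a minimal Erlang sketch (separate from the bundled data_feed.py and testing.erl modules, and not part of Grapherl itself) that sends a single metric point over UDP, assuming Grapherl is listening on the default port 11111 on localhost; the client and metric names are arbitrary examples:

%% Minimal sketch: send one metric point to Grapherl over UDP.
%% Assumes Grapherl is listening on localhost:11111 (the default port);
%% the client and metric names below are arbitrary examples.
send_point() ->
    {ok, Socket} = gen_udp:open(0, [binary]),
    {Mega, Sec, _} = os:timestamp(),
    Ts = integer_to_binary(Mega * 1000000 + Sec),   %% unix timestamp in seconds
    Packet = <<"server01/memory_usage:g/", Ts/binary, ":1002938389">>,
    ok = gen_udp:send(Socket, "localhost", 11111, Packet),
    gen_udp:close(Socket).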
Grapherl consists of 2 components: graph_db, which receives UDP data and stores it, and graph_web, which retrieves this data and creates nice visualizations for the user.
Brief description:
Before we discuss the various configurations, we give an overview of how this sub-app works so that the user can configure these options wisely. All incoming data is received by graph_db; multiple processes (known as router_workers) wait on the socket to receive a high volume of UDP traffic.
These received packets are forwarded to a db_worker process (one of a pool of worker processes), which decodes the received packet and stores it in RAM (inside ETS tables). All incoming points are aggregated in RAM and written to disk after a timeout.
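To make the decoding step concrete, here is a purely illustrative sketch (not the actual db_worker code) of how a packet in the client_name/metric_name:metric_type/time_stamp:value format could be parsed, assuming the individual fields contain no '/' or ':' characters:

%% Illustrative only, not the actual db_worker implementation.
decode(Packet) when is_binary(Packet) ->
    [Client, Metric, Type, Ts, Value] =
        binary:split(Packet, [<<"/">>, <<":">>], [global]),
    {Client, Metric, Type, binary_to_integer(Ts), binary_to_integer(Value)}.

%% decode(<<"randomClient1/memory_usage:g/1441005678:1002938389">>)
%% => {<<"randomClient1">>, <<"memory_usage">>, <<"g">>, 1441005678, 1002938389}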
Further, Grapherl expects a huge amount of data, so storing that amount of data as-is for long is not feasible. Hence, Grapherl constantly purges data according to a predefined scheme. To understand this scheme, let us consider a client (i.e. a server which sends the total number of online users each second). The purging scheme works as follows:
The following configurations can be found in the file graph_db.app.src located at grapherl/apps/graph_db/src/graph_db.app.src:
{storage_dir, <<"/var/db/grapherl/">>}
Specifies the directory where graph_db will store data points on disk. Note, the user should make sure that this directory exists and should start Grapherl with the necessary permissions (i.e. root permissions in this case).
{ports, [11111]}
Specifies the ports on which graph_db will listen. The user can specify multiple ports, e.g. {ports, [11111, 11112]}
{num_routers, 3}
Router processes receive incoming UDP traffic. This configuration specifies the number of processes which will monitor each opened socket and receive incoming data. The current configuration can handle around 1 million points per minute. It should be noted that mindlessly increasing the number of processes monitoring the socket can degrade performance.
{cache_to_disk_timeout, 60000}
Specifies the timeout (in milliseconds) after which the accumulated points stored in RAM will be dumped onto the disk.
{db_daemon_timeout, 60000}
This option defines the timeout (in milliseconds) after which data points stored on disk are checked for purging.
{cache_mod, db_ets}
{db_mod, db_levelDB}
cache_mod defines the module to be used for RAM storage and db_mod defines the module to be used for disk storage. By default graph_db uses ETS for RAM storage and LevelDB for disk storage, but the user is not restricted to these defaults. Users can write their own custom db modules and place them in the src directory of the graph_db app.
The user must note that these modules are based on a custom behaviour called gen_db (defined in graph_db). In order to write a custom module, the user can refer to the existing implementations or submit an issue to request support for a given db.
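For reference, here is a sketch of how these entries might appear together in the env section of graph_db.app.src; the values shown are simply the defaults discussed above, and the other application attributes are omitted:

%% Sketch of graph_db.app.src with the options discussed above (defaults shown).
{application, graph_db,
 [
  %% ... description, applications, mod, etc. omitted ...
  {env, [{storage_dir,           <<"/var/db/grapherl/">>},
         {ports,                 [11111]},
         {num_routers,           3},
         {cache_to_disk_timeout, 60000},
         {db_daemon_timeout,     60000},
         {cache_mod,             db_ets},
         {db_mod,                db_levelDB}]}
 ]}.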
Configuring graph_db according to the expected load is crucial to achieve the best performance. For example, too many router_worker processes can degrade performance, and having fewer or more db_worker processes than the hardware can support will also degrade performance. Also, cache_to_disk_timeout should be chosen carefully in accordance with the expected UDP traffic so that you don't run out of RAM. Lastly, keeping db_daemon_timeout very low can lead to unnecessary processing, hence degrading performance.
So, we discuss some performance details of graph_db. NOTE: this testing was done on a second generation Intel(R) Core(TM) i5-2430M CPU (4 processors). The Grapherl directory contains a module named testing.erl, which has been used to test graph_db. Following are some results:
If you want to go beyond receiving 1 million points per minute, Grapherl has something for you. You don't need to spin up another Grapherl instance for that; all you need to do is throw some more hardware at Grapherl and tweak the configurations. Assuming you have bought more hardware, to handle more data it is advisable to receive data on multiple ports, e.g. if you use 2 ports to receive data you can already receive 2 million points per minute. Now, to handle these data points you will need more db_workers (minimum 6). And since you are going to increase db_workers, make sure you have sufficient CPU threads (at least 8 if you run 6 db_workers).
NOTE: The configurations suggested in this section are mere speculation based on previously discussed testing results.
You can test Grapherl using the testing.erl module, and while you are testing you can monitor the system using the native Erlang app called observer, which has been included in Grapherl.
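For example, once Grapherl is running you can launch the observer GUI directly from the Grapherl (Erlang) shell:

%% Start the observer GUI from the Grapherl (Erlang) shell.
observer:start().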
In case you want to track a lot of metrics, graph_db allows the user to bootstrap RAM and disk db objects for metrics before any data starts coming in. Doing this is helpful because creating RAM and disk db objects is a time consuming task, so when receiving such heavy traffic it is advisable that the user bootstrap some of the metrics so that the system doesn't come under sudden load (though the app can handle sudden loads, this is just to assure constant CPU usage). In order to bootstrap metrics, the user needs to have a file in the following format:
cpu_usage, g
user_count, c
memory_usage, g
system_load, g
Each line contains the metric name and type separated by a comma. Once you have this file created, execute the following in the Grapherl (Erlang) shell:
db_manager:pre_process_metric("/absolute/path/to/metric/file").
NOTE: the above routine of bootstrapping metrics is purely optional. It is to be used in case you want to track a lot of metrics, and especially when you expect to receive a burst of new data points none of which has its corresponding metric objects created.
When tracking a large number of metrics it is advisable to increase the ulimit. For example, if you are tracking around 500 different metrics, then set the ulimit with ulimit -n 10000.
graph_db maintains a list of metric names for which it is receiving data. This state is stored in a file named db_manager.dat located in the storage_dir directory. This ensures that even across multiple VM restarts, or in case of a VM crash, graph_db knows which metric objects it was receiving data for. So, if you want to restart Grapherl but don't want it to reload its previous state, remove this file before restarting. On the other hand, if you want to migrate Grapherl to some other server but want it to be in the same state as the currently running instance, just take the db_manager.dat file and place it in storage_dir.
Note: the storage format for db_manager.dat is the same as that of the bootstrap file discussed in the previous section.
Brief description:
This sub-app is responsible for serving the data gathered by Grapherl. There isn't much to configure in graph_web except the port on which the web server listens. The default port is 9090, but the user is free to change it according to their needs. Remember to start Grapherl with the necessary permissions (e.g. sudo in case port < 1024).
Now we discuss various features offered on the client side.
NOTE: Though Grapherl allows users to specify the granularity at which they want to see data, graph_web serves data based on its availability and not on the queried granularity. What this means is that if a user wants to retrieve data at a particular granularity, Grapherl will try its best to serve it at the queried granularity. If the data is not available at the queried granularity, then Grapherl will serve data at a higher or lower granularity, depending on whichever is available. If the data is at a higher granularity, it is compressed to the queried granularity.
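As a purely illustrative sketch (not graph_web's actual code) of what this compression means, the snippet below groups {UnixTimestamp, Value} points into buckets of Interval seconds, averaging gauge values and summing counter values per bucket, as described earlier:

%% Illustrative sketch only: compress points to a coarser granularity.
%% Points is a list of {UnixTimestamp, Value}; Interval is the bucket size
%% in seconds; Type is g (gauge, averaged) or c (counter, summed).
compress(Points, Interval, Type) ->
    Buckets = lists:foldl(
        fun({Ts, Val}, Acc) ->
            Key = Ts - (Ts rem Interval),
            case maps:find(Key, Acc) of
                {ok, Vals} -> maps:put(Key, [Val | Vals], Acc);
                error      -> maps:put(Key, [Val], Acc)
            end
        end, #{}, Points),
    lists:sort([{Key, aggregate(Type, Vals)} || {Key, Vals} <- maps:to_list(Buckets)]).

aggregate(g, Vals) -> lists:sum(Vals) / length(Vals);   %% gauge: average over interval
aggregate(c, Vals) -> lists:sum(Vals).                   %% counter: sum over interval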