reanahub / reana

REANA: Reusable research data analysis platform
https://docs.reana.io
MIT License

benchmark: new `monitor` command for DB and K8S statuses #574

Closed tiborsimko closed 2 years ago

tiborsimko commented 3 years ago

(stems from https://github.com/reanahub/reana/pull/541#discussion_r716529942)

Current behaviour

Currently, while running the benchmarking script, one can monitor the DB status and the K8S status independently via a script like the following:

#!/bin/sh
# Monitor DB and K8S statuses while a benchmark is running.
# Usage: ./monitor.sh <workflow-name-prefix>

workflow=$1

while true; do

    # number of open PostgreSQL connections
    echo "$(date +%Y-%m-%dT%H:%M:%S) db pg_stat_activity count:"
    kubectl exec deployment/reana-db -- psql -U reana -c "SELECT COUNT(*) FROM pg_stat_activity;"

    # workflow statuses as recorded in the REANA database
    echo "$(date +%Y-%m-%dT%H:%M:%S) db workflow.status count:"
    kubectl exec deployment/reana-db -- psql -U reana -c "SELECT status,COUNT(*) FROM __reana.workflow WHERE name LIKE '$workflow-%' GROUP BY status;"

    # histogram of workflow batch pod (run-b) statuses
    echo "$(date +%Y-%m-%dT%H:%M:%S) run-b pods count:"
    kubectl get pods | grep run-b | grep -v STATUS | awk '{print $3}' | sort | uniq -c | sort -rn | head -20 | awk '!max{max=$1} {r=""; i=60*$1/max; while (i-- > 0) r=r"#"; printf "%25s %5d %s\n", $2, $1, r}'

    # histogram of workflow job pod (run-j) statuses
    echo "$(date +%Y-%m-%dT%H:%M:%S) run-j pods count:"
    kubectl get pods | grep run-j | grep -v STATUS | awk '{print $3}' | sort | uniq -c | sort -rn | head -20 | awk '!max{max=$1} {r=""; i=60*$1/max; while (i-- > 0) r=r"#"; printf "%25s %5d %s\n", $2, $1, r}'

    echo

    sleep 30

done

This gives output as follows, at one particular moment in time:

2021-10-31T20:48:35 db pg_stat_activity count:
 count
-------
   182
(1 row)

2021-10-31T20:48:35 db workflow.status count:
 status  | count
---------+-------
 running |   177
 queued  |    89
 pending |   134
(3 rows)

2021-10-31T20:48:36 run-b pods count:
                  Running   190 ############################################################
        ContainerCreating     3 #
2021-10-31T20:48:36 run-j pods count:
                  Running   198 ############################################################
                 Init:0/1    16 #####
          PodInitializing     7 ###

and, 30 seconds later:

2021-10-31T20:49:07 db pg_stat_activity count:
 count
-------
   212
(1 row)

2021-10-31T20:49:07 db workflow.status count:
 status  | count
---------+-------
 pending |   197
 running |   203
(2 rows)

2021-10-31T20:49:08 run-b pods count:
                  Running   217 ############################################################
2021-10-31T20:49:09 run-j pods count:
                  Running   314 ############################################################
                 Init:0/1    15 ###
          PodInitializing     7 ##
                  Pending     1 #

These time snapshots allow one to monitor the number of DB connections, to compare the DB statuses with the K8S statuses and the number of "Running" pods with the number of "Pending" pods, to see how fast pods terminate, etc., giving a complementary picture of what is happening in the cluster.

The trouble is that this "side" monitoring is somewhat detached from the main output of the benchmark scripts. It would be advantageous to correlate this information more closely with the workflow burn-down plots.

Expected behaviour

We can introduce a new command, monitor --sleep 30, which would do the above automatically and collect the information either in the textual format above (MVP) or, even better, in a CSV format that would later allow plotting nice DB and K8S status evolution graphs showing how the measured DB and K8S quantities evolve as a function of time.

For example, once #573 is implemented, we shall have a "real-time arrow" representation of the workflow burn-down in the cluster, and the DB info plots and K8S info plots will nicely complement the overall picture of what is happening in the cluster.

They might give graphical insight into the "orange hill" and "blue spread" phenomena, such as the transition of workflow pods through the "Running -> NotReady -> Terminating" statuses.
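
A minimal sketch of what such a monitor command could look like, assuming a click-based CLI in the spirit of the existing benchmark commands; the collect_snapshot() helper and the option names are illustrative assumptions, not a final interface:

import time

import click


def collect_snapshot(workflow):
    """Return one textual snapshot of DB and K8S statuses (hypothetical helper)."""
    return "db/k8s snapshot for workflows matching '{}-%'".format(workflow)


@click.command()
@click.option("--workflow", "-w", required=True, help="Workflow name prefix to monitor.")
@click.option("--sleep", default=30, show_default=True, help="Seconds between snapshots.")
def monitor(workflow, sleep):
    """Periodically print DB and K8S statuses (textual MVP)."""
    while True:
        click.echo(collect_snapshot(workflow))
        time.sleep(sleep)


if __name__ == "__main__":
    monitor()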

VMois commented 2 years ago

I would prefer to store the data in CSV format (or JSON) so that it can be connected with collected_results.csv later.

After some investigation, it looks like more thought will be needed on how the monitored data should be structured and saved.

  1. For example, the number of DB connections is "easy" to structure in a CSV file:
monitored_date,db_connections_number
2021-11-16T10:24:12,15
2021-11-16T10:25:12,20

If we want to add workflow statuses, it gets a bit more complicated:

monitored_date,db_connections_number,status,count
2021-11-16T10:24:12,15,running,5
2021-11-16T10:24:12,15,pending,2
2021-11-16T10:25:12,20,running,6
2021-11-16T10:25:12,20,pending,1

If we want to add pod statuses, it gets even more complicated:

monitored_date,db_connections_number,status,count,type,type_count
2021-11-16T10:24:12,15,running,5,run-b,5
2021-11-16T10:24:12,15,running,5,run-j,2
...

Splitting the data into multiple CSV files could help, but it would introduce more complexity in the analyze command to merge them back together.

  2. Another idea is to use a single JSON file instead of multiple CSV files:
{
    "2021-11-16T10:24:12": {
        "db_connection_number": 15,
        "workflow_statuses": {
            "running": 5,
            "pending": 2
        }
    }
}

This is a more flexible approach. It is also possible to extend the file with new metrics by simply adding a new entry under the 2021-11-16T10:24:12 key. In the analyze command, I can just use the key (date) to plot the metrics.

P.S. While writing down my findings, I realized that JSON looks like a good idea. Writing things down helps a lot :)

P.P.S. This whole question of how to save the data is a classic "structured vs unstructured data" debate.
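
For illustration, a minimal sketch of working with such a timestamp-keyed JSON file; the file name, helper names, and metric names are assumptions for the example, not a decided interface:

import json


def append_snapshot(path, timestamp, metrics):
    """Add one snapshot under its timestamp key and rewrite the file."""
    try:
        with open(path) as f:
            data = json.load(f)
    except FileNotFoundError:
        data = {}
    data[timestamp] = metrics
    with open(path, "w") as f:
        json.dump(data, f, indent=2)


def load_metric_series(path, metric):
    """Return (timestamps, values) for one top-level metric, ready for plotting."""
    with open(path) as f:
        data = json.load(f)
    timestamps = sorted(data)
    return timestamps, [data[t][metric] for t in timestamps]


append_snapshot(
    "monitored_results.json",
    "2021-11-16T10:24:12",
    {"db_connection_number": 15, "workflow_statuses": {"running": 5, "pending": 2}},
)
timestamps, connections = load_metric_series("monitored_results.json", "db_connection_number")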

VMois commented 2 years ago

Suggestion: the point of this issue is to develop the monitor command only. I will add another issue that will focus on how the analyze command will use the monitored data and plot it alongside what we already have.

VMois commented 2 years ago

Another thing: I will use subprocess to execute the commands and parse their output. It may not be as efficient as using an API (such as a Python K8S client library), but it is simpler to start with. We can improve it later if needed.
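
For example, a minimal sketch of the subprocess approach for one of the metrics, reusing the kubectl/psql command from the shell script above (the psql -t/-A flags are added here only to make the output easier to parse):

import subprocess


def get_db_connection_count():
    """Count open PostgreSQL connections via kubectl exec + psql."""
    cmd = [
        "kubectl", "exec", "deployment/reana-db", "--",
        "psql", "-U", "reana", "-t", "-A",
        "-c", "SELECT COUNT(*) FROM pg_stat_activity;",
    ]
    output = subprocess.check_output(cmd).decode().strip()
    return int(output)

The pod status histograms could be collected the same way from kubectl get pods, counting the STATUS column in Python instead of with awk.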