As mostly discussed in #6, we want additional stats to be collected by the API (rather than using TPV).
For this to work we need:
Telegraf plugins to collect measurements:
GalaxyDB SQL queries with all the necessary attributes (tool-by-tool basis, destination basis, etc.):
Compute destination monitoring scripts
Design the InfluxDB measurements so that little computation is required when TPV-API queries them; TPV should not spend more than a few milliseconds fetching the necessary data
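One way to keep query-time computation low is to write pre-aggregated values, with tool and destination stored as tags, so that a lookup is a single filtered read. The following is only a sketch of what such a point could look like in InfluxDB line protocol. The measurement name `tpv_job_stats` and the field names are hypothetical and not part of any agreed schema.

```python
def _escape(value, chars):
    """Backslash-escape the given characters, as line protocol requires."""
    for ch in chars:
        value = value.replace(ch, "\\" + ch)
    return value


def _format_field(value):
    """Render a field value: booleans, integers (with 'i' suffix), floats, strings."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_line_protocol(measurement, tags, fields, timestamp_ns=None):
    """Build one line-protocol record; tags are sorted by key."""
    if not fields:
        raise ValueError("line protocol requires at least one field")
    head = _escape(measurement, ", ")
    for key in sorted(tags):
        head += f",{_escape(key, ',= ')}={_escape(str(tags[key]), ',= ')}"
    body = ",".join(
        f"{_escape(key, ',= ')}={_format_field(value)}"
        for key, value in fields.items()
    )
    line = f"{head} {body}"
    if timestamp_ns is not None:
        line += f" {timestamp_ns}"
    return line


# Hypothetical pre-aggregated point for one (tool, destination) pair.
example = to_line_protocol(
    "tpv_job_stats",
    {"tool": "bwa mem", "destination": "condor"},
    {"median_run_time_s": 12.5, "jobs": 40},
    1700000000000000000,
)
print(example)
# tpv_job_stats,destination=condor,tool=bwa\ mem median_run_time_s=12.5,jobs=40i 1700000000000000000
```

Because the aggregation happens at write time, TPV-API would only need to filter on the `tool` and `destination` tags and read the latest point.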
The metrics we would want/need:
Static information of the job:
Requested CPUs/GPUs
Requested memory
This information is directly available in TPV
Dynamic information about the job, expressed as the combination of tool and destination (frequency: daily):
Information on the total capacity / currently free resources (condor_status)
Shell scripts that give an overview of the cluster allocation and availability in an InfluxDB-compatible format.
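As a rough illustration of such a monitoring script, the sketch below summarizes slot information into a single InfluxDB line-protocol record. It assumes the input comes from something like `condor_status -af Cpus Memory State` (one slot per line, memory in MB) and that only `Unclaimed` slots count as free; both assumptions would need checking against the actual pool configuration, especially with partitionable slots. The measurement name `cluster_capacity` and the field names are hypothetical.

```python
def summarize_slots(autoformat_output):
    """Sum total and free CPUs/memory from 'Cpus Memory State' lines."""
    totals = {"total_cpus": 0, "free_cpus": 0, "total_mem_mb": 0, "free_mem_mb": 0}
    for raw in autoformat_output.splitlines():
        parts = raw.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        try:
            cpus, mem_mb = int(parts[0]), int(parts[1])
        except ValueError:
            continue  # skip slots reporting non-numeric values
        totals["total_cpus"] += cpus
        totals["total_mem_mb"] += mem_mb
        # Assumption: a slot is "free" only when its state is Unclaimed.
        if parts[2] == "Unclaimed":
            totals["free_cpus"] += cpus
            totals["free_mem_mb"] += mem_mb
    return totals


def to_influx_line(destination, totals):
    """Render the totals as one line-protocol record tagged by destination.

    Assumes the destination id contains no spaces, commas, or equals signs.
    """
    fields = ",".join(f"{key}={value}i" for key, value in totals.items())
    return f"cluster_capacity,destination={destination} {fields}"


sample = "8 16384 Unclaimed\n4 8192 Claimed\n"
print(to_influx_line("condor", summarize_slots(sample)))
# cluster_capacity,destination=condor total_cpus=12i,free_cpus=8i,total_mem_mb=24576i,free_mem_mb=16384i
```

A script like this could print the record to stdout, which is the form a Telegraf `exec` input with `data_format = "influx"` expects.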
Note: We need a plan for configuring the remote/Pulsar destinations to ship their data to InfluxDB. The EU could bake the Pulsar images with the credentials and scripts required to push the data to the EU's InfluxDB. This way, we do not have to set up a dedicated resource for this and can use what the EU already has.
Dynamic information about the job, expressed as the combination of tool and destination (frequency: daily):
$ gxadmin query destination-queue-run-time --seconds --older-than=90
Dynamic information about the destination (frequency: 30 mins):
$ gxadmin query queue --by destination