MLFlow is exposing metrics for Prometheus on the '/metrics' endpoint. I have added the prometheus app from the prometheus-k8s charm and a relation with MLFlow. MLFlow correctly reads the prometheus information and sends the scraping configuration (a rough sketch of the deployment commands follows the log below):
2021-02-17 07:14:31 INFO juju-log prometheus:30: ================================
2021-02-17 07:14:31 INFO juju-log prometheus:30: _on_prometheus_relation_changed is running; <ops.charm.RelationJoinedEvent object at 0x7f98ce7027c0>
2021-02-17 07:14:31 INFO juju-log prometheus:30: ================================
2021-02-17 07:14:31 INFO juju-log prometheus:30: ================================
2021-02-17 07:14:31 INFO juju-log prometheus:30: _on_prometheus_relation_changed is running; Info received
2021-02-17 07:14:31 INFO juju-log prometheus:30: 9090
2021-02-17 07:14:31 INFO juju-log prometheus:30: /metrics
2021-02-17 07:14:31 INFO juju-log prometheus:30: {"host": "prometheus-0"}
2021-02-17 07:14:31 INFO juju-log prometheus:30: None
2021-02-17 07:14:31 INFO juju-log prometheus:30: None
2021-02-17 07:14:32 INFO juju-log prometheus:30: ================================
2021-02-17 07:14:32 INFO juju-log prometheus:30: ================================
2021-02-17 07:14:32 INFO juju-log prometheus:30: _on_prometheus_relation_changed is running; Info send
2021-02-17 07:14:32 INFO juju-log prometheus:30: 5000
2021-02-17 07:14:32 INFO juju-log prometheus:30: /metrics
2021-02-17 07:14:32 INFO juju-log prometheus:30: 1m
2021-02-17 07:14:32 INFO juju-log prometheus:30: None
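For reference, here is a rough sketch of how this setup could be reproduced from the Juju CLI; the application names (mlflow, prometheus) are taken from the logs above, while the exact charm source and relation endpoints are assumptions rather than the commands actually used:

# deploy the prometheus-k8s charm alongside the existing mlflow application
juju deploy cs:~charmed-osm/prometheus-k8s prometheus
# relate the two applications so the scrape configuration is exchanged
juju add-relation mlflow prometheus
# confirm both applications and the relation are listed
juju status --relations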
But when I look at the prometheus logs:
root@e81400ea4dc43c92:~# kubectl logs -n kf prometheus-0 -c prometheus-k8s
level=info ts=2021-02-17T07:17:18.183Z caller=main.go:293 msg="no time or size retention was set so using the default time retention" duration=15d
level=info ts=2021-02-17T07:17:18.183Z caller=main.go:329 msg="Starting Prometheus" version="(version=2.12.0, branch=HEAD, revision=43acd0e2e93f9f70c49b2267efa0124f1e759e86)"
level=info ts=2021-02-17T07:17:18.183Z caller=main.go:330 build_context="(go=go1.12.8, user=root@7a9dbdbe0cc7, date=20190818-13:53:16)"
level=info ts=2021-02-17T07:17:18.183Z caller=main.go:331 host_details="(Linux 5.4.43 #1 SMP Mon Jun 1 17:26:49 UTC 2020 x86_64 prometheus-0 (none))"
level=info ts=2021-02-17T07:17:18.183Z caller=main.go:332 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2021-02-17T07:17:18.183Z caller=main.go:333 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2021-02-17T07:17:18.185Z caller=main.go:654 msg="Starting TSDB ..."
level=info ts=2021-02-17T07:17:18.185Z caller=web.go:448 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2021-02-17T07:17:18.189Z caller=head.go:509 component=tsdb msg="replaying WAL, this may take awhile"
level=info ts=2021-02-17T07:17:18.190Z caller=head.go:557 component=tsdb msg="WAL segment loaded" segment=0 maxSegment=0
level=info ts=2021-02-17T07:17:18.190Z caller=main.go:669 fs_type=EXT4_SUPER_MAGIC
level=info ts=2021-02-17T07:17:18.190Z caller=main.go:670 msg="TSDB started"
level=info ts=2021-02-17T07:17:18.190Z caller=main.go:740 msg="Loading configuration file" filename=/etc/prometheus/prometheus.yml
level=info ts=2021-02-17T07:17:18.191Z caller=main.go:768 msg="Completed loading of configuration file" filename=/etc/prometheus/prometheus.yml
level=info ts=2021-02-17T07:17:18.191Z caller=main.go:623 msg="Server is ready to receive web requests."
level=error ts=2021-02-17T07:17:23.192Z caller=scrape.go:352 component="scrape manager" scrape_pool=prometheus msg="creating targets failed" err="instance 0 in group 0: no address"
It seems that the scrape target is not correctly configured.
Solved by adding the default target: https://github.com/mlopsworks/charms/blob/2609c7ee817361076821f7d5750e41a84ba31929/bundle.yaml#L18
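For anyone hitting the same "no address" error, one way to see what Prometheus actually rendered is to read the generated config inside the pod (pod and container names are the ones from the kubectl command above); the static_configs lines in the comments are only an illustration of what a populated target should look like, not the charm's exact output:

# inspect the scrape configuration generated by the charm
kubectl exec -n kf prometheus-0 -c prometheus-k8s -- cat /etc/prometheus/prometheus.yml
# a working scrape job should contain a populated static target, e.g.
#   static_configs:
#     - targets: ['mlflow:5000']
# the "no address" error above means the target group had no address set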
I have not been able to see the Prometheus graphical interface, not even with a port redirection as we did for mlflow, minio and kubeflow (a port-forward sketch is included after the table below). But when requesting the targets page of the prometheus server from within the cluster (curl --request GET "10.103.173.33:9090/targets") we can see that the scraping is working without errors:
Endpoint: http://mlflow:5000/metrics
State: up
Labels: instance="mlflow:5000", job="prometheus"
  (before relabeling: __address__="mlflow:5000", __metrics_path__="/metrics", __scheme__="http", job="prometheus")
Last scrape: 1.261s ago
Scrape duration: 4.751ms
Errors: (none)
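As a sketch only (it is an assumption that this works around the redirection problem mentioned above), the Prometheus UI can usually be reached with a plain kubectl port-forward:

# forward the Prometheus web port to the local machine
kubectl port-forward -n kf pod/prometheus-0 9090:9090
# then open http://localhost:9090/targets in a browser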
expose prometheus metrics from mlflow into the existing prometheus charm
presumably https://jaas.ai/u/charmed-osm/prometheus-k8s and https://www.mlflow.org/docs/latest/cli.html#cmdoption-mlflow-server-expose-prometheus
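A sketch of the MLflow side, based on the CLI option linked above; the metrics directory path here is just an example value:

# start the tracking server with the Prometheus exporter enabled
mlflow server --host 0.0.0.0 --port 5000 --expose-prometheus /tmp/mlflow-prometheus
# metrics are then served at http://<host>:5000/metrics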