mlopsworks / charms

WIP charms
Apache License 2.0

connect mlflow charm to prometheus charm #7

Open lukemarsden opened 3 years ago

lukemarsden commented 3 years ago

expose prometheus metrics from mlflow into the existing prometheus charm

presumably https://jaas.ai/u/charmed-osm/prometheus-k8s and https://www.mlflow.org/docs/latest/cli.html#cmdoption-mlflow-server-expose-prometheus
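For reference, the flag linked above would presumably be wired up something like the following. This is a hedged sketch only: the paths are placeholders, and in recent MLflow releases --expose-prometheus takes a directory where the multiprocess metrics files are stored.

```shell
# Illustrative only; host, port, and metrics directory are placeholders.
mlflow server \
  --host 0.0.0.0 \
  --port 5000 \
  --expose-prometheus /tmp/mlflow-metrics
```

With the flag enabled, the exporter serves Prometheus-format metrics on the server's /metrics endpoint.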

egranell commented 3 years ago

MLflow is exposing metrics for Prometheus on the '/metrics' endpoint.

I have added the prometheus app from the prometheus-k8s charm and a relation with MLflow. MLflow correctly reads the Prometheus relation data and sends its scraping configuration:

2021-02-17 07:14:31 INFO juju-log prometheus:30: ================================
2021-02-17 07:14:31 INFO juju-log prometheus:30: _on_prometheus_relation_changed is running; <ops.charm.RelationJoinedEvent object at 0x7f98ce7027c0>
2021-02-17 07:14:31 INFO juju-log prometheus:30: ================================
2021-02-17 07:14:31 INFO juju-log prometheus:30: ================================
2021-02-17 07:14:31 INFO juju-log prometheus:30: _on_prometheus_relation_changed is running; Info received 
2021-02-17 07:14:31 INFO juju-log prometheus:30: 9090
2021-02-17 07:14:31 INFO juju-log prometheus:30: /metrics
2021-02-17 07:14:31 INFO juju-log prometheus:30: {"host": "prometheus-0"}
2021-02-17 07:14:31 INFO juju-log prometheus:30: None
2021-02-17 07:14:31 INFO juju-log prometheus:30: None
2021-02-17 07:14:32 INFO juju-log prometheus:30: ================================
2021-02-17 07:14:32 INFO juju-log prometheus:30: ================================
2021-02-17 07:14:32 INFO juju-log prometheus:30: _on_prometheus_relation_changed is running; Info send
2021-02-17 07:14:32 INFO juju-log prometheus:30: 5000
2021-02-17 07:14:32 INFO juju-log prometheus:30: /metrics
2021-02-17 07:14:32 INFO juju-log prometheus:30: 1m
2021-02-17 07:14:32 INFO juju-log prometheus:30: None
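The exchange in the log above can be sketched in plain Python to show why Prometheus later complains about a missing address. This is not the charm's actual code; build_static_config and the databag layout are hypothetical, but the values (port 5000, /metrics, 1m) match the log:

```python
# Hypothetical databag published by the MLflow side of the relation,
# mirroring the values seen in the juju-log output above.
mlflow_side = {
    "port": "5000",
    "metrics_path": "/metrics",
    "scrape_interval": "1m",
}

def build_static_config(databag, host):
    """Sketch of turning relation data into a Prometheus static scrape config.

    Prometheus needs a concrete host:port target. If no host is resolved,
    the generated group has an empty target list, which is exactly what
    the 'instance 0 in group 0: no address' error reports.
    """
    targets = ["{}:{}".format(host, databag["port"])] if host else []
    return {
        "metrics_path": databag["metrics_path"],
        "static_configs": [{"targets": targets}],
    }

print(build_static_config(mlflow_side, None))      # no host resolved -> empty targets
print(build_static_config(mlflow_side, "mlflow"))  # a valid host:port target
```

The point of the sketch: the relation data carried the port and path but no usable address, so the scrape pool was created with an empty target group.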

But when I look at the Prometheus logs:

root@e81400ea4dc43c92:~# kubectl logs -n kf prometheus-0 -c prometheus-k8s
level=info ts=2021-02-17T07:17:18.183Z caller=main.go:293 msg="no time or size retention was set so using the default time retention" duration=15d
level=info ts=2021-02-17T07:17:18.183Z caller=main.go:329 msg="Starting Prometheus" version="(version=2.12.0, branch=HEAD, revision=43acd0e2e93f9f70c49b2267efa0124f1e759e86)"
level=info ts=2021-02-17T07:17:18.183Z caller=main.go:330 build_context="(go=go1.12.8, user=root@7a9dbdbe0cc7, date=20190818-13:53:16)"
level=info ts=2021-02-17T07:17:18.183Z caller=main.go:331 host_details="(Linux 5.4.43 #1 SMP Mon Jun 1 17:26:49 UTC 2020 x86_64 prometheus-0 (none))"
level=info ts=2021-02-17T07:17:18.183Z caller=main.go:332 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2021-02-17T07:17:18.183Z caller=main.go:333 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2021-02-17T07:17:18.185Z caller=main.go:654 msg="Starting TSDB ..."
level=info ts=2021-02-17T07:17:18.185Z caller=web.go:448 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2021-02-17T07:17:18.189Z caller=head.go:509 component=tsdb msg="replaying WAL, this may take awhile"
level=info ts=2021-02-17T07:17:18.190Z caller=head.go:557 component=tsdb msg="WAL segment loaded" segment=0 maxSegment=0
level=info ts=2021-02-17T07:17:18.190Z caller=main.go:669 fs_type=EXT4_SUPER_MAGIC
level=info ts=2021-02-17T07:17:18.190Z caller=main.go:670 msg="TSDB started"
level=info ts=2021-02-17T07:17:18.190Z caller=main.go:740 msg="Loading configuration file" filename=/etc/prometheus/prometheus.yml
level=info ts=2021-02-17T07:17:18.191Z caller=main.go:768 msg="Completed loading of configuration file" filename=/etc/prometheus/prometheus.yml
level=info ts=2021-02-17T07:17:18.191Z caller=main.go:623 msg="Server is ready to receive web requests."
level=error ts=2021-02-17T07:17:23.192Z caller=scrape.go:352 component="scrape manager" scrape_pool=prometheus msg="creating targets failed" err="instance 0 in group 0: no address"

It seems that the scrape target is not correctly configured.
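For context, the "no address" error above means the generated scrape group had an empty target list. A minimal working fragment of /etc/prometheus/prometheus.yml would need an explicit host:port target; the host and port here are taken from the relation data shown earlier, not from the actual generated file:

```yaml
scrape_configs:
  - job_name: prometheus
    metrics_path: /metrics
    static_configs:
      - targets: ["mlflow:5000"]   # an explicit host:port; without it Prometheus reports "no address"
```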

egranell commented 3 years ago

Solved by adding the default target: https://github.com/mlopsworks/charms/blob/2609c7ee817361076821f7d5750e41a84ba31929/bundle.yaml#L18

egranell commented 3 years ago

I have not been able to see the Prometheus graphical interface, not even with a port redirection like the ones we did for mlflow, minio and kubeflow. But when requesting the targets page of the Prometheus server from within the cluster (curl --request GET "10.103.173.33:9090/targets"), we can see that the scraping is working without errors:

   <tbody>
          <tr>
            <td class="endpoint">
              <a href="http://mlflow:5000/metrics">http://mlflow:5000/metrics</a><br>
            </td>
            <td class="state">
              <span class="alert alert-success state_indicator text-uppercase">up</span>
            </td>
            <td class="labels">
              <span class="cursor-pointer" data-toggle="tooltip" title="" data-html=true data-original-title="<b>Before relabeling:</b><br>__address__=&quot;mlflow:5000&quot;<br>__metrics_path__=&quot;/metrics&quot;<br>__scheme__=&quot;http&quot;<br>job=&quot;prometheus&quot;">
                  <span class="badge badge-primary">instance="mlflow:5000"</span>
                  <span class="badge badge-primary">job="prometheus"</span>
              </span>
            </td>
            <td class="last-scrape">1.261s ago</td>
            <td class="scrape-duration">4.751ms</td>
            <td class="errors"></td>
          </tr>
        </tbody>
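For reference, the usual way to reach the Prometheus UI from outside the cluster is a kubectl port-forward; the commenter reports that this did not work in their setup, so this is only the standard starting point, with namespace and pod name taken from the kubectl logs command above:

```shell
# Forward local port 9090 to the prometheus-0 pod in the kf namespace,
# then browse http://localhost:9090/targets from the workstation.
kubectl port-forward -n kf prometheus-0 9090:9090
```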