tensorchord / envd

πŸ•οΈ Reproducible development environment
https://envd.tensorchord.ai/
Apache License 2.0
1.93k stars 156 forks

feat: Collect metrics in the environment and build monitor dashboard #669

Open VoVAllen opened 1 year ago

VoVAllen commented 1 year ago

Description

Support collecting related metrics (cpu/gpu/memory/disk) and demonstrate it to users


Message from the maintainers:

Love this enhancement proposal? Give it a πŸ‘. We prioritise the proposals with the most πŸ‘.

aseaday commented 1 year ago

I think we could build an exporter, but we need a good design, because we should reduce the difficulty of analyzing the data: users should be able to see at a glance what they could do to improve efficiency or progress. Another thing I care about is whether we should also monitor the model or computing flow in the applications users run.

VoVAllen commented 1 year ago

Sure. We're considering something like a probe that can also be integrated with existing observability platforms.

For the second question, what kind of monitoring are you referring to?

aseaday commented 1 year ago

Let me state it clearly. For the first question, I think we should also provide a simple all-in-one index, such as whether a user's computing device is busy or free. For the second question, I have developed tools like Jaeger to trace our machine learning application's workflow. [screenshot] For an app running in envd, could we provide a Python agent or sidecar to detect the computing ops in popular frameworks? Or we could just offer a standard format and display the computing-operator flow, and let users develop the probe that extracts the computation graph for the framework they are using.

VoVAllen commented 1 year ago

@aseaday I think generating a computation graph from a running application is still an open problem, which cannot be done in a non-intrusive way. An alternative is probably using Nsight Systems to collect runtime metrics (we can provide an envd script for this), so that users can see the execution time of each operator. Does this work in your scenario?

aseaday commented 1 year ago

Nsight works for me but seems a little complex for ordinary data scientists, our target audience. Maybe we should talk with our target users: what do they want to learn from a monitor?

gaocegege commented 1 year ago

Yep, we can. But I am not sure they can tell us; they may not know which metrics are helpful for them either.

gaocegege commented 1 year ago

Some research: https://github.com/tensorchord/envd/issues/218

aseaday commented 1 year ago

> Some research: #218

An agent is better for general-purpose use.

aseaday commented 1 year ago

I think we need a better design for how to preprocess and present the metrics to users. Otherwise, it is not meaningful to just hand them a Grafana dashboard.

gaocegege commented 1 year ago

Yep, I think so. We need a proposal for the feature.

aseaday commented 1 year ago

I will write the proposal and build a basic exporter this week.

zwpaper commented 1 year ago

why not base our work on https://github.com/prometheus/node_exporter?

aseaday commented 1 year ago

Good question. I have been thinking a lot about observability while working on the top command. We could divide metrics tools into three types:

An exporter is a good mechanism in a Prometheus environment. But for envd users, the hard problem is:

Our users may only need a few simple indicators to quickly know the processing status. So the hard part is not collecting as many metrics as possible. It is about:

Hope this helps explain why I gave up on the exporter when writing top.

zwpaper commented 1 year ago

@aseaday thanks so much for the quick reply! I totally get your point about writing a brand-new top for envd.

but Prometheus stack also has some attractive advantages, like:

  1. metrics history
  2. UI would be much more user-friendly
  3. extensibility, with many exporters already existing

the main concerns are the resource cost and the learning curve, but

we have a proposal for k8s https://github.com/tensorchord/envd/pull/303, it may be a good choice if we have an existing Prometheus and Grafana to use.

what's more, we are considering observability (https://github.com/tensorchord/envd/issues/151), where long-term storage like Prometheus would be much more useful

gaocegege commented 1 year ago

I think prom works for us, though we may need our own exporter, because we may want to collect more than GPU/CPU hardware metrics.

Thus I am wondering whether we should introduce a separate daemon process in the container to act as the exporter.
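To make the daemon idea concrete, here is a minimal sketch of what such an in-container exporter could look like, using only the Python standard library and serving the Prometheus text exposition format. The metric names (`envd_*`) and the port are illustrative assumptions, not an agreed-upon schema.

```python
# Hypothetical sketch: an in-container metrics daemon that exposes a
# /metrics endpoint in the Prometheus text format, stdlib only.
# Metric names (envd_*) and port 9400 are assumptions for illustration.
import os
import time
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics() -> str:
    """Render a few host metrics in the Prometheus text exposition format."""
    load1, _load5, _load15 = os.getloadavg()  # Unix-only
    lines = [
        "# HELP envd_load1 1-minute load average inside the environment",
        "# TYPE envd_load1 gauge",
        f"envd_load1 {load1}",
        "# HELP envd_uptime_seconds Seconds since this exporter started",
        "# TYPE envd_uptime_seconds counter",
        f"envd_uptime_seconds {time.monotonic()}",
    ]
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    # A Prometheus server would scrape http://<container>:9400/metrics
    HTTPServer(("", 9400), MetricsHandler).serve_forever()
```

GPU or framework-level metrics would need extra collectors, but the exposition side stays this simple either way.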

aseaday commented 1 year ago

@zwpaper @gaocegege Before we introduce prom into our user flow, we should consider the cost and method of bootstrapping prom. Should we start a prom daemon? And where should prom run: docker/runc/k8s? Our users may not develop apps on a server, and may turn their computer/laptop off after work, which is something prom is not really designed for.

As for the exporter, I support an exporter as a metrics collector inside the container, for indicators that can't be collected from the container runtime endpoint. Otherwise, we would have to merge the exporters from all containers into one unified exporter, or auto-register them with prom. But I am not sure the latter is possible for a runtime like k8s.

zwpaper commented 1 year ago

actually, if we do not care about the HA part (and a dev env should be OK without HA), a prom could easily be run with one command. the resource cost would sometimes be a problem, but if we consider the use case where users run their envs in the cloud or share a bare-metal machine, the cost may not be that heavy.

as for the exporter part, an exporter is designed to be low cost; it should be OK to run one in each container, and this would make the problem easy to solve.

PS, a brainstorm came up: should envd develop something like docker-compose to create multiple containers at a time πŸ˜‚

zwpaper commented 1 year ago

> we may need our own exporter.

@gaocegege totally agree, this is why I said we should build one based on node exporter.

zwpaper commented 1 year ago

BTW, @aseaday although the discussion focused on the prom part, top is great! and it should always be one of the tools we present to users.

just thinking about whether we can do something more.

aseaday commented 1 year ago

> BTW, @aseaday although the discussion focused on the prom part, top is great! and it should always be one of the tools we present to users.
>
> just thinking about whether we can do something more.

the current framework is open to using prom. But we could add a prom daemon in bootstrap first.

zwpaper commented 1 year ago

I can handle the prom-related work, but we must be on the same page before starting. I will open a proposal later to describe and discuss how we do prom.

gaocegege commented 1 year ago

@zwpaper Maybe we can discuss it further in discord to make sure that we are on the same page.

kemingy commented 1 year ago

I would recommend https://github.com/VictoriaMetrics/VictoriaMetrics as a way to store the metrics.

But I'm still not very sure if all the metrics we need can be stored in a way like Prometheus.

gaocegege commented 1 year ago

I think it should be push-based in envd. I do not know if the prom Pushgateway is mature now.
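For context, push-based delivery could be as small as the sketch below, which builds an HTTP PUT against the Pushgateway's `/metrics/job/<job>` path using only the standard library. The gateway address, job name, and metric name are assumptions for illustration.

```python
# Hedged sketch of push-based metric delivery to a Prometheus Pushgateway.
# The gateway URL, job name, and metric names are illustrative assumptions;
# the /metrics/job/<job> path follows the Pushgateway HTTP API.
import urllib.request


def build_push_request(gateway: str, job: str, metrics: dict) -> urllib.request.Request:
    """Build an HTTP PUT that replaces all metrics for `job` on the gateway."""
    body = "".join(f"{name} {value}\n" for name, value in metrics.items())
    return urllib.request.Request(
        url=f"{gateway}/metrics/job/{job}",
        data=body.encode(),
        method="PUT",
        headers={"Content-Type": "text/plain"},
    )


if __name__ == "__main__":
    req = build_push_request(
        "http://localhost:9091", "envd_env", {"envd_gpu_util": 0.42}
    )
    urllib.request.urlopen(req)  # fails unless a Pushgateway is listening
```

A push model fits the laptop use case mentioned above: the environment pushes while it is alive, and nothing has to scrape a machine that may be switched off.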

gaocegege commented 1 year ago

@zwpaper Would you mind joining our discord and discuss it in #envd-dev?

gaocegege commented 1 year ago

https://discord.com/invite/KqswhpVgdU

gaocegege commented 1 year ago

> PS, a brainstorm came up: should envd develop something like docker-compose to create multiple containers at a time πŸ˜‚

It works if the "sidecar" container shares the same PID namespace with the envd container. Then we can get the process info from it.
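As a rough illustration of what a sidecar in the shared PID namespace would see, every process in the envd container becomes visible under `/proc` (Linux-only; the wiring itself, e.g. `--pid=container:<name>` in Docker, is outside this sketch):

```python
# Sketch: processes visible to a sidecar that shares the envd container's
# PID namespace. On Linux, each process appears as /proc/<pid>.
import os


def list_processes() -> list:
    """Return pid and command name for each process visible in /proc."""
    procs = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue  # skip non-process entries like /proc/meminfo
        try:
            with open(f"/proc/{entry}/comm") as f:
                comm = f.read().strip()
        except OSError:
            continue  # process exited between listdir and open
        procs.append({"pid": int(entry), "comm": comm})
    return procs
```

From there, per-process CPU and memory can be read from `/proc/<pid>/stat` and `/proc/<pid>/status` without touching the envd container's image at all.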

kemingy commented 1 year ago
zwpaper commented 1 year ago

Wow! @kemingy thanks so much for bringing up VictoriaMetrics, it really seems to be a great project and a good replacement for prom!

aseaday commented 1 year ago

I think this is the whole landscape envd needs. There are some points to watch out for.