taskcluster / taskcluster-rfcs


[RFC 160] Process monitoring #160

Closed: srfraser closed this issue 3 years ago

escapewindow commented 4 years ago

We may want to have a worker setting to disable this on hardware performance test pools.
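For illustration only (the field name and placement are hypothetical, not an existing worker option), such a setting could be a single per-pool flag in the worker's Go configuration, checked before the monitor is launched:

```go
package main

// Hypothetical sketch only: no such option exists in the workers today.
// The idea is a per-pool configuration flag, so that hardware performance
// test pools can opt out of the extra monitoring overhead entirely.

import "fmt"

type workerConfig struct {
	// ...existing worker configuration fields elided...
	DisableResourceMonitoring bool `json:"disableResourceMonitoring"`
}

func maybeStartMonitor(cfg workerConfig) {
	if cfg.DisableResourceMonitoring {
		fmt.Println("resource monitoring disabled for this worker pool")
		return
	}
	fmt.Println("starting the resource monitor alongside the task")
	// here the worker would launch the monitoring process for the task
}

func main() {
	// e.g. a hardware performance test pool would set the flag to true
	maybeStartMonitor(workerConfig{DisableResourceMonitoring: true})
}
```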

glandium commented 4 years ago

It could be interesting to have a way for the task itself to give some form of markers. We have that for build resource usage, and it allows us to know what part of the build is being processed.

srfraser commented 4 years ago

> It could be interesting to have a way for the task itself to give some form of markers. We have that for build resource usage, and it allows us to know what part of the build is being processed.

I couldn't think of a straightforward way to add this without some awkward IPC going on, although I'm open to suggestions. As far as I'm aware the payload itself is always in a state where it could be run outside a task and so its external communication is fairly generic: artifacts, logging, any necessary downloads.

We do already process the live log and produce timing information for each phase, so it's possible to link these up afterwards. For example, for task AUmp7fbpRmWB_Rniw0s5eA we discovered the following components and subcomponents:

component,subComponent
task,total
taskcluster,setup
taskcluster,task
vcs,update
mozharness,get-secrets,step
mozharness,build,step
build_metrics,configure
build_metrics,pre-export
build_metrics,export
build_metrics,compile
build_metrics,misc
build_metrics,libs
build_metrics,tools
build_metrics,package-generated-sources
build_metrics,package
build_metrics,upload
taskcluster,teardown
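
A rough sketch of how that linking-up could work after the fact, assuming the log carries timestamped phase markers; the marker format, regex and field names below are illustrative, not the actual live-log format or parsing code:

```go
package main

// Illustrative only: the real live-log format and phase markers differ.
// The idea is to turn timestamped marker lines into per-phase durations
// that can later be joined against the resource samples.

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
	"time"
)

// Hypothetical marker: "<RFC3339 timestamp> PHASE <component> <subComponent> <start|end>"
var marker = regexp.MustCompile(`^(\S+) PHASE (\S+) (\S+) (start|end)$`)

type phase struct{ component, sub string }

func phaseDurations(log string) map[phase]time.Duration {
	starts := map[phase]time.Time{}
	durations := map[phase]time.Duration{}
	scanner := bufio.NewScanner(strings.NewReader(log))
	for scanner.Scan() {
		m := marker.FindStringSubmatch(scanner.Text())
		if m == nil {
			continue
		}
		ts, err := time.Parse(time.RFC3339, m[1])
		if err != nil {
			continue
		}
		p := phase{component: m[2], sub: m[3]}
		if m[4] == "start" {
			starts[p] = ts
		} else if begin, ok := starts[p]; ok {
			durations[p] = ts.Sub(begin)
		}
	}
	return durations
}

func main() {
	log := "2020-03-05T12:00:00Z PHASE build_metrics compile start\n" +
		"2020-03-05T12:10:30Z PHASE build_metrics compile end\n"
	for p, d := range phaseDurations(log) {
		fmt.Printf("%s,%s took %s\n", p.component, p.sub, d)
	}
}
```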

srfraser commented 4 years ago

> We may want to have a worker setting to disable this on hardware performance test pools.

It's certainly an option. The existing mixin-based resource monitoring already runs for these tasks, since it doesn't differentiate between pools, but that doesn't mean it's always a good thing.

srfraser commented 4 years ago

From my understanding so far:

  1. We're leaning towards a separate tool that the worker knows how to interact with, partly for cross-worker issues and partly for ease of third-party Taskcluster deployments. This also answers 'how do we turn it off on some worker types?': we just don't deploy the binary there.
  2. The output format needs some changes relating to versioning and field types (see the sketch after this list).
  3. Some questions about what would be monitored differently in docker-worker need more discussion.
  4. Data retention and storage costs need to be addressed; compressing the artifact mitigates this.
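
On point 2, a hedged sketch of what a versioned, explicitly typed output artifact could look like, written as Go types with JSON tags; the field names, units and version scheme are placeholders rather than a settled format:

```go
package main

// Placeholder schema only: field names, units and the version scheme are
// illustrative, not an agreed-upon artifact format.

import (
	"encoding/json"
	"fmt"
	"time"
)

type resourceSample struct {
	Timestamp      time.Time `json:"timestamp"`
	CPUPercent     float64   `json:"cpuPercent"`
	MemoryBytes    uint64    `json:"memoryBytes"`
	DiskReadBytes  uint64    `json:"diskReadBytes"`
	DiskWriteBytes uint64    `json:"diskWriteBytes"`
}

type monitoringArtifact struct {
	Version         int              `json:"version"`         // bumped on breaking format changes
	IntervalSeconds int              `json:"intervalSeconds"` // sampling resolution used for this task
	Samples         []resourceSample `json:"samples"`
}

func main() {
	artifact := monitoringArtifact{
		Version:         1,
		IntervalSeconds: 30,
		Samples: []resourceSample{{
			Timestamp:   time.Now().UTC(),
			CPUPercent:  42.5,
			MemoryBytes: 2 << 30, // 2 GiB
		}},
	}
	out, _ := json.MarshalIndent(artifact, "", "  ")
	fmt.Println(string(out))
}
```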

What have I missed?

catlee commented 4 years ago

I really doubt we need to store 1s resolution resource data. Probably every 15s or 30s is sufficient for most of our use cases.
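For scale, a two-hour task sampled every second is 7,200 data points per metric; at 15 s that drops to 480, and at 30 s to 240.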

petemoore commented 4 years ago

Simon, Mihai and I met today, and on reflection we think that collecting the metrics as part of the run-task machinery may afford more flexibility and fewer barriers to delivery.

Simon had the great idea of writing the tool in Go and shipping it as a standalone, statically linked executable per platform. This could be mounted in place by run-task, like it mounts toolchains, and the Go source code could live in-tree. If that source code changes, the tool would automatically be rebuilt and run-task would automatically get the latest built version.

This offers some advantages:

My feeling is that making it part of the platform would only be advantageous if the custom behaviour were highly tied to the cloud environment that the task runs in (e.g. different behaviour required for gcp/aws/azure workers etc.), if the metrics collection needed to run with higher privileges than the task, or if the tool would be difficult to share between projects but was generic enough that we'd want it running everywhere, without a lot of customisation, indefinitely, with no anticipated change to the metrics collection over time. I don't think any of these conditions apply.

So my vote is that we do this in-task, set up the Go tool as a toolchain that gets built by the existing toolchain-building mechanics we have in Firefox, put all the code in mozilla-central, and make it transparent and self-serve for devs.
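
As a rough illustration of how small such a standalone Go tool could be, here is a hedged sketch of a cross-platform sampler built on the gopsutil library; the sampling interval, output file name and recorded fields are placeholders, not a proposed design:

```go
package main

// Hedged sketch of a standalone monitor: sample basic system metrics at a
// fixed interval and append them as JSON lines until the process is told to
// stop. Not the actual proposed tool.

import (
	"encoding/json"
	"os"
	"os/signal"
	"time"

	"github.com/shirou/gopsutil/v3/cpu"
	"github.com/shirou/gopsutil/v3/mem"
)

type sample struct {
	Timestamp   time.Time `json:"timestamp"`
	CPUPercent  float64   `json:"cpuPercent"`
	MemoryUsed  uint64    `json:"memoryUsedBytes"`
	MemoryTotal uint64    `json:"memoryTotalBytes"`
}

func main() {
	out, err := os.Create("resource-monitor.jsonl") // would be uploaded as a task artifact
	if err != nil {
		panic(err)
	}
	defer out.Close()
	enc := json.NewEncoder(out)

	stop := make(chan os.Signal, 1)
	signal.Notify(stop, os.Interrupt)

	ticker := time.NewTicker(30 * time.Second) // interval is a placeholder
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			percent, _ := cpu.Percent(0, false) // average since the previous call
			vm, _ := mem.VirtualMemory()
			s := sample{Timestamp: time.Now().UTC()}
			if len(percent) > 0 {
				s.CPUPercent = percent[0]
			}
			if vm != nil {
				s.MemoryUsed = vm.Used
				s.MemoryTotal = vm.Total
			}
			enc.Encode(&s)
		}
	}
}
```

A pure-Go sampler like this can typically be built with CGO_ENABLED=0 into a single static binary per platform, which is what would make the mount or fetch distribution model straightforward.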

glandium commented 4 years ago

There's a chicken-and-egg problem with that approach, though: toolchain tasks use run-task. I guess we could just say that we don't care about the stats in that case. The other problem is that docker-worker doesn't support mounts, so any change to the Go program would need to be baked into docker images, alongside run-task... which is another chicken-and-egg problem, because the docker image used to build toolchains would itself need to have it baked in. Well, I guess we could build the Go program with an external docker image... (if docker-worker could support mounts, that would be even better)

srfraser commented 4 years ago

> There's a chicken-and-egg problem with that approach, though: toolchain tasks use run-task. I guess we could just say that we don't care about the stats in that case.

Hm, a circular dependency on the monitoring, but yes, as you suggest it may not have a huge resource impact.

> The other problem is that docker-worker doesn't support mounts, so any change to the Go program would need to be baked into docker images, alongside run-task... which is another chicken-and-egg problem, because the docker image used to build toolchains would itself need to have it baked in. Well, I guess we could build the Go program with an external docker image... (if docker-worker could support mounts, that would be even better)

Using fetches means run-task just downloads the required artifacts over HTTPS, so that could be a reasonable alternative to mounts.

glandium commented 4 years ago

But then you don't get resource monitoring for the fetches and everything that precedes them. (I think we clone before fetches.)

djmitche commented 3 years ago

closing as stale, but still here for reference / revival