Support for task-level metrics

indygreg commented 7 years ago

"Collect metrics from automation" is a wheel that we keep reinventing. The following are used in Firefox CI:

The Firefox build system uses psutil to monitor CPU, I/O, memory, etc. throughout the build. Samples are taken every ~1.0s and recorded. The build system also emits special syntax log messages to denote "phases" and "events" so their times and resource usage can be isolated. This can emit a JSON file with raw results. Only a very crude display of the data is possible.
mozharness integrates with the same psutil-based Python package for monitoring mozharness steps.
Any task can emit PERFHERDER_DATA special syntax log lines that get picked up by Treeherder's log ingestion system. The raw data gets exposed on Perfherder.
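
For readers who have not seen one, here is a rough sketch of what emitting such a log line looks like. The field layout (`framework`, `suites`, `subtests`) reflects my understanding of the Perfherder schema and should be treated as an assumption; `perfherder_line` and the framework and suite names are hypothetical.

```python
import json


def perfherder_line(framework, suite, subtests):
    """Format one PERFHERDER_DATA log line.

    The JSON layout below is an assumption about the Perfherder schema,
    not a copy of it; check the real schema before relying on it.
    """
    blob = {
        "framework": {"name": framework},
        "suites": [
            {
                "name": suite,
                "subtests": [
                    {"name": name, "value": value}
                    for name, value in sorted(subtests.items())
                ],
            }
        ],
    }
    return "PERFHERDER_DATA: " + json.dumps(blob, sort_keys=True)


if __name__ == "__main__":
    # Hypothetical names and values, purely for illustration.
    print(perfherder_line("example_framework", "build_resources",
                          {"cpu_percent": 71.5, "wall_time_s": 1830.2}))
```

The task only has to print the line; everything else depends on the log being ingested intact, which is part of why this approach is fragile.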

By implementing metrics collection within tasks, we frequently deal with the following problems:

Bootstrapping psutil. It uses a binary C Python extension. Getting that working on some machines is painful.
Bucketing results. For example, build resource metrics are bucketed by worker name and EC2 instance type. If we don't do this, we get bi-modal data for varying EC2 instances or workers. (This is also a bit annoying because it is difficult to easily compare properties of different EC2 instances in Perfherder since they are separate data sets and you can't easily use wildcards.)
No unified metrics reporting mechanism. The PERFHERDER_DATA hack is the closest thing we have. We keep writing tools that watch things and emit PERFHERDER_DATA.
Event counts are hard. We can't easily get counts of specific events. You have to have a persistent process watching things. That process needs to do real-time parsing for events, aggregate them in memory, then emit a PERFHERDER_DATA blob at the end.
Hard to get metrics at beginning and end of tasks. Things that run before mozharness (like VCS operations) and after it (task cleanup) are essentially black holes when it comes to metrics data. We don't really know what we're doing and how (in)efficient it is.
Recovering from errors is hard. If a task encounters an error, it is very easy for the metrics data to go up in smoke.
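
To make the event-count and error-recovery problems concrete, this is roughly the kind of watcher we keep writing: parse output as it streams by, aggregate counts in memory, and emit one blob at the end. It is a minimal sketch; the patterns, the `watch` helper, and the framework and suite names are all hypothetical, and the blob layout is an assumption about the Perfherder schema.

```python
import json
import re
from collections import Counter

# Hypothetical event patterns; a real task would supply its own.
EVENT_PATTERNS = {
    "http_failure": re.compile(r"\bHTTP (5\d\d)\b"),
    "vcs_full_clone": re.compile(r"\bcloning\b", re.IGNORECASE),
}


def watch(lines, emit=print):
    """Pass lines through, count matching events, emit a summary at the end."""
    counts = Counter()
    try:
        for line in lines:
            for name, pattern in EVENT_PATTERNS.items():
                if pattern.search(line):
                    counts[name] += 1
            emit(line.rstrip("\n"))
    finally:
        # Emit even if reading the output raised, so the counts gathered
        # so far are not lost along with the task.
        blob = {
            "framework": {"name": "example_framework"},  # hypothetical
            "suites": [{
                "name": "event_counts",  # hypothetical
                "subtests": [
                    {"name": name, "value": counts[name]}
                    for name in sorted(EVENT_PATTERNS)
                ],
            }],
        }
        emit("PERFHERDER_DATA: " + json.dumps(blob, sort_keys=True))
    return counts
```

The `finally` block only helps when the watcher itself survives. If the watcher process is killed, the in-memory counts still disappear, which is the core weakness of doing this inside the task.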

At the very least, I think TC should report resource utilization for tasks. Wall time. CPU time. Average CPU utilization. I/O counters. Maximum memory utilization. Etc. It doesn't have to be consistent across platforms. Report what you can collect easily and without significant probe overhead, and we can iterate from there.
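
As a rough illustration of how cheap the basic numbers are to collect, here is a sketch that wraps a command and reports them afterwards. It uses Python's standard `resource` module rather than psutil, so it is Unix-only and only sees totals once the child has exited. `run_with_usage` and the key names are hypothetical, and a real worker would presumably do this outside the task.

```python
import resource
import subprocess
import sys
import time


def run_with_usage(argv):
    """Run a command and report coarse resource usage for it (Unix only)."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.monotonic()
    returncode = subprocess.call(argv)
    wall = time.monotonic() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)

    cpu = ((after.ru_utime - before.ru_utime)
           + (after.ru_stime - before.ru_stime))
    return {
        "returncode": returncode,
        "wall_time_s": wall,
        "cpu_time_s": cpu,
        # Average number of cores kept busy; can exceed 1.0.
        "avg_cpu_cores": cpu / wall if wall > 0 else 0.0,
        # High-water mark across all waited-for children so far.
        # Units differ by platform: kilobytes on Linux, bytes on macOS.
        "max_rss": after.ru_maxrss,
        "block_input_ops": after.ru_inblock - before.ru_inblock,
        "block_output_ops": after.ru_oublock - before.ru_oublock,
    }


if __name__ == "__main__":
    print(run_with_usage([sys.executable, "-c", "sum(range(10**6))"]))
```

This gives end-of-task totals only. Anything time-resolved, like the ~1.0s samples the build system records, still needs a sampler running alongside the task.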

I think it would be really rad if TC could recognize metrics data from special syntax in task output. For example, if a task emitted lines with BEGIN_PHASE foo and END_PHASE foo, TC could record the times of various phases and then use that for correlating to resource utilization, displaying timelines of events, etc. This would allow all tasks to code to a universal "metrics language" and metrics would "just work."
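
A minimal sketch of the phase idea, assuming the worker timestamps each output line as it arrives and that the markers appear alone on a line. `phase_durations` is a hypothetical helper, and the exact marker syntax is only the example from above, not a settled format.

```python
import re

# Matches lines like "BEGIN_PHASE checkout" or "END_PHASE checkout".
PHASE_RE = re.compile(r"^(BEGIN_PHASE|END_PHASE)\s+(\S+)\s*$")


def phase_durations(timestamped_lines):
    """Compute per-phase durations from (timestamp, line) pairs.

    Returns (durations, still_open): total seconds per completed phase,
    and the start time of any phase that never saw an END_PHASE.
    """
    open_phases = {}
    durations = {}
    for ts, line in timestamped_lines:
        match = PHASE_RE.match(line.strip())
        if not match:
            continue
        marker, name = match.groups()
        if marker == "BEGIN_PHASE":
            open_phases[name] = ts
        elif name in open_phases:
            elapsed = ts - open_phases.pop(name)
            durations[name] = durations.get(name, 0.0) + elapsed
        # An END_PHASE with no matching BEGIN_PHASE is ignored.
    return durations, open_phases
```

Returning the phases that never closed matters for failed tasks: a phase that was still open when the task died is often the most interesting one.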

Random technical thoughts:

Parsing task output for special metrics syntax is the simplest for tasks. Anyone can print to stdout, right? But it isn't the most robust. Data isn't structured. Text parsing is relatively expensive. A better solution would be to make a TCP socket or pipe available to the task where it could send metrics. Feel free to reuse the protocol from any number of system monitoring tools (like collectd) here.
The major missing feature from Firefox CI today is the ability to easily get counts. We can report times and scalar data via Perfherder decently enough. But it is hard to answer questions like "how many tasks are performing a full VCS clone vs an incremental pull" and "how often do we have an HTTP failure against S3?"
There is a non-trivial amount of code in mozharness and Treeherder devoted to parsing output for interesting patterns. Having the ability for a task to self-identify interesting patterns (e.g. abort:\s.*) and then having the worker managing that task parse for these and treat them specially is an interesting idea. It allows you to do things like automatically set anchors to "interesting" parts of logs and to create metrics from specific output patterns. The latter is useful when you don't have control over process output and need to invent a metrics signal from non-structured output.
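
To sketch the socket idea from the first thought above, including the counts from the second: the task writes small structured messages to a channel the worker provides, and the worker aggregates them. This uses newline-delimited JSON over a socket pair purely as an assumption. It is not collectd's protocol, and the message fields, metric names, and both helper functions are hypothetical.

```python
import json
import socket
from collections import Counter


def send_metric(sock, kind, name, value=1):
    """Task side: write one newline-delimited JSON message."""
    msg = {"type": kind, "name": name, "value": value}
    sock.sendall(json.dumps(msg).encode("utf-8") + b"\n")


def collect_metrics(sock):
    """Worker side: read until the task closes its end, then aggregate."""
    counts = Counter()
    scalars = {}
    with sock.makefile("r", encoding="utf-8") as stream:
        for line in stream:
            msg = json.loads(line)
            if msg["type"] == "count":
                counts[msg["name"]] += msg["value"]
            else:
                # Anything else is treated as a scalar; last value wins.
                scalars[msg["name"]] = msg["value"]
    return counts, scalars


if __name__ == "__main__":
    # Both ends live in one process here only to keep the demo short.
    worker_end, task_end = socket.socketpair()
    send_metric(task_end, "count", "vcs.full_clone")
    send_metric(task_end, "count", "s3.http_failure")
    send_metric(task_end, "scalar", "checkout.seconds", 41.5)
    task_end.close()
    print(collect_metrics(worker_end))
    worker_end.close()
```

Because the worker holds the aggregate, the counts survive a task that dies partway through, and the task never has to keep a persistent process around just to total things up.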

CC @luser

djmitche commented 7 years ago

Where would you like this data to end up, ultimately? Perfherder?

indygreg commented 7 years ago

Perfherder would be a good initial repository for some data. But Perfherder is aimed at tracking specific per-repository metrics over repository time. It can't do things like track aggregate counts of events across all tasks. (Maybe it can in the database. But the UI is heavily tailored towards things like Talos results.) I feel like we're abusing Perfherder for things like tracking build times and compiler warnings. When all you have is a hammer...

indygreg commented 7 years ago

FWIW, I have a half-concocted patch to add some really hacky output parsing to run-task. For the unaware, run-task is a (currently minimal) Python script that we attempt to make the entrypoint of many Firefox tasks. It essentially handles filesystem permission normalization (for caches), permissions dropping, and VCS checkout. If psutil were available to run-task (easy enough: add it to the base image) and all tasks used run-task, we could get what we are proposing to build here.

But with an in-task solution like run-task you still need to standardize on using run-task everywhere, put in the effort to make things like psutil work on all images, worry about updating run-task whenever it changes, etc. At the point where it is a ubiquitous and highly-used feature, it becomes a candidate for a built-in feature in the TaskCluster platform.

taskcluster / taskcluster-rfcs

Support for task-level metrics #85