taskcluster / taskcluster-rfcs

Taskcluster team planning
Mozilla Public License 2.0

Support for task-level metrics #85

Open indygreg opened 7 years ago

indygreg commented 7 years ago

"Collect metrics from automation" is a wheel that we keep reinventing. The following are used in Firefox CI:

By implementing metrics collection within tasks, we frequently deal with the following problems:

At the very least, I think TC should report resource utilization for tasks. Wall time. CPU time. Average CPU utilization. I/O counters. Maximum memory utilization. Etc. It doesn't have to be consistent across platforms. Report whatever you can collect easily and without significant probe overhead, and we can iterate from there.
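To make the list of measurements concrete, here is a minimal sketch of collecting them around a single child process using only the Python standard library. It is not how any Taskcluster worker does it; `run_with_metrics` and the metric names are made up for illustration, and the `resource` module is Unix-only, which fits the "doesn't have to be consistent across platforms" caveat.

```python
import resource
import subprocess
import sys
import time


def run_with_metrics(argv):
    """Run argv as a child process; return (exit_code, metrics dict).

    Hypothetical helper for illustration only. Unix-only, because it
    relies on the `resource` module.
    """
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.monotonic()
    exit_code = subprocess.run(argv).returncode
    wall = time.monotonic() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)

    # User + system CPU time consumed by waited-for children during the run.
    cpu = (after.ru_utime - before.ru_utime) + (after.ru_stime - before.ru_stime)
    metrics = {
        "wall_seconds": wall,
        "cpu_seconds": cpu,
        # Fraction of one core; can exceed 1.0 for multi-threaded children.
        "avg_cpu_utilization": cpu / wall if wall > 0 else 0.0,
        # High-water mark across children, not a delta.
        # Units differ by platform (KiB on Linux, bytes on macOS).
        "max_rss": after.ru_maxrss,
        # Block I/O operation counts, where the kernel reports them.
        "block_reads": after.ru_inblock - before.ru_inblock,
        "block_writes": after.ru_oublock - before.ru_oublock,
    }
    return exit_code, metrics


if __name__ == "__main__":
    code, metrics = run_with_metrics([sys.executable, "-c", "sum(range(10**6))"])
    print(code, metrics)
```

The counters here come from the kernel's accounting at process exit, so the probe overhead is close to zero. The trade-off is that you only get totals, not a time series you could line up against phases.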

I think it would be really rad if TC could recognize metrics data from special syntax in task output. For example, if a task emitted lines with `BEGIN_PHASE foo` and `END_PHASE foo`, TC could record the times of various phases and then use that for correlating to resource utilization, displaying timelines of events, etc. This would allow all tasks to code to a universal "metrics language" and metrics would "just work."
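A rough sketch of what recognizing that syntax could look like, assuming the worker already has a timestamp for each line of task output. The marker format is just the `BEGIN_PHASE` / `END_PHASE` example above; `parse_phases` and the tuple shapes are hypothetical, not an existing Taskcluster API.

```python
import re

# Matches the example markers from this proposal, e.g. "BEGIN_PHASE build".
PHASE_RE = re.compile(r"^(BEGIN_PHASE|END_PHASE)\s+(\S+)$")


def parse_phases(timestamped_lines):
    """Turn (timestamp_seconds, line) pairs into (name, start, end) tuples.

    Lines that are not phase markers are ignored, as is an END_PHASE
    with no matching BEGIN_PHASE. Phases still open at the end of the
    log are dropped in this sketch.
    """
    open_phases = {}
    phases = []
    for timestamp, line in timestamped_lines:
        match = PHASE_RE.match(line.strip())
        if match is None:
            continue
        marker, name = match.groups()
        if marker == "BEGIN_PHASE":
            open_phases[name] = timestamp
        elif name in open_phases:
            phases.append((name, open_phases.pop(name), timestamp))
    return phases


log = [
    (0.0, "BEGIN_PHASE checkout"),
    (1.0, "cloning repository"),
    (12.5, "END_PHASE checkout"),
    (12.5, "BEGIN_PHASE build"),
    (90.0, "END_PHASE build"),
]
for name, start, end in parse_phases(log):
    print(f"{name}: {end - start:.1f}s")
# checkout: 12.5s
# build: 77.5s
```

Keying open phases by name allows phases to overlap or nest, at the cost of not handling two concurrent phases with the same name.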

Random technical thoughts:

CC @luser

djmitche commented 7 years ago

Where would you like this data to end up, ultimately? Perfherder?

indygreg commented 7 years ago

Perfherder would be a good initial repository for some data. But Perfherder is aimed at tracking specific per-repository metrics over repository time. It can't do things like track aggregate counts of events across all tasks. (Maybe it can in the database. But the UI is heavily tailored towards things like Talos results.) I feel like we're abusing Perfherder for things like tracking build times and compiler warnings. When all you have is a hammer...

indygreg commented 7 years ago

FWIW, I have a half-concocted patch to add some really hacky output parsing to run-task. For the unaware, run-task is a (currently minimal) Python script that we attempt to make the entrypoint of many Firefox tasks. It essentially handles filesystem permission normalization (for caches), permissions dropping, and VCS checkout. If psutil were available to run-task (easy enough: add it to the base image) and all tasks used run-task, we could get what we are proposing to build here.

But with an in-task solution like run-task you still need to standardize on using run-task everywhere, need to put in the effort to make things like psutil work on all images, need to worry about updating run-task whenever it changes, etc. Once it is a ubiquitous, heavily used feature, it becomes a candidate for a built-in feature in the TaskCluster platform.