Open indygreg opened 7 years ago
Where would you like this data to end up, ultimately? Perfherder?
Perfherder would be a good initial repository for some data. But Perfherder is aimed at tracking specific per-repository metrics over repository time. It can't do things like track aggregate counts of events across all tasks. (Maybe it can in the database. But the UI is heavily tailored towards things like Talos results.) I feel like we're abusing Perfherder for things like tracking build times and compiler warnings. When all you have is a hammer...
FWIW, I have a half-concocted patch to add some really hacky output parsing to `run-task`. For the unaware, `run-task` is a (currently minimal) Python script that we attempt to make the entrypoint of many Firefox tasks. It essentially handles filesystem permission normalization (for caches), permissions dropping, and VCS checkout. If `psutil` were available to `run-task` (easy enough: add it to the base image) and all tasks used `run-task`, we could get what we are proposing building here.
But with an in-task solution like `run-task`, you still need to standardize on using `run-task` everywhere, put in the effort to make things like `psutil` work on all images, worry about updating `run-task` whenever it changes, etc. At the point where it is a ubiquitous and highly used feature, it becomes a candidate for a built-in feature in the TaskCluster platform.
"Collect metrics from automation" is a wheel that we keep reinventing. The following are used in Firefox CI:
- `PERFHERDER_DATA` special syntax log lines that get picked up by Treeherder's log ingestion system. The raw data gets exposed on Perfherder.
By implementing metrics collection within tasks, we frequently deal with the following problems:
- ... `PERFHERDER_DATA` hack is the closest thing we have. We keep writing tools that watch things and emit `PERFHERDER_DATA`.
- ... `PERFHERDER_DATA` blob at the end.
At the very least, I think TC should report resource utilization for tasks: wall time, CPU time, average CPU utilization, I/O counters, maximum memory utilization, etc. It doesn't have to be consistent across platforms. Report what you can collect easily and without significant probe overhead, and we can iterate from there.
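To make the "wall time, CPU time, maximum memory" idea concrete, here is a minimal sketch of a wrapper that measures one child process. It is not how TC or `run-task` does anything today. The function name `run_and_measure` and the metric key names are made up for illustration, and it uses Python's stdlib `resource` module (Unix only) rather than `psutil`, so it only sees coarse per-child totals.

```python
import resource
import subprocess
import sys
import time


def run_and_measure(argv):
    """Run argv as a child process and return coarse resource metrics.

    Hypothetical helper: the name and the keys of the returned dict are
    illustrative, not an existing TaskCluster or run-task interface.
    """
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.monotonic()
    returncode = subprocess.run(argv).returncode
    wall = time.monotonic() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)

    # CPU time is user + system time accumulated by waited-for children.
    cpu = (after.ru_utime - before.ru_utime) + (after.ru_stime - before.ru_stime)
    return {
        "returncode": returncode,
        "wall_seconds": wall,
        "cpu_seconds": cpu,
        "avg_cpu_utilization": cpu / wall if wall > 0 else 0.0,
        # High-water mark over all waited-for children so far;
        # the unit is platform dependent (KiB on Linux, bytes on macOS).
        "max_rss": after.ru_maxrss,
        "block_input_ops": after.ru_inblock - before.ru_inblock,
        "block_output_ops": after.ru_oublock - before.ru_oublock,
    }


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example: python measure.py make -j8
    print(run_and_measure(sys.argv[1:]))
```

This has the same limits described above: it only helps if every task goes through the wrapper, and the numbers are not consistent across platforms.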
I think it would be really rad if TC could recognize metrics data from special syntax in task output. For example, if a task emitted lines with `BEGIN_PHASE foo` and `END_PHASE foo`, TC could record the times of various phases and then use that for correlating to resource utilization, displaying timelines of events, etc. This would allow all tasks to code to a universal "metrics language" and metrics would "just work."
Random technical thoughts:
- ... `abort:\s.*`) and then having the worker managing that task parse for these and treat them specially is an interesting idea. It allows you to do things like automatically set anchors to "interesting" parts of logs and to create metrics from specific output patterns. The latter is useful when you don't have control over process output and need to invent a metrics signal from non-structured output.
CC @luser
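A rough sketch of what worker-side parsing for both ideas (the `BEGIN_PHASE` / `END_PHASE` markers and task-supplied regular expressions such as `abort:\s.*`) might look like. Everything here is an assumption for illustration: the function name `scan_log`, the exact marker grammar, and the premise that the worker timestamps each output line itself.

```python
import re
from datetime import datetime, timedelta

# Assumed marker grammar: "BEGIN_PHASE <name>" / "END_PHASE <name>" alone on a line.
PHASE_RE = re.compile(r"^(BEGIN_PHASE|END_PHASE)\s+(\S+)\s*$")


def scan_log(lines, patterns=()):
    """Scan (timestamp, text) pairs for phase markers and pattern matches.

    Returns (phases, anchors) where phases maps a phase name to its duration
    in seconds and anchors lists (line_number, pattern, text) for each line
    that matched one of the supplied regular expressions.
    """
    compiled = [re.compile(p) for p in patterns]
    open_phases = {}  # phase name -> timestamp of its BEGIN_PHASE line
    phases = {}
    anchors = []
    for lineno, (ts, text) in enumerate(lines, start=1):
        m = PHASE_RE.match(text)
        if m:
            kind, name = m.groups()
            if kind == "BEGIN_PHASE":
                open_phases[name] = ts
            elif name in open_phases:
                # END_PHASE without a matching BEGIN_PHASE is ignored.
                phases[name] = (ts - open_phases.pop(name)).total_seconds()
            continue
        for rx in compiled:
            if rx.search(text):
                anchors.append((lineno, rx.pattern, text))
                break  # record at most one anchor per line
    return phases, anchors


# Usage with a made-up log; timestamps would come from the worker.
t0 = datetime(2017, 1, 1, 12, 0, 0)
example_log = [
    (t0, "BEGIN_PHASE build"),
    (t0 + timedelta(seconds=5), "compiling"),
    (t0 + timedelta(seconds=7), "abort: repository not found"),
    (t0 + timedelta(seconds=30), "END_PHASE build"),
]
phases, anchors = scan_log(example_log, [r"abort:\s.*"])
```

The phase durations could then be lined up against resource utilization samples, and the anchors used to link directly to the "interesting" lines of a log.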