parca-dev / parca

Continuous profiling for analysis of CPU and memory usage, down to the line number and throughout time. Saving infrastructure cost, improving performance, and increasing reliability.
https://parca.dev/
Apache License 2.0

Appetite for a query language? #284

Open zecke opened 2 years ago

zecke commented 2 years ago

One use-case I would like to experiment with is to be able to answer questions across a larger set of deployments to drive optimization efforts. Some of the queries might be along the lines of:

The result would be a flat report rather than a flamegraph. I wonder whether to approach this by introducing a query language. This requires more thought, but at a high level it could look something like this:

topk(10, merge by (binary) (cpu_profile{binary="frx"}[28d]))
topk(10, merge by (binary, function) (allocations{job="abc"}[1d]))

Or something more advanced like finding the binaries that allocate most memory in a specific function?

topk(10, merge by (binary) (select(allocations{job="abc"}, {function=~".*runtime.malloc.*"})[28d]))
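To make the proposed semantics concrete, here is a minimal sketch of what `merge by (binary)` followed by `topk` could mean over toy profile samples. The label names, sample shape, and values below are illustrative assumptions, not Parca's actual data model or API.

```python
# Hypothetical sketch of "merge by (<labels>)" + "topk(k, ...)" semantics.
# Sample shape and labels are assumptions for illustration only.
from collections import defaultdict

# Each sample: (labels, cumulative value, e.g. CPU nanoseconds).
samples = [
    ({"binary": "frx", "function": "runtime.mallocgc"}, 500),
    ({"binary": "frx", "function": "main.handle"}, 300),
    ({"binary": "abc", "function": "runtime.mallocgc"}, 900),
    ({"binary": "abc", "function": "main.serve"}, 100),
]

def merge_by(samples, labels):
    """Aggregate sample values, keeping only the given label names."""
    merged = defaultdict(int)
    for lbls, value in samples:
        key = tuple((name, lbls[name]) for name in labels)
        merged[key] += value
    return dict(merged)

def topk(k, merged):
    """Return the k largest aggregates as (label-key, value) pairs."""
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)[:k]

# topk(10, merge by (binary) (...)) over the toy samples:
print(topk(10, merge_by(samples, ["binary"])))
# → [((('binary', 'abc'),), 1000), ((('binary', 'frx'),), 800)]
```

The point of the sketch is that `merge by` drops all labels except the grouping key before summing, so `topk` then ranks whole binaries (or binary/function pairs) rather than individual stack traces.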
thorfour commented 2 years ago

Yes! This is definitely something we want to add. Thanks for opening up an issue about it as it's definitely something we need to track.

metalmatze commented 2 years ago

As there are many things we don't understand yet, we want to write an in-depth design doc discussing various details for a query language in the next months.

For now, we are probably going to focus on persistent storage a bit more. Still, we really want this!

brancz commented 2 years ago

Fun fact, we already have a language and a parser, it's just super small right now. It's how autocompletion works today. I always imagined there to be an "advanced" mode that was just a plain query input a la Prometheus, that doesn't use any of the guiding UI elements.

Let's use this issue as a place to collect use cases. My top use case that I would like to see, that I cannot do today: I already know the function name of the function that I want to optimize (for example through distributed tracing), so I want to see all data merged that includes traces that include that function, visualized as a flamegraph.

brancz commented 2 years ago

Raw thoughts, and it's perfectly possible I'm completely wrong (still developing my thinking): I think function selection should be a secondary filter of some sort. My thinking is that it would let us do something like:

merge(cpu{job="abc", version="v0.1.0"}) - merge(cpu{job="abc", version="v0.1.1"}) | function="functionThatITriedToOptimize"

(the - would be a diff, because that's what it effectively is, though maybe it should be a function to distinguish absolute and relative diffs)

Not saying that I necessarily like this notation, but I think it demonstrates why I think it should be a "second step" filter.
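A minimal sketch of this two-step pipeline, assuming toy stack traces keyed as tuples of function names: diff two merged profiles first, then apply the function filter as a secondary step. The function name and values are placeholders, not real data.

```python
# Sketch of "merge - merge | function=..." as a two-step pipeline.
# Stack keys and values are toy assumptions for illustration.

def diff(before, after):
    """Absolute diff keyed by stack trace: positive means 'after' grew."""
    keys = set(before) | set(after)
    return {k: after.get(k, 0) - before.get(k, 0) for k in keys}

def filter_function(profile, function):
    """Secondary filter: keep only stacks that include the given function."""
    return {stack: v for stack, v in profile.items() if function in stack}

# Two merged profiles for versions v0.1.0 and v0.1.1 (made-up numbers).
v010 = {("main", "functionThatITriedToOptimize"): 800, ("main", "other"): 100}
v011 = {("main", "functionThatITriedToOptimize"): 500, ("main", "other"): 120}

delta = filter_function(diff(v010, v011), "functionThatITriedToOptimize")
print(delta)  # → {('main', 'functionThatITriedToOptimize'): -300}
```

Because the filter runs after the diff, the comparison is computed over the full profiles, and the function selector only narrows what is displayed, which is exactly why it works as a "second step".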

yeya24 commented 2 years ago

One thing I found hard to understand is the query data model in Parca: the query_range API returns metric series, but the query API only returns one profile (or one merged profile).

In this case, what's the meaning of cpu_profile{binary="frx"}[28d]? Is this range of metrics or the merged profile?

brancz commented 2 years ago

Yeah, I think it can be confusing because query_range and query don't have the same relationship as in Prometheus, but I do think the query_range possibilities will change quite a bit. I imagine being able to visualize the top_k stack traces over time so that the current query_range will actually become sum(<current-query-selector>).

zecke commented 2 years ago

Makes sense. One additional use-case might be release/roll-out qualification. This might be a bit far-fetched, but in a canary judge I would like to know whether the canary is (significantly) less efficient than before (or than the other running tasks).

Questions: What counts as "efficient"? Number of samples? Can one weight it? Ideally something like (averaged) cost per query (which might require combining Parca and Prometheus) over a period of time?

brancz commented 2 years ago

I'd like to think we can get quite far knowing the duration, period and samples and using that for relative comparisons, but I agree the moment where the canary is not an equal participant in the system it gets significantly harder to judge. I think the need for weighting is inevitable.
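One possible weighting, sketched below under the assumption that both CPU sample totals (from Parca) and served query counts (e.g. from Prometheus) are available for the same window: normalize CPU time by queries served, then compare canary against baseline. All numbers are made up.

```python
# Sketch: weight CPU cost by traffic so an unequal canary can still be judged.
# All metric values below are invented for illustration.

def cost_per_query(cpu_nanoseconds, queries):
    """Average CPU cost per served query over the same time window."""
    return cpu_nanoseconds / queries

# Baseline handles 10x the traffic of the canary in this toy example.
baseline = cost_per_query(cpu_nanoseconds=9.0e12, queries=1_000_000)
canary = cost_per_query(cpu_nanoseconds=1.2e12, queries=100_000)

# Relative regression of the canary versus baseline.
regression = (canary - baseline) / baseline
print(f"{regression:+.0%}")  # → +33%
```

This is the "cost per query" idea from the comment above: raw sample counts alone would make the low-traffic canary look cheap, while the weighted view surfaces the regression.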

brancz commented 2 years ago

I think some things are starting to crystallize for me. Primarily that the language should revolve around selection, aggregation, and manipulation of stack traces, as opposed to treating "profiles" as the unit (stack traces with a selector attached to them are the unit instead).

If we think of it that way, there is no longer a distinction between merging and not merging; everything becomes an aggregation of stack traces, either at a specific point in time or across time. Happy little accident that this is how the selectors have happened to work so far.

A couple of additional things I think we need to be able to express (some of these require changes to the general UX of querying, not just a query language, but I think the two go hand in hand):

Any combination of these should be diff-able against each other.

sudeep-ib commented 2 years ago

Agree with all of the above ^^

I would also love to see how the Parca query language could be used to: [1] write rules that can be used for generating alerts, and [2] query from a Grafana plugin to compare existing stats (like CPU and memory latencies) with selected profile time series.

let me know if that does not make sense :)

brancz commented 2 years ago

1) Could you explain what kind of alerting would make sense to you?

2) Makes perfect sense to me. I think we're still trying to figure out the query patterns and UX before we start integrations into other systems, which would just make maintenance harder.

sudeep-ib commented 2 years ago

> I think we're just still trying to figure out the query patterns and UX before we start integrations into other systems which just makes maintenance harder.

Yes @brancz, that makes sense! I was suggesting this as something to consider in the mid-term as the project matures. There may also be a case for seeing how other tools like Grafana might be open to extending in this direction, to complement Parca's strength as a datastore (profiling could be a great add-on from their point of view too).

I have just started to use Parca here - so take my suggestions with a grain of salt :)

On [1], my thought was that we could measure things like time spent in mutex contention or locks, or ttot (in Python) spent in a function over some cycles. We could use this together with alerts to highlight regressions or some bad state that the code led into. We will have more concrete ideas here as we start using this more!

metalmatze commented 1 year ago

@javierhonduco and I just had a conversation about the use case for https://github.com/parca-dev/parca-agent/pull/1001. Essentially, it boils down to querying the percentage of time spent in a specific function over time. In the end, a time series showing that percentage over 2 or 4 weeks would show the continued effort in performance improvements.

  sum(parca_agent_cpu:samples:count:cpu:nanoseconds{job="parca", function="debug/elf.Open"}) 
/
  sum(parca_agent_cpu:samples:count:cpu:nanoseconds{job="parca"})

  sum(parca_agent_cpu:samples:count:cpu:nanoseconds{job="parca", function=~"debug/elf.*"}) 
/
  sum(parca_agent_cpu:samples:count:cpu:nanoseconds{job="parca"})

  sum by(rollout) (parca_agent_cpu:samples:count:cpu:nanoseconds{job="parca", function=~"debug/elf.*"}) 
/
  sum by(rollout) (parca_agent_cpu:samples:count:cpu:nanoseconds{job="parca"})
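The ratio queries above boil down to point-wise division of two time series: samples matching the function selector over all samples. A minimal sketch of that computation, with made-up weekly values standing in for the real metric data:

```python
# Sketch: fraction of CPU time in a selected function, as a time series.
# The sample values below are placeholders, not real Parca measurements.

def ratio_series(function_samples, total_samples):
    """Point-wise fraction of total CPU time spent in the selected function."""
    return [f / t for f, t in zip(function_samples, total_samples)]

# Toy weekly CPU-nanosecond totals over a 4-week optimization effort:
elf_open = [40, 30, 20, 10]        # samples matching function=~"debug/elf.*"
total = [100, 100, 100, 100]       # all samples for job="parca"

print(ratio_series(elf_open, total))  # → [0.4, 0.3, 0.2, 0.1]
```

A steadily shrinking ratio like this is exactly the signal the comment describes: visible, sustained progress on one function's share of CPU time, independent of overall load.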