k8ssandra / management-api-for-apache-cassandra

RESTful / Secure Management Sidecar for Apache Cassandra
Apache License 2.0

Integrate the Pyroscope agent in the Cassandra/DSE builds to enable continuous profiling #462

Open adejanovski opened 3 months ago

adejanovski commented 3 months ago

Flamegraphs are often the best (if not the only) way to properly identify what's causing performance issues in Cassandra. Grafana Pyroscope is a continuous profiling database which allows displaying flamegraphs in Grafana and would be a great addition to our toolbelt.

We should add the Pyroscope Java agent to our builds, disabled by default (see the PYROSCOPE_AGENT_ENABLED env variable) and fully configurable through env variables.
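
To illustrate the "disabled by default" part, here is a minimal sketch of how a startup wrapper could decide whether to load the agent. It is only an assumption about how this might be wired, not how the builds do it: the jar path and the helper name `pyroscope_jvm_opts` are hypothetical, and the only name taken from this issue is PYROSCOPE_AGENT_ENABLED. Everything else (server address, application name, intervals) would be left for the agent to read from its own env variables.

```python
import os
from typing import List, Mapping

# Hypothetical location of the agent jar inside the image (assumption).
PYROSCOPE_JAR = "/opt/pyroscope/pyroscope.jar"


def pyroscope_jvm_opts(env: Mapping[str, str]) -> List[str]:
    """Return the extra JVM options that load the agent, or [] when disabled.

    The agent is treated as disabled unless PYROSCOPE_AGENT_ENABLED is
    explicitly set to "true" (case-insensitive).
    """
    enabled = env.get("PYROSCOPE_AGENT_ENABLED", "false").strip().lower() == "true"
    if not enabled:
        return []
    return [f"-javaagent:{PYROSCOPE_JAR}"]


if __name__ == "__main__":
    # Print the options so a shell entrypoint could append them to JVM_OPTS.
    print(" ".join(pyroscope_jvm_opts(os.environ)))
```

With the variable unset the helper returns nothing, so the JVM starts without the agent and there is no profiling overhead unless a user opts in.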

### Definition of Done
- [ ] The Pyroscope agent is added to our builds and disabled by default

burmanm commented 3 months ago

I don't think this provides users with anything interesting. What on earth would users do with thread profiling of Cassandra? It doesn't reveal much of useful information even, given how Cassandra is architected.

If the user is a Cassandra developer, then perhaps they might get something useful out of it, but not otherwise.

adejanovski commented 3 months ago

> It doesn't reveal much of useful information even, given how Cassandra is architected

My experience with diagnosing Cassandra performance issues contradicts this. It is VERY useful. It can tell you whether compaction is killing your performance, or whether it's GC, tombstones, etc., even in cases where metrics and logs are misleading.

Miles-Garnsey commented 3 months ago

Seconded, I've also used flame charts to diagnose performance problems.

My only reservation with this is that I think we'd want to have a good understanding of any performance impacts caused by running tracing continuously. It might be more interesting to sample traces periodically.

NB: if we had a service mesh we could be examining network traces too, which would possibly be even more useful...

adejanovski commented 3 months ago

> My only reservation with this is that I think we'd want to have a good understanding of any performance impacts caused by running tracing continuously. It might be more interesting to sample traces periodically.

Yeah, the impact of continuous profiling needs to be evaluated. I guess we can tune the profiling intervals to avoid profiling all the time.

> NB: if we had a service mesh we could be examining network traces too, which would possibly be even more useful...

A service mesh is something we should explore, to see what benefits we could get out of it (easy TLS orchestration being one) and what drawbacks it would impose on us (higher latencies being one).