grafana / pyroscope

Continuous Profiling Platform. Debug performance issues down to a single line of code
https://grafana.com/oss/pyroscope/
GNU Affero General Public License v3.0

periodicTask takes up a lot of CPU time. #294

Closed hiyanxu closed 3 years ago

hiyanxu commented 3 years ago

(screenshot) Pyroscope takes up a lot of CPU time when I start the pyroscope-server. From a CPU pprof I found that periodicTask takes up a lot of CPU time. How can it be optimized?

kolesnikovae commented 3 years ago

Thank you for reporting the issue!

The BadgerDB GC call can be quite expensive. We recently made some improvements that should alleviate this problem, but they have not been released yet.

Could you please clarify the following:

  1. Pyroscope version.
  2. DB size on disk (e.g. du -sh).
  3. What is the pyroscope server's CPU usage from the OS perspective (%)?
  4. Does the problem persist while the server is running, or do you observe it only at startup?

hiyanxu commented 3 years ago

Thanks for your attention.

We integrated the pyroscope-server into a platform called bianque, so the PID shown belongs to bianque: (screenshot)

Pyroscope version: latest main branch

DB size on disk: (screenshot)

CPU usage: the machine has 8 vCPUs and 32 GiB of RAM. (screenshot) Sometimes CPU usage reaches 150%.

Not only at startup – the problem persists while the server is running.

Is this a configuration error on my part?

In the production environment (the machine has 2 vCPUs and 4 GiB of RAM), CPU usage is: (screenshot)

hiyanxu commented 3 years ago

Is this a bug? Why is RunValueLogGC called in an infinite loop? (screenshot)

When I delete the infinite loop (for {}), the CPU pprof looks like this: (screenshot)

kolesnikovae commented 3 years ago

Thank you for the info, it's very useful.

Regarding the code snippet – this should not be a problem and the loop is not supposed to be an infinite one:

// If a call to RunValueLogGC results in no rewrites, then an ErrNoRewrite is
// thrown indicating that the call resulted in no file rewrites.

And from the official documentation:

This method is designed to do garbage collection while Badger is online. Along with randomly picking a file, it uses statistics generated by the LSM-tree compactions to pick files that are likely to lead to maximum space reclamation. It is recommended to be called during periods of low activity in your system, or periodically. One call would only result in removal of at max one log file. As an optimization, you could also immediately re-run it whenever it returns nil error (indicating a successful value log GC), as shown below.

Thus, once all value log files are GC'ed, the badgerGCTask call returns. Deleting the loop effectively disables GC (almost), which leads to database bloating.
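
For reference, the pattern recommended by the Badger documentation looks roughly like this – a minimal sketch with an illustrative ticker interval and discard ratio, not the exact values or structure used in Pyroscope's badgerGCTask:

```go
package main

import (
	"log"
	"time"

	badger "github.com/dgraph-io/badger/v2"
)

func main() {
	db, err := badger.Open(badger.DefaultOptions("/tmp/badger-gc-example"))
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Run value-log GC periodically. A single RunValueLogGC call rewrites at
	// most one value log file, so it is re-run until it returns an error
	// (typically badger.ErrNoRewrite, meaning nothing more can be reclaimed).
	ticker := time.NewTicker(5 * time.Minute)
	defer ticker.Stop()
	for range ticker.C {
		for {
			if err := db.RunValueLogGC(0.7); err != nil {
				break // ErrNoRewrite or another error: wait for the next tick
			}
		}
	}
}
```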


I believe it is very unlikely that this has anything to do with configuration. Could you also please send the exact pyroscope version and your configuration?

It would also be very helpful to take a look at memory consumption (it is surprisingly low judging by the screenshots) and at the pyroscope metrics, especially those related to caching.

hiyanxu commented 3 years ago

Pyroscope version: v0.0.35

Config: (screenshot)

Thanks again for your attention!

kolesnikovae commented 3 years ago

Thank you! What caught my eye is that the cache eviction options are set to 0 – this may cause an OOM kill. The defaults are:

CacheEvictThreshold: 0.25
CacheEvictVolume:    0.33 

I suspect that badger GC consumes so much CPU time because too many badger key versions are created for cached items. The upcoming release (0.0.36) brings changes to the LFU cache and the persistence model, which should fix the issue. Before the release you can check whether the problem persists in HEAD (which may not be stable).

hiyanxu commented 3 years ago

Thank you. I'll try.

hiyanxu commented 3 years ago

Thanks for your answer, I tried it. It was effective at the beginning, but the next day I found that it took up a lot of CPU again, even more than 200%. In the CPU pprof, the function runtime.gcBgMarkWorker is the largest consumer. (screenshot) Looking further at the memory inuse_objects profile, ingestHandler produced up to 98% of the objects. (screenshot)

So is there a problem?

kolesnikovae commented 3 years ago

Thank you @hiyanxu, I really appreciate your help!

It was effective at the beginning, but the next day I found that it took up a lot of CPU again, even more than 200%. In the CPU pprof, the function runtime.gcBgMarkWorker is the largest consumer.

I'm wondering whether such high utilisation is observed periodically or is constant. In Go, GC takes up to 25% of the available CPU, so that would mean the GC is running all the time.
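
One way to check this directly on a running Go process – a minimal standalone sketch, not Pyroscope code – is to read the runtime's own accounting; GCCPUFraction is the same value exported as the go_memstats_gc_cpu_fraction metric:

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	// Every 10 seconds, print how much of the program's available CPU time the
	// garbage collector has used since start, and how many cycles have completed.
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()
	for range ticker.C {
		var ms runtime.MemStats
		runtime.ReadMemStats(&ms)
		fmt.Printf("GC CPU fraction: %.3f, completed GC cycles: %d\n",
			ms.GCCPUFraction, ms.NumGC)
	}
}
```

A fraction that keeps climbing toward 0.25 would support the suspicion that the collector is effectively running all the time.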

Looking further at the memory inuse_objects profile, ingestHandler produced up to 98% of the objects.

This is fine. Apart from data ingestion and queries, there should be no other causes for allocations.


Can we take a look at pyroscope server metrics?

hiyanxu commented 3 years ago

OK, thank you. Regarding the first question:

I'm wondering whether such high utilisation is observed periodically or is constant. In Go, GC takes up to 25% of the available CPU, so that would mean the GC is running all the time.

It grows gradually. The server has been running for 27 hours and the CPU usage is very high. (screenshots)

And here is the metrics output:

# HELP cache_dicts_hit 
# TYPE cache_dicts_hit counter
cache_dicts_hit 805831
# HELP cache_dicts_miss 
# TYPE cache_dicts_miss counter
cache_dicts_miss 5
# HELP cache_dicts_size 
# TYPE cache_dicts_size gauge
cache_dicts_size 5
# HELP cache_dimensions_hit 
# TYPE cache_dimensions_hit counter
cache_dimensions_hit 41215
# HELP cache_dimensions_miss 
# TYPE cache_dimensions_miss counter
cache_dimensions_miss 5
# HELP cache_dimensions_size 
# TYPE cache_dimensions_size gauge
cache_dimensions_size 5
# HELP cache_segments_hit 
# TYPE cache_segments_hit counter
cache_segments_hit 41215
# HELP cache_segments_miss 
# TYPE cache_segments_miss counter
cache_segments_miss 5
# HELP cache_segments_size 
# TYPE cache_segments_size gauge
cache_segments_size 5
# HELP cache_trees_hit 
# TYPE cache_trees_hit counter
cache_trees_hit 273748
# HELP cache_trees_miss 
# TYPE cache_trees_miss counter
cache_trees_miss 44213
# HELP cache_trees_size 
# TYPE cache_trees_size gauge
cache_trees_size 44213
# HELP cpu_utilization 
# TYPE cpu_utilization gauge
cpu_utilization 62.70136311206501
# HELP disk_dicts 
# TYPE disk_dicts gauge
disk_dicts 3.931704363e+09
# HELP disk_dimensions 
# TYPE disk_dimensions gauge
disk_dimensions 3.3228217e+07
# HELP disk_local_profiles 
# TYPE disk_local_profiles gauge
disk_local_profiles 0
# HELP disk_main 
# TYPE disk_main gauge
disk_main 4.3992523e+07
# HELP disk_segments 
# TYPE disk_segments gauge
disk_segments 3.088175454e+09
# HELP disk_trees 
# TYPE disk_trees gauge
disk_trees 1.824761403e+10
# HELP evictions_alloc_bytes 
# TYPE evictions_alloc_bytes gauge
evictions_alloc_bytes 3.727698824e+09
# HELP evictions_total_bytes 
# TYPE evictions_total_bytes gauge
evictions_total_bytes 3.302148096e+10
# HELP evictions_used_perc 
# TYPE evictions_used_perc gauge
evictions_used_perc 0.11288708790848852
# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 5.9954e-05
go_gc_duration_seconds{quantile="0.25"} 6.7437e-05
go_gc_duration_seconds{quantile="0.5"} 7.4392e-05
go_gc_duration_seconds{quantile="0.75"} 9.0723e-05
go_gc_duration_seconds{quantile="1"} 0.004263853
go_gc_duration_seconds_sum 6.822021728
go_gc_duration_seconds_count 64026
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 205
# HELP go_info Information about the Go environment.
# TYPE go_info gauge
go_info{version="go1.14.15"} 1
# HELP go_memstats_alloc_bytes Number of bytes allocated and still in use.
# TYPE go_memstats_alloc_bytes gauge
go_memstats_alloc_bytes 3.65137924e+09
# HELP go_memstats_alloc_bytes_total Total number of bytes allocated, even if freed.
# TYPE go_memstats_alloc_bytes_total counter
go_memstats_alloc_bytes_total 6.68894203056e+12
# HELP go_memstats_buck_hash_sys_bytes Number of bytes used by the profiling bucket hash table.
# TYPE go_memstats_buck_hash_sys_bytes gauge
go_memstats_buck_hash_sys_bytes 2.008782e+06
# HELP go_memstats_frees_total Total number of frees.
# TYPE go_memstats_frees_total counter
go_memstats_frees_total 3.4384101767e+10
# HELP go_memstats_gc_cpu_fraction The fraction of this program's available CPU time used by the GC since the program started.
# TYPE go_memstats_gc_cpu_fraction gauge
go_memstats_gc_cpu_fraction 0.06964941632339433
# HELP go_memstats_gc_sys_bytes Number of bytes used for garbage collection system metadata.
# TYPE go_memstats_gc_sys_bytes gauge
go_memstats_gc_sys_bytes 2.79880832e+08
# HELP go_memstats_heap_alloc_bytes Number of heap bytes allocated and still in use.
# TYPE go_memstats_heap_alloc_bytes gauge
go_memstats_heap_alloc_bytes 3.65137924e+09
# HELP go_memstats_heap_idle_bytes Number of heap bytes waiting to be used.
# TYPE go_memstats_heap_idle_bytes gauge
go_memstats_heap_idle_bytes 3.828137984e+09
# HELP go_memstats_heap_inuse_bytes Number of heap bytes that are in use.
# TYPE go_memstats_heap_inuse_bytes gauge
go_memstats_heap_inuse_bytes 3.68631808e+09
# HELP go_memstats_heap_objects Number of allocated objects.
# TYPE go_memstats_heap_objects gauge
go_memstats_heap_objects 6.7514228e+07
# HELP go_memstats_heap_released_bytes Number of heap bytes released to OS.
# TYPE go_memstats_heap_released_bytes gauge
go_memstats_heap_released_bytes 3.401736192e+09
# HELP go_memstats_heap_sys_bytes Number of heap bytes obtained from system.
# TYPE go_memstats_heap_sys_bytes gauge
go_memstats_heap_sys_bytes 7.514456064e+09
# HELP go_memstats_last_gc_time_seconds Number of seconds since 1970 of last garbage collection.
# TYPE go_memstats_last_gc_time_seconds gauge
go_memstats_last_gc_time_seconds 1.6273012078086815e+09
# HELP go_memstats_lookups_total Total number of pointer lookups.
# TYPE go_memstats_lookups_total counter
go_memstats_lookups_total 0
# HELP go_memstats_mallocs_total Total number of mallocs.
# TYPE go_memstats_mallocs_total counter
go_memstats_mallocs_total 3.4451615995e+10
# HELP go_memstats_mcache_inuse_bytes Number of bytes in use by mcache structures.
# TYPE go_memstats_mcache_inuse_bytes gauge
go_memstats_mcache_inuse_bytes 13888
# HELP go_memstats_mcache_sys_bytes Number of bytes used for mcache structures obtained from system.
# TYPE go_memstats_mcache_sys_bytes gauge
go_memstats_mcache_sys_bytes 16384
# HELP go_memstats_mspan_inuse_bytes Number of bytes in use by mspan structures.
# TYPE go_memstats_mspan_inuse_bytes gauge
go_memstats_mspan_inuse_bytes 5.3498456e+07
# HELP go_memstats_mspan_sys_bytes Number of bytes used for mspan structures obtained from system.
# TYPE go_memstats_mspan_sys_bytes gauge
go_memstats_mspan_sys_bytes 6.0063744e+07
# HELP go_memstats_next_gc_bytes Number of heap bytes when next garbage collection will take place.
# TYPE go_memstats_next_gc_bytes gauge
go_memstats_next_gc_bytes 7.302589872e+09
# HELP go_memstats_other_sys_bytes Number of bytes used for other system allocations.
# TYPE go_memstats_other_sys_bytes gauge
go_memstats_other_sys_bytes 7.637866e+06
# HELP go_memstats_stack_inuse_bytes Number of bytes in use by the stack allocator.
# TYPE go_memstats_stack_inuse_bytes gauge
go_memstats_stack_inuse_bytes 1.736704e+06
# HELP go_memstats_stack_sys_bytes Number of bytes obtained from system for stack allocator.
# TYPE go_memstats_stack_sys_bytes gauge
go_memstats_stack_sys_bytes 1.736704e+06
# HELP go_memstats_sys_bytes Number of bytes obtained from system.
# TYPE go_memstats_sys_bytes gauge
go_memstats_sys_bytes 7.865800376e+09
# HELP go_threads Number of OS threads created.
# TYPE go_threads gauge
go_threads 19
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 218881.16
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1.048576e+06
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 97
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 4.39545856e+09
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.62720245611e+09
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 4.3850661888e+10
# HELP process_virtual_memory_max_bytes Maximum amount of virtual memory available in bytes.
# TYPE process_virtual_memory_max_bytes gauge
process_virtual_memory_max_bytes 1.8446744073709552e+19
# HELP promhttp_metric_handler_requests_in_flight Current number of scrapes being served.
# TYPE promhttp_metric_handler_requests_in_flight gauge
promhttp_metric_handler_requests_in_flight 1
# HELP promhttp_metric_handler_requests_total Total number of scrapes by HTTP status code.
# TYPE promhttp_metric_handler_requests_total counter
promhttp_metric_handler_requests_total{code="200"} 0
promhttp_metric_handler_requests_total{code="500"} 0
promhttp_metric_handler_requests_total{code="503"} 0
# HELP storage_dicts_read 
# TYPE storage_dicts_read counter
storage_dicts_read 5
# HELP storage_dicts_write 
# TYPE storage_dicts_write counter
storage_dicts_write 63929
# HELP storage_dimensions_read 
# TYPE storage_dimensions_read counter
storage_dimensions_read 5
# HELP storage_dimensions_write 
# TYPE storage_dimensions_write counter
storage_dimensions_write 63929
# HELP storage_segments_read 
# TYPE storage_segments_read counter
storage_segments_read 5
# HELP storage_segments_write 
# TYPE storage_segments_write counter
storage_segments_write 63928
# HELP storage_trees_read 
# TYPE storage_trees_read counter
storage_trees_read 18
# HELP storage_trees_write 
# TYPE storage_trees_write counter
storage_trees_write 805818
# HELP write_back_count 
# TYPE write_back_count counter
write_back_count 63930
# HELP write_back_timer 
# TYPE write_back_timer gauge
write_back_timer 1.13790845e+09


Thank you!

kolesnikovae commented 3 years ago

The number of GCs is definitely abnormal. Is it possible that GC is invoked explicitly (runtime.GC()) in the code (outside of pyroscope codebase)?

go_gc_duration_seconds_count 64026
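
If it helps, a quick way to check for explicit collections from inside the embedding application – a minimal sketch, not something Pyroscope ships – is to compare the runtime's forced-GC counter against the total:

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	// NumGC counts all completed GC cycles; NumForcedGC counts only the cycles
	// forced by the application calling runtime.GC() directly.
	fmt.Printf("total GC cycles: %d, forced by runtime.GC(): %d\n", ms.NumGC, ms.NumForcedGC)
}
```

If NumForcedGC stays at zero, the cycles are all being triggered by heap growth rather than explicit calls.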

Can I also take a look at the pyroscope.server.alloc_space profile (from the Pyroscope UI)? I was trying to reproduce the issue but couldn't at first. How many clients do you have? Update: I've now reproduced the problem and I'm working on it.

hiyanxu commented 3 years ago

GC is not invoked explicitly in the code. The pyroscope.server profile was disabled.

How many clients: Just one.

But more importantly, over time the agent takes more and more CPU time. We have three servers; logic1 and logic2 are profiled by pyroscope and they consume more CPU. (screenshot)

Comparing in the Pyroscope UI, ProfileSession takeSnapshots() is taking more and more CPU. (screenshot)

Server: 4 vCPUs, 8 GiB. Config: DisableGCRuns: true, SampleRate: 20

So, is this normal? Thank you!

kolesnikovae commented 3 years ago

Hi @hiyanxu, I'd like to let you know that I refactored some of the problematic functions: https://github.com/pyroscope-io/pyroscope/pull/303 – the work is ongoing, thank you for your help! :)

Regarding takeSnapshots: its performance is mostly determined by the runtime state – the more goroutines exist and the bigger the heap, the more CPU time a takeSnapshots call takes. However, the PR above should improve its performance. Still, the profiling overhead in the picture is astounding! It is supposed to be less than 5%.
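
For intuition about why the cost depends on runtime state, here is a minimal standard-library sketch – not the agent's actual implementation – of the kind of work a snapshot involves; both collections scale with how many goroutines and live heap objects exist:

```go
package main

import (
	"bytes"
	"fmt"
	"runtime/pprof"
)

func main() {
	var buf bytes.Buffer

	// Goroutine profile: the runtime captures a stack for every live goroutine.
	if err := pprof.Lookup("goroutine").WriteTo(&buf, 0); err != nil {
		fmt.Println("goroutine profile:", err)
	}

	// Heap profile: the runtime walks its allocation records for live objects.
	if err := pprof.WriteHeapProfile(&buf); err != nil {
		fmt.Println("heap profile:", err)
	}

	fmt.Printf("collected %d bytes of profile data\n", buf.Len())
}
```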

hiyanxu commented 3 years ago

OK, thank you for your attention! I'll keep looking for a solution regarding takeSnapshots.

Thanks again.

kolesnikovae commented 3 years ago

I created a separate issue regarding Go GC CPU consumption: https://github.com/pyroscope-io/pyroscope/issues/306. There is a way to temporarily work around the problem:

A workaround for deployed servers that experience the problem is to set very aggressive cache eviction with the -cache-evict-threshold option, e.g. 0.1 or less (depending on the available memory). The option specifies the share of consumed RAM at which cached items get evicted; the amount of evicted items is controlled by -cache-evict-volume.

The initial problem with Badger GC CPU consumption is fixed in 0.0.36 (not released yet); this issue is to be closed after the release.

kolesnikovae commented 3 years ago

We've released v0.0.36 with the fix and some optimizations that should resolve https://github.com/pyroscope-io/pyroscope/issues/307.

@hiyanxu, could you please confirm that the initial problem is solved?

hiyanxu commented 3 years ago

@kolesnikovae I'm sorry to tell you that I can't continue with the original project because of some changes in my work. I have talked to my former colleagues; if they continue the project, I will let you know whether the fix is effective. Sorry~