performancecopilot / pcp

Performance Co-Pilot
https://pcp.io

pmdastatsd segfaults in pmdaRehash under a bigger load #2062

Closed: pmoravec closed this issue 2 months ago

pmoravec commented 2 months ago

When sending a few thousand metrics via statsd to PCP, I get a random segfault of pmdastatsd. I tuned my reproducer to hit this every 30 minutes on average. The reproducer requires foreman+katello+pulp installed, so it is hard to reproduce on your own, I think. But I can provide a full 250MB coredump if required.

Backtrace of the segfaulting thread:

(gdb) bt full
#0  pmdaRehash (pmda=pmda@entry=0x55997bf54910, metrics=<optimized out>, nmetrics=<optimized out>) at open.c:519
        extp = 0x55997bf549e0
        hashp = 0x55997bf549f0
        metric = 0x7f2d2ba5e010
        buf = " \221\370{\231U\000\000\360\216\366{\231U\000\000\360#\371{\231U\000\000\205\230\230?-\177\000"
        m = 4992
#1  0x000055997a75926b in statsd_possible_reload (pmda=pmda@entry=0x55997bf54910) at pmda-callbacks.c:462
        data = 0x55997a961b00 <data>
        need_reload = <optimized out>
#2  0x000055997a759b4b in statsd_fetch (num_pm_id=340, pm_id_list=0x55997bfb452c, resp=0x7ffeb0b69b98, pmda=0x55997bf54910) at pmda-callbacks.c:685
        data = 0x55997a961b00 <data>
#3  0x00007f2d3f985d69 in __pmdaMainPDU (dispatch=0x7ffeb0b69c40) at mainloop.c:175
        pb = 0x55997bfb4510
        sts = 0
        psts = <optimized out>
        op_sts = 0
        pmid = 239076366
        desc = {pmid = 239076366, type = 5, indom = 239075330, sem = 1, units = {pad = 0, scaleCount = 0, scaleTime = 0, scaleSpace = 0, dimCount = 0, dimTime = 0, dimSpace = 0}}
        npmids = 340
        pmidlist = 0x55997bfb452c
        namelist = 0x0
        name = 0x55997bf89140 "\340\220\370{\231U"
        offspring = 0x0
        statuslist = 0x0
        subtype = 0
        result = 0x55997bfc6d20
        ctxnum = 0
        length = 0
        when = {tv_sec = 0, tv_usec = 0}
        ident = 239076700
        type = 6
        indom = 239075330
        inst = 0
        iname = 0x0
        inres = 0x55997bf92340
        labels = 0x0
        buffer = 0x55997a9619a0 <full_description> "Number of datagrams/packets that the agent has received\nduring its lifetime.\n"
        new_profile = 0x55997bfb8770
        profile = 0x55997bfb8770
        first_time = 0
        pmda = 0x55997bf54910
        pinpdu = <optimized out>
#4  0x000055997a752d48 in main_PDU_loop (dispatch=0x7ffeb0b69c40) at pmdastatsd.c:250
No locals.
#5  main (argc=3, argv=0x7ffeb0b69ee8) at pmdastatsd.c:417
        new_action = {__sigaction_handler = {sa_handler = 0x55997a75a380 <signal_handler>, sa_sigaction = 0x55997a75a380 <signal_handler>}, sa_mask = {__val = {0 <repeats 16 times>}}, 
          sa_flags = 536870912, sa_restorer = 0xa30}
        old_action = {__sigaction_handler = {sa_handler = 0x0, sa_sigaction = 0x0}, sa_mask = {__val = {0, 0, 0, 0, 1024, 0, 1725472837, 533999999, 1725472837, 533999999, 1725472837, 
              533999999, 0, 0, 0, 94117698027536}}, sa_flags = 0, sa_restorer = 0x0}
        sep = <optimized out>
        dispatch = {domain = 57, comm = {pmda_interface = 7, pmapi_version = 2, flags = 0}, status = 0, version = {any = {ext = 0x55997bf54910, profile = 0x7f2d3f981960 <pmdaProfile>, 
              fetch = 0x55997a759b20 <statsd_fetch>, desc = 0x55997a759870 <statsd_desc>, instance = 0x55997a759ae0 <statsd_instance>, text = 0x55997a7598a0 <statsd_text>, 
              store = 0x7f2d3f983530 <pmdaStore>}, two = {ext = 0x55997bf54910, profile = 0x7f2d3f981960 <pmdaProfile>, fetch = 0x55997a759b20 <statsd_fetch>, 
              desc = 0x55997a759870 <statsd_desc>, instance = 0x55997a759ae0 <statsd_instance>, text = 0x55997a7598a0 <statsd_text>, store = 0x7f2d3f983530 <pmdaStore>}, three = {
              ext = 0x55997bf54910, profile = 0x7f2d3f981960 <pmdaProfile>, fetch = 0x55997a759b20 <statsd_fetch>, desc = 0x55997a759870 <statsd_desc>, 
              instance = 0x55997a759ae0 <statsd_instance>, text = 0x55997a7598a0 <statsd_text>, store = 0x7f2d3f983530 <pmdaStore>}, four = {ext = 0x55997bf54910, 
              profile = 0x7f2d3f981960 <pmdaProfile>, fetch = 0x55997a759b20 <statsd_fetch>, desc = 0x55997a759870 <statsd_desc>, instance = 0x55997a759ae0 <statsd_instance>, 
              text = 0x55997a7598a0 <statsd_text>, store = 0x7f2d3f983530 <pmdaStore>, pmid = 0x55997a759b90 <statsd_pmid>, name = 0x55997a759be0 <statsd_name>, 
              children = 0x55997a759c30 <statsd_children>}, five = {ext = 0x55997bf54910, profile = 0x7f2d3f981960 <pmdaProfile>, fetch = 0x55997a759b20 <statsd_fetch>, 
              desc = 0x55997a759870 <statsd_desc>, instance = 0x55997a759ae0 <statsd_instance>, text = 0x55997a7598a0 <statsd_text>, store = 0x7f2d3f983530 <pmdaStore>, 
              pmid = 0x55997a759b90 <statsd_pmid>, name = 0x55997a759be0 <statsd_name>, children = 0x55997a759c30 <statsd_children>}, six = {ext = 0x55997bf54910, 
              profile = 0x7f2d3f981960 <pmdaProfile>, fetch = 0x55997a759b20 <statsd_fetch>, desc = 0x55997a759870 <statsd_desc>, instance = 0x55997a759ae0 <statsd_instance>, 
              text = 0x55997a7598a0 <statsd_text>, store = 0x7f2d3f983530 <pmdaStore>, pmid = 0x55997a759b90 <statsd_pmid>, name = 0x55997a759be0 <statsd_name>, 
              children = 0x55997a759c30 <statsd_children>, attribute = 0x7f2d3f9835a0 <pmdaAttribute>}, seven = {ext = 0x55997bf54910, profile = 0x7f2d3f981960 <pmdaProfile>, 
              fetch = 0x55997a759b20 <statsd_fetch>, desc = 0x55997a759870 <statsd_desc>, instance = 0x55997a759ae0 <statsd_instance>, text = 0x55997a7598a0 <statsd_text>, 
              store = 0x7f2d3f983530 <pmdaStore>, pmid = 0x55997a759b90 <statsd_pmid>, name = 0x55997a759be0 <statsd_name>, children = 0x55997a759c30 <statsd_children>, 
              attribute = 0x7f2d3f9835a0 <pmdaAttribute>, label = 0x55997a759c90 <statsd_label>}}}
(gdb)

Some variables from the backtrace:

(gdb) frame 0
#0  pmdaRehash (pmda=pmda@entry=0x55997bf54910, metrics=<optimized out>, nmetrics=<optimized out>) at open.c:519
519     if (__pmHashAdd(metric->m_desc.pmid, metric, hashp) < 0) {
(gdb) p metric
$20 = (pmdaMetric *) 0x7f2d2ba5e010
(gdb) p *metric
Cannot access memory at address 0x7f2d2ba5e010
(gdb) p hashp
$21 = (__pmHashCtl *) 0x55997bf549f0
(gdb) p *hashp
$22 = {nodes = 41434, hsize = 15360, hash = 0x55997bfd4120, next = 0x0, index = 0}
(gdb) list
514 
515     pmdaHashDelete(hashp);
516     for (m = 0; m < pmda->e_nmetrics; m++) {
517     metric = &pmda->e_metrics[m];
518 
519     if (__pmHashAdd(metric->m_desc.pmid, metric, hashp) < 0) {
520         pmNotifyErr(LOG_WARNING, "pmdaRehash: PMDA %s: "
521             "Hashed mapping for metrics disabled @ metric[%d] %s\n",
522             pmda->e_name, m,
523             pmIDStr_r(metric->m_desc.pmid, buf, sizeof(buf)));
(gdb) p m
$23 = 4992
(gdb) p *pmda
$24 = {e_flags = 14, e_ext = 0x55997bf549e0, e_sockname = 0x0, e_name = 0x7ffeb0b6a7bb "pmdastatsd", e_logfile = 0x55997bf548f0 "statsd.log", e_helptext = 0x0, e_status = 0, e_infd = 0, 
  e_outfd = 1, e_port = -1, e_singular = -1, e_ordinal = -1, e_direct = 0, e_domain = 57, e_nmetrics = 5091, e_nindoms = 3, e_help = -1, e_prof = 0x55997bfb8770, e_io = pmdaUnknown, 
  e_indoms = 0x55997bf57400, e_idp = 0x55997bf57400, e_metrics = 0x7f2d2ba37010, e_resultCallBack = 0x7f2d3f6e2900 <__pmFreeResultValues>, 
  e_fetchCallBack = 0x55997a759e60 <statsd_fetch_callback>, e_checkCallBack = 0x0, e_doneCallBack = 0x0, e_context = 0, e_endCallBack = 0x0, 
  e_labelCallBack = 0x55997a759ce0 <statsd_label_callback>}
(gdb) 

or:

(gdb) frame 1
#1  0x000055997a75926b in statsd_possible_reload (pmda=pmda@entry=0x55997bf54910) at pmda-callbacks.c:462
462         pmdaRehash(pmda, data->pcp_metrics, data->pcp_metric_count);
(gdb) p *(data->pcp_metrics)
$15 = {m_user = 0x55997a961ae0 <helper>, m_desc = {pmid = 239075328, type = 3, indom = 4294967295, sem = 3, units = {pad = 0, scaleCount = 0, scaleTime = 0, scaleSpace = 0, dimCount = 0, 
      dimTime = 0, dimSpace = 0}}}
(gdb) p data->pcp_metric_count
$16 = 5091
(gdb) p pmda
$17 = (pmdaExt *) 0x55997bf54910
(gdb) p *pmda
$18 = {e_flags = 14, e_ext = 0x55997bf549e0, e_sockname = 0x0, e_name = 0x7ffeb0b6a7bb "pmdastatsd", e_logfile = 0x55997bf548f0 "statsd.log", e_helptext = 0x0, e_status = 0, e_infd = 0, 
  e_outfd = 1, e_port = -1, e_singular = -1, e_ordinal = -1, e_direct = 0, e_domain = 57, e_nmetrics = 5091, e_nindoms = 3, e_help = -1, e_prof = 0x55997bfb8770, e_io = pmdaUnknown, 
  e_indoms = 0x55997bf57400, e_idp = 0x55997bf57400, e_metrics = 0x7f2d2ba37010, e_resultCallBack = 0x7f2d3f6e2900 <__pmFreeResultValues>, 
  e_fetchCallBack = 0x55997a759e60 <statsd_fetch_callback>, e_checkCallBack = 0x0, e_doneCallBack = 0x0, e_context = 0, e_endCallBack = 0x0, 
  e_labelCallBack = 0x55997a759ce0 <statsd_label_callback>}
(gdb) 

It seems to me the rehash was slowly in progress while some other thread updated the hash table (the for (m = 0; m < pmda->e_nmetrics; m++) loop was processing the 4992nd metric out of 5091)?

Relevant PCP version:

pcp-5.3.7-20.el8_10.x86_64
pcp-conf-5.3.7-20.el8_10.x86_64
pcp-libs-5.3.7-20.el8_10.x86_64
pcp-pmda-apache-5.3.7-20.el8_10.x86_64
pcp-pmda-openmetrics-5.3.7-20.el8_10.x86_64
pcp-pmda-postgresql-5.3.7-20.el8_10.x86_64
pcp-pmda-redis-5.3.7-20.el8_10.x86_64
pcp-pmda-statsd-5.3.7-20.el8_10.x86_64
pcp-selinux-5.3.7-20.el8_10.x86_64
pcp-system-tools-5.3.7-20.el8_10.x86_64
natoscott commented 2 months ago

@Erbenos any thoughts on this one?

One thing I noticed in a very brief look is that the crash is in a libpcp_pmda namespace routine, and this PMDA is multi-threaded. It would be worthwhile auditing the code to check that different threads cannot be reading and writing global PMDA data structures concurrently, as this will cause problems (things like the namespace tree, the metrics and indom tables, etc. are global state and need to be updated either by one thread only - the usual case - or under some lock protection, which the PMDA must provide).

Erbenos commented 2 months ago

It probably didn't occur to me that this applies to the pmdaRehash function at the time I was writing it.

It's clear that the pmdaRehash call is not guarded against race conditions at https://github.com/performancecopilot/pcp/blob/b71b90bc724337e4f3dae7be15690c98dc1d1886/src/pmdas/statsd/src/pmda-callbacks.c#L463

The guarding mutex for that is the one at

https://github.com/performancecopilot/pcp/blob/b71b90bc724337e4f3dae7be15690c98dc1d1886/src/pmdas/statsd/src/pmda-callbacks.c#L455

Off the top of my head, I think statsd_possible_reload's pthread_mutex_unlock should move to the end of the function, and the locking/unlocking of that mutex in statsd_map_stats (since it's called by statsd_possible_reload) should be removed (since the caller now manages it). The same goes for another callee deeper down: reset_stat (EDIT: never mind, that's a different mutex). That should avoid the race conditions that could have arisen above, though the locking would now be much less granular.
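
To make the idea concrete, here is a rough sketch of what I mean; the struct layout, the map_lock name and the statsd_map_stats prototype are just placeholders, not the actual pmdastatsd code:

#include <pthread.h>
#include <pcp/pmapi.h>
#include <pcp/pmda.h>

/* Sketch only: the struct layout, the map_lock name and the statsd_map_stats
 * prototype are placeholders, not the actual pmdastatsd source. */
struct pmda_data_extension {
    pthread_mutex_t map_lock;           /* the mutex from pmda-callbacks.c:455 (name assumed) */
    pmdaMetric     *pcp_metrics;
    int             pcp_metric_count;
    int             need_reload;
};

extern void statsd_map_stats(pmdaExt *pmda, struct pmda_data_extension *data);

static void
statsd_possible_reload(pmdaExt *pmda, struct pmda_data_extension *data)
{
    pthread_mutex_lock(&data->map_lock);      /* take the lock once, up front */
    if (data->need_reload) {
        statsd_map_stats(pmda, data);         /* callee no longer locks/unlocks this mutex itself */
        pmdaRehash(pmda, data->pcp_metrics, data->pcp_metric_count);
        data->need_reload = 0;
    }
    pthread_mutex_unlock(&data->map_lock);    /* unlock moved from before the reload to the end */
}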

The earliest I can do the change is Saturday. Then I can also look into it some more; this is just a quick analysis.

Perhaps the author of the issue can try changing that code on their system and see if the issue persists in their scenario? I can supply the diff/patch.

natoscott commented 2 months ago

@Erbenos thanks for taking a look!

> The earliest I can do the change is Saturday.

That'd be great - no huge rush, though we have a release planned for the 17th of this month. If it can make it in time for that, it'd be fantastic, but if not, the next one will follow soon enough (towards the end of October).

pmoravec commented 2 months ago

Hello, yes, I should be able to build a package based on a given patch and test it in my scenario.

Erbenos commented 2 months ago

Upon further inspection, I don't see any obvious code that would lead to a race condition there, so changing the code as I described above would be a completely blind fix, which is not really something I want to do. I guess I'll do my best to reproduce it locally.

pmoravec commented 2 months ago

I can provide, e.g., the collected stats if that helps mimic my reproducer.

I managed to simplify the reproducer to the point that the following should be enough:

I expect you can use just an httpd server that reports several statsd metrics per requested URI, and fire many tens of URI requests in a loop at the httpd/nginx server.
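
If a full foreman/httpd stack isn't available, a rough stand-in (not my actual reproducer) would be to fire many distinct statsd metric names directly at pmdastatsd's UDP socket, e.g. something like this minimal C sender (port 8125 is the usual statsd default; adjust to your configuration):

/* Rough stand-in load generator: blast many distinct counter names at
 * pmdastatsd's UDP socket. Run it for a while alongside periodic pminfo/pmlogger
 * fetches so the agent keeps remapping metrics. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in addr;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(8125);                 /* assumed statsd port */
    inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);

    char payload[128];
    for (;;) {
        for (int i = 0; i < 5000; i++) {         /* thousands of distinct metric names per round */
            int len = snprintf(payload, sizeof(payload),
                               "test.endpoint_%d.requests:1|c", i);
            sendto(fd, payload, len, 0, (struct sockaddr *)&addr, sizeof(addr));
        }
        sleep(1);
    }
    close(fd);                                   /* not reached; interrupt to stop */
    return 0;
}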

Erbenos commented 2 months ago

I did some analysis of the source code during the week and I think I found a sequence of events that may lead to the segfault in the hashing procedure. The issue isn't so much faulty locking as "uncommitted" metrics (the check is at https://github.com/performancecopilot/pcp/blob/87dd059ba5196f3d573ea20d47e6a77d8701df2c/src/pmdas/statsd/src/pmda-callbacks.c#L358) that the agent holds in its internal representation being mapped into PCP through https://github.com/performancecopilot/pcp/blob/87dd059ba5196f3d573ea20d47e6a77d8701df2c/src/pmdas/statsd/src/pmda-callbacks.c#L98.

Creating an internal metric record may happen in multiple steps, most notably when using labels/tags, as metric creation and appending the associated label are two separate operations, each of them locking/unlocking the same mutex that is used to synchronize mapping metrics into PCP (https://github.com/performancecopilot/pcp/blob/87dd059ba5196f3d573ea20d47e6a77d8701df2c/src/pmdas/statsd/src/aggregator-metrics.c#L126). When the creation of the metric succeeds but processing of the labels fails, the metric is deleted from the internal hashtable and the associated memory is freed (and because this memory also holds the strings that the PCP metric representation is built with, it's understandable that one would get a segfault when PCP tries to read it). The reason these "uncommitted" metrics would pass the check mentioned earlier is that the flag isn't correctly set when the memory for the internal metric representation is allocated at https://github.com/performancecopilot/pcp/blob/87dd059ba5196f3d573ea20d47e6a77d8701df2c/src/pmdas/statsd/src/aggregator-metrics.c#L299: the "pernament" flag (yeah, the spelling is incorrect) of the metric struct is left uninitialized.
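
A minimal sketch of the fix idea; the struct layout and function name are placeholders, not the actual aggregator-metrics.c code:

#include <stdlib.h>
#include <string.h>

/* Sketch only: the point is that the "pernament" (sic) flag must be zeroed on
 * allocation, so a half-created metric can never look committed to the mapping code. */
struct metric {
    char *name;
    int   pernament;     /* 0 = not yet committed to PCP, 1 = committed */
    /* ... other fields elided ... */
};

static struct metric *
create_metric_record(const char *name)
{
    /* calloc() zero-initializes every field, including pernament,
     * instead of leaving whatever garbage malloc() returned */
    struct metric *item = calloc(1, sizeof(*item));
    if (item == NULL)
        return NULL;
    item->name = strdup(name);
    if (item->name == NULL) {
        free(item);
        return NULL;
    }
    return item;
}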

I haven't verified it yet; it should be doable with a debugger by setting the right breakpoints and simulating such states, but the main issue for me right now is that I am having problems getting an environment set up for PCP to do so.

natoscott commented 2 months ago

@Erbenos let me know if there's anything I can help with regarding the environment - and if you have a patch in mind, I may be able to do a build/test cycle using the internal system @pmoravec is working with.

pmoravec commented 2 months ago

FYI I have a standalone reproducer where you need just RHEL8 with foreman installed on top of it. Roughly speaking, follow https://theforeman.org/manuals/3.11/index.html#2.1Installation and then enable foreman->statsd->pcp monitoring of everything, like the Ansible playbook in https://github.com/pmoravec/sat-perf-correlation does. In particular, run on a RHEL8 system:

# ensure DNS recognizes FQDN of the host, or run:
echo $(ip a | grep -v "127.0.0.1/8" | grep -m1 "inet " | cut -d/ -f1 | awk '{ print $2 }') $(hostname -f) $(hostname -s) >> /etc/hosts

subscription-manager repos --enable=rhel-8-for-x86_64-baseos-rpms --enable=rhel-8-for-x86_64-appstream-rpms
dnf -y install https://yum.puppet.com/puppet7-release-el-8.noarch.rpm
dnf -y install https://yum.theforeman.org/releases/3.11/el8/x86_64/foreman-release.rpm
dnf -y module enable postgresql:13
dnf -y module enable foreman:el8
dnf install "puppet-agent-7.28*"
dnf -y install foreman-installer

# RECALL THE PASSWORD from the foreman-installer output
foreman-installer

PASSWORD=whatever-installer-prints-to-you  # customize per your installation
curl -k -u admin:${PASSWORD} https://$(hostname -f)/apidoc/v2 | json_reformat | grep -e api_url -e GET | grep -B1 GET | grep api_url | sed "s/:id/1/g" | sed "s/:organization_id/1/g" | sed "s/:location_id/1/g" | sed "s/:user_id/1/g" | cut -d\" -f4 > api_endpoints.txt

dnf -y install python3-policycoreutils pcp pcp-pmda-statsd pcp-system-tools pcp-pmda-openmetrics foreman-pcp

ln -s /etc/pcp/proc/foreman-hotproc.conf /var/lib/pcp/pmdas/proc/hotproc.conf

semanage permissive -a pcp_pmcd_t  # maybe ridiculous but I got some AVCs otherwise..

# Install statsd and hotproc, openmetrics might be redundant
for d in /var/lib/pcp/pmdas/proc /var/lib/pcp/pmdas/statsd /var/lib/pcp/pmdas/openmetrics; do cd $d; ./Install; cd; done

# log statsd metrics
sed -i '/^\[access\]/i log advisory on default {\n  statsd\n}\n' /var/lib/pcp/config/pmlogger/config.default

systemctl enable pmcd pmlogger
systemctl start pmcd pmlogger

# Configure statsd telemetry to report everything in foreman
sed -i '/^  controller: \[/a \ \  \ \".*\",' /usr/share/foreman/config/initializers/5_telemetry.rb
sed -i '/^  action: \[/a \ \  \ \".*\",' /usr/share/foreman/config/initializers/5_telemetry.rb
sed -i '/^  class: \[/a \ \  \ \".*\",' /usr/share/foreman/config/initializers/5_telemetry.rb

# Enable statsd telemetry in foreman
foreman-installer --foreman-telemetry-prometheus-enabled false --foreman-telemetry-statsd-enabled true

# restart foreman related processes, just for sure
systemctl restart foreman dynflow-sidekiq@*

Now, the generated api_endpoints.txt file has 250-ish URIs that we will query. Some endpoints are wrong, some others will fail; it doesn't matter. Just generate as wide a range of requests to foreman as possible:

PASSWORD=the-foreman-password-from-installer
hname=$(hostname -f)
for endpoint in $(cat api_endpoints.txt); do
    curl -k -u admin:${PASSWORD} https://${hname}${endpoint}
done

Save that as reproducer_for_pmdastatsd_segfault.sh and run it in a loop:

while true; do date; ./reproducer_for_pmdastatsd_segfault.sh > /dev/null 2>&1; sleep 5; done

and wait for a segfault. Though, I only hit it quite sporadically.

Erbenos commented 2 months ago

The machines that are available to me all run on ARM; emulating different architectures works but is basically unusable for me because it's insanely slow. Hence the above case is too difficult for me to set up.

On another note, did you try looking into the agent's logs before it errors out, and could you perhaps share them? Perhaps that could identify some relevant code paths, especially if the segfault happens in proximity to a similar type of message across multiple reproductions. Try to set the logging to be as verbose as possible per its documentation.

pmoravec commented 2 months ago

I was testing the recent binary that @natoscott shipped, on my testing machine. After running my reproducer for 5 hours, no segfault! (Previously it was a matter of 5-10 minutes.)

I am not sure what patch I was previously testing that didn't work (see https://github.com/performancecopilot/pcp/pull/2069#issuecomment-2352643292). Now I see a fully stable pmdastatsd. Great work on the fix, thanks!

Erbenos commented 2 months ago

I'll celebrate when the stability is measured in months, not hours 🤣

pmoravec commented 2 months ago

> I'll celebrate when the stability is measured in months, not hours 🤣

Hold my calendar :D. I ran a one-day test with frequent and random restarts of pmcd and pmlogger (since I noticed the segfaults occur most often within 5-10 minutes after a restart; once pmdastatsd survives for, say, an hour, it will most probably survive my reproducer forever). Even in that reproducer, the fixed pmdastatsd has not segfaulted yet.

I would call it successful. But if you wish, tell me how long you would like the test to run; I can leave it running that long, that is no problem.

Erbenos commented 2 months ago

There is no need; if you believe the issue has been addressed, feel free to close it. It feels nice that someone is actually using it after such a long time and that it does serve some purpose.

pmoravec commented 2 months ago

> There is no need; if you believe the issue has been addressed, feel free to close it. It feels nice that someone is actually using it after such a long time and that it does serve some purpose.

I think so. Thanks for the fix!