seapath / ansible

This repo contains all the Ansible playbooks used to deploy or manage a cluster, as well as example inventories
https://lfenergy.org/projects/seapath/
Apache License 2.0

Realtime performance issues with pacemaker - crm #509

Closed: dknueppe closed this issue 2 days ago

dknueppe commented 3 weeks ago

I recently observed a large spike in cache misses occurring frequently. I traced it back to snmp_getdata.py, and more specifically to the invocation of "crm status" on line 212 of that script.

To check the cache misses I ran perf stat -e LLC-store-misses -I 2000 in a tmux session and invoked crm status by hand in a separate pane, which results in a ~10x increase in cache misses. LLC-load-misses showed the same behavior. Oddly enough, even simply invoking crm --help has the same outcome.
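
Roughly the two-pane setup, for anyone who wants to reproduce this:

perf stat -e LLC-store-misses -I 2000   # pane 1: print miss counts every 2 s
crm status                              # pane 2: triggers the ~10x spike ('crm --help' does too)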

This behavior was observed on Debian, with a standalone setup. crm is the pacemaker command line interface; why would that be needed on a standalone setup? Also, does the fact that it's a standalone machine mean that parts of the pacemaker setup are missing which would otherwise stop this behavior from occurring?

Here is what that looks like in conjunction with the 5 min (300 second) call interval defined in the cron job. The data in the 'perf_output*' files is simply one column with the number of cache misses collected over 2 s. The first one is after the initial SEAPATH installation; the second one is what it looks like after running the Debian standalone playbook. The x-axis is frequency in Hz; the y-axis values should only be used for comparison between the two plots and are of little use otherwise.

import numpy as np
from scipy.fft import fft, fftfreq
import matplotlib.pyplot as plt

# one column per file: LLC-store-miss counts, one sample every 2 s (perf -I 2000)
base_img_llc_store_misses = np.genfromtxt('perf_output_from132')
after_seapath_img_llc_store_misses = np.genfromtxt('perf_output_from148')

def plot(values):
    N = len(values)
    print(f"Analysed {N} samples.")
    # drop the first FFT bin: it is the 0 Hz (DC) component, and the average
    # miss count is distinctly non-zero, so it would dwarf every other peak
    spectrum = np.abs(fft(values))[1:N//2]
    scale = fftfreq(N, 2)[1:N//2]  # frequency axis for the 2 s sample spacing
    plt.plot(scale, 2.0/N * spectrum)
    print(f"Event with highest peak occurs every {1/scale[np.argmax(spectrum)]} seconds")
    plt.xlabel("frequency [Hz]")
    plt.grid()
    plt.show()

plot(base_img_llc_store_misses)

Analysed 1669 samples. Event with highest peak occurs every 9.994011976047904 seconds

(plot: FFT of the cache-miss samples after initial SEAPATH installation)

plot(after_seapath_img_llc_store_misses)

Analysed 1814 samples. Event with highest peak occurs every 302.3333333333333 seconds

(plot: FFT of the cache-miss samples after running the Debian standalone playbook)
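
For reference, the ~302 s peak lines up with the 5-minute cron interval mentioned above; the entry would look something like this (path and user are illustrative, the real entry comes from the playbooks):

*/5 * * * * root /usr/local/bin/snmp_getdata.py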

ebail commented 3 weeks ago

Thanks @dknueppe. That is interesting. I see two points:

@insatomcat, as Debian maintainer, what is your point of view?

insatomcat commented 3 weeks ago

Hi. The cache misses were never something we looked at to identify high latencies, because they are probably a cause rather than the consequence, and up till now we were looking at the consequence (an actual high latency reported by cyclictest, for example). I think the steps to move forward on this are:

@dknueppe, if it appears to be a real concern, do you think cache way allocation could help?
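
For reference, cache way allocation here would presumably be along the lines of partitioning the LLC with Intel CAT via resctrl, roughly like this (mask and core list are illustrative and depend on the CPU):

mount -t resctrl resctrl /sys/fs/resctrl
mkdir /sys/fs/resctrl/rt
echo "L3:0=0x7e0" > /sys/fs/resctrl/rt/schemata   # reserve a subset of L3 ways for the RT group
echo 2-3 > /sys/fs/resctrl/rt/cpus_list           # assign the isolated cores to that group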

As for why pacemaker is in standalone, that's easy: there is no "standalone" flavor in build_debian_iso, and I don't intend to maintain one (we already have so many to deal with...) since standalone is only for test purposes. The clustering packages (ceph, corosync, pacemaker...) are not configured in standalone and so are not supposed to be a problem, but anyone can remove the packages if they want.

> snmp_getdata.py might have impact on RT

I understand that the "crm" tool has this impact, and since crmsh is central to pacemaker, I can't imagine a cluster running without it... "crm status" is a very important tool for cluster observability.

dknueppe commented 2 weeks ago

Sorry for getting back so late. Running crm status has had a measurable, negative impact. I have so far been unable to fix it with cache way allocation, though I am not yet done with all possible tests. One thing that remains to be tested is cache way allocation for peripherals such as the GPU and DMA (in addition to the CPU cores), which may resolve the issue.

I am also not yet sure whether the observed cache misses are a symptom or a cause of the performance hits mentioned above, or whether they might be both: something causes cache misses, which in turn cause issues with scheduling latencies. But the fact that I could observe both occurring at the same frequency indicates a strong correlation.

insatomcat commented 2 weeks ago

Hello. The thing is, when I run cyclictest on an isolated core, I cannot reproduce any real latency issue with crm status:

(screenshot: cyclictest results)
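
For context, a minimal variant of such a check, assuming core 3 is the isolated one (all values illustrative):

cyclictest -t1 -a3 -p95 -m -i200 -D10m   # one SCHED_FIFO thread pinned to core 3
# second shell: generate the suspected load
while true; do crm status > /dev/null; done
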
eroussy commented 2 weeks ago

Hi @dknueppe, thanks for the issue. I see that you and Florent are already discussing the performance question, which is great.

Just some additional information about the "standalone" question: SEAPATH should have a standalone build available without all the clustering stuff (ceph, pacemaker, vm-manager, etc.). This is already available in Yocto and tested by utilities. There is currently no standalone version for Debian, but there should be one; we just haven't taken the time to make one yet.

Could you please open an issue on the build_debian_iso repository about this subject? And maybe, if you feel like it, you could propose a contribution for a standalone build? I'm open to creating the specifications with you if you want.

dknueppe commented 2 days ago

Hey there, this is a little embarrassing, but I think my initial suspicion of crm being the culprit was wrong. At the time, I had seen errors logged in dmesg along with the cache misses shown above, so I assumed I had found the misbehaving program and went on to disable the cron job calling the snmp_getdata.py script. You might see where this is going. Testing this again, running 'crm status' in a while-true loop had minimal impact on latency. However, running the whole snmp_getdata.py script in a while-true loop immediately showed a much bigger impact on maximum latencies (roughly the comparison sketched below).
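
Roughly the comparison, assuming cyclictest keeps running on the isolated core in parallel (the script invocation is illustrative):

while true; do crm status > /dev/null; done    # minimal impact on maximum latency
while true; do python3 snmp_getdata.py; done   # clearly higher maximum latency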

With that, I think it's best to close this issue and open a new one looking into the snmp script.