rundeck / rundeck

Enable Self-Service Operations: Give specific users access to your existing tools, services, and scripts
http://rundeck.org
Apache License 2.0

Execution display unresponsive with large number of nodes #8335

Closed alex-umani closed 1 year ago

alex-umani commented 1 year ago

Describe the bug

When attempting to view the execution display for a job with a large number of nodes (whether trying to access live display of a running job or reviewing the log for a completed job), the display page goes unresponsive. Waiting eventually displays the results, but this can take between 5 and 10 minutes to complete. The display time for execution logs seems to increase as the number of nodes increases.

With jobs targeting 100-300 nodes, the display is noticeably slow but manageable. As the number of nodes increases, the display time increases. Our largest jobs target 700-800 nodes and take 5-10 minutes to display (sometimes the page completely fails to load even with this amount of time).

This appears to be independent of log size and entirely dependent on number of nodes. We have a job that targets a single node and produces an 8.5G log. This display page loads immediately (with the proper "whale" notice). A job that targets 750 nodes but only produces a 4.5G log takes the 5-10 minutes to display properly with the page going unresponsive while loading.

No errors appear in the service log or any of the Rundeck logs. There is no appreciable impact on CPU or RAM on either the application host or the database host while these pages are loading. It appears to be almost entirely a client-side JavaScript loop.
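One way to separate server time from browser time is to fetch the execution output directly and time the request. The following is a minimal sketch, not something from the original report: the function names are hypothetical, the API version (41) is an assumption that may need adjusting for your installation, and `RD_URL` and `RD_TOKEN` are assumed to hold the server URL and an API token.

```shell
#!/bin/bash
# Hypothetical helpers for timing the server side of an execution log fetch.
# Assumptions: API version 41 is accepted by the server; RD_TOKEN is a valid API token.

# execution_output_url BASE_URL EXECUTION_ID [API_VERSION]
# Builds the URL of the execution output endpoint, dropping a trailing slash from BASE_URL.
execution_output_url() {
    echo "${1%/}/api/${3:-41}/execution/$2/output"
}

# time_execution_output BASE_URL EXECUTION_ID
# Prints the total request time in seconds as measured by curl; the body is discarded.
time_execution_output() {
    curl -s -o /dev/null -w '%{time_total}\n' \
        -H "X-Rundeck-Auth-Token: ${RD_TOKEN}" \
        -H 'Accept: application/json' \
        "$(execution_output_url "$1" "$2")"
}

# Example (not run here): time_execution_output "$RD_URL" 1234
```

If the request returns in seconds while the browser page still hangs for minutes, that would be consistent with the delay being in client-side rendering rather than on the server.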

The behaviour has been observed consistently on our server using Chrome, Firefox, and Edge browsers.

Full details and discussion, including confirmed reproduction of the issue, are here: https://groups.google.com/g/rundeck-discuss/c/7jX68I0iUxs

My Rundeck detail

To Reproduce

Steps to reproduce the behavior:

  1. Run a job or command execution against 700+ nodes
  2. Go to the job Activity list
  3. Click on the execution to load its display
  4. The page goes unresponsive for a long time then eventually loads

Expected behavior

The page should load without significant delays, certainly within the timeout notice period of standard browsers.


Additional context

The issue was first noted after we upgraded from 3.3.16 to 4.10.1 (following all interim upgrade steps as recommended). We recently upgraded to 4.13.0 and the issue persists.

I've run a Firefox profiler check on it and the entire delay seems to be a nested loop issue in /assets/static/components/uisockets-f616e545f28116cec9fa19bdff6c2cc3.js although I'm unclear from looking at the code what exactly that script is doing.

MegaDrive68k commented 1 year ago

Confirmed.

  1. Generate a docker environment to create 700 ssh nodes.

The following script generates a docker-compose.yaml file; it prompts for the number of nodes (700 in this case).

#!/bin/bash

echo 'Number of nodes to create?';
read nodeAmount;

# The loop below starts at 0, so stop at N-1 to create exactly N nodes.
let nodeAmount=$nodeAmount-1;

# Start a fresh file so re-running the script does not append duplicates.
echo 'version: "3"
services:' > docker-compose.yaml;

for i in $(seq 0 $nodeAmount);
do
    let numNode=$i+1;
    if [ $i -lt 9 ]; then
        numNode="0$((i+1))";
    fi
    echo "  ubuntu-$numNode:
    build: ./
    image: ubuntu-lunar
    container_name: ubuntu-$numNode
    ports:
      - '127.0.0.1:20$numNode:22'" >> docker-compose.yaml;
done
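
The numbering scheme used by the script can be checked in isolation. The following is a minimal sketch rather than part of the original script; the function names `node_suffix` and `host_port` are hypothetical. It shows how a zero-based loop index maps to a container name and a host port, with single-digit node numbers padded to two digits.

```shell
#!/bin/bash
# Hypothetical helpers mirroring the numbering used by the generator script.

# node_suffix INDEX: 1-based node number for a 0-based index, zero-padded to two digits.
node_suffix() {
    printf '%02d' "$(( $1 + 1 ))"
}

# host_port INDEX: host port that is forwarded to port 22 inside the container.
host_port() {
    echo "20$(node_suffix "$1")"
}

# Print the mapping for a few sample indexes.
for i in 0 9 99 699; do
    echo "ubuntu-$(node_suffix "$i") -> 127.0.0.1:$(host_port "$i")"
done
```

With this scheme, node 700 maps to port 20700. The ports stay unique and within the valid range up to 999 nodes; node 1000 would map to 201000, which exceeds the maximum port number of 65535.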

The generated docker-compose.yaml builds each container from this Dockerfile:

FROM ubuntu:latest

RUN apt update && apt install openssh-server sudo -y

RUN apt install openssh-client sudo -y

RUN useradd -rm -s /bin/bash -g root -G sudo -u 1000 test

RUN  echo 'test:test' | chpasswd

RUN mkdir -p /home/test/.ssh && \
    chmod 0700 /home/test/.ssh

RUN ssh-keygen -q -t rsa -N '' -f /home/test/.ssh/id_rsa

RUN chown -R test:root /home/test/.ssh

RUN sed -i'' -e's/^#PermitRootLogin prohibit-password$/PermitRootLogin yes/' /etc/ssh/sshd_config \
        && sed -i'' -e's/^#PasswordAuthentication yes$/PasswordAuthentication yes/' /etc/ssh/sshd_config \
        && sed -i'' -e's/^#PermitEmptyPasswords no$/PermitEmptyPasswords yes/' /etc/ssh/sshd_config \
        && sed -i'' -e's/^UsePAM yes/UsePAM no/' /etc/ssh/sshd_config \
        && sed -i'' -e's/^#PubkeyAuthentication yes$/PubkeyAuthentication no/' /etc/ssh/sshd_config

RUN service ssh start

EXPOSE 22

CMD ["/usr/sbin/sshd","-D"]
  2. Now generate the resources.xml file to match the Docker containers. The following script creates that file:

#!/bin/bash

echo 'Number of nodes to create?';
read nodeAmount;

# The loop below starts at 0, so stop at N-1 to emit exactly N nodes.
let nodeAmount=$nodeAmount-1;

# Start a fresh file so re-running the script does not append duplicates.
echo "<?xml version='1.0' encoding='UTF-8'?>
<project>" > resources.xml
for i in $(seq 0 $nodeAmount);
do
    # Node numbers are 1-based and zero-padded to two digits (01..09, 10, ...).
    let numNode=$i+1;
    if [ $i -lt 9 ]; then
        numNode="0$((i+1))";
    fi
    echo " <node
    name='"ubuntu-${numNode}"'
    tags='docker'
    hostname='"127.0.0.1:20${numNode}"'
    osArch='amd64'
    osFamily='unix'
    osName='Linux'
    username='test'
    ssh-port='"20${numNode}"'
    ssh-authentication='password'
    ssh-password-storage-path='keys/ubuntu.password'/>" >> resources.xml
done
echo "</project>" >> resources.xml

  3. Now, build the images with the following command: docker compose build.
  4. Start the remote SSH containers (this could take several minutes): docker compose up.
  5. Start a local Rundeck 4.13 instance and add the resources.xml file generated in step 2 as a file model source.
  6. Dispatch a command against the 700 nodes.
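
Before building the images, it can help to confirm that the two generated files describe the same number of nodes. The following is a minimal sketch under the assumption that both files are in the current directory and have the layout produced by the scripts above; the function names `count_lines` and `check_counts` are hypothetical.

```shell
#!/bin/bash
# Hypothetical sanity check: compare the number of containers in
# docker-compose.yaml with the number of nodes in resources.xml.

# count_lines PATTERN FILE: number of lines in FILE that match PATTERN (0 if none).
count_lines() {
    grep -c -- "$1" "$2" || true
}

# check_counts [COMPOSE_FILE] [RESOURCES_FILE]
# Prints both counts and returns success only when they are equal.
check_counts() {
    local compose=${1:-docker-compose.yaml}
    local resources=${2:-resources.xml}
    local containers nodes
    containers=$(count_lines 'container_name:' "$compose")
    nodes=$(count_lines '<node' "$resources")
    echo "containers=$containers nodes=$nodes"
    [ "$containers" = "$nodes" ]
}

# Run the check only when both generated files are present.
if [ -f docker-compose.yaml ] && [ -f resources.xml ]; then
    check_counts || echo "Mismatch between containers and nodes" >&2
fi
```

Both counts should equal the number of nodes you intended to create; a mismatch means the two scripts were given different inputs or loop over different ranges.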

The Rundeck 4.13 activity page becomes unresponsive.

This doesn't occur on Rundeck 3.4.

Thanks for your feedback @alex-umani!

tanji commented 1 year ago

4.9 does not have this issue FYI. Confirmed on 4.13

sbriglio commented 1 year ago

We ran into the same bug after upgrading from Rundeck 3.4.10 to 4.13.0.

padraiglennon commented 1 year ago

We are having the same issue


simon-c-msc commented 1 year ago

The issue is still present in version 4.14.1

eversonleal commented 1 year ago

Same after upgrade from 4.8.0 to 4.14.2

emperortomato commented 1 year ago

This is still an issue on Rundeck 4.15.0. Loading an execution with 278 nodes takes between 35 and 60 seconds depending on the browser (and how many other things I'm doing on my client machine). Chrome seems to load the executions faster than Firefox. The page is hung and unusable until all nodes have been loaded.

There is no noticeable increase in load on the server side, but there is a very noticeable increase in CPU load on the client system loading the page. I'm running on a system with a fairly old CPU (quad core i5-4590) and my CPU use jumps from 40% -> 80-100% when loading this example execution.

sbriglio commented 1 year ago

Fixed in 4.16.0