openshift-scale / workloads

Tool an OpenShift cluster and Run OpenShift Performance and Scale Workloads
Apache License 2.0

N^2 pbench data being collected #73

Open · portante opened this issue 5 years ago

portante commented 5 years ago

There is evidence to suggest that, because of lines 22 - 33 of master/workloads/templates/workload-baseline-script-cm.yml.j2, all nodes are copying tool data from all other nodes to their local file systems. This seems to result in the controller's stored tool data containing multiple unexpected sub-directories holding that copied data.

If lines 22-33 were executed only on the pbench controller, so that only the controller's tools-default directory listed the remote hosts, this would not happen.

This is resulting in unexpectedly large tarballs of pbench data being collected.
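
For illustration, a minimal sketch of the suspected difference, assuming the template writes remote@<host> marker files into tools-default (the variable names and layout here are hypothetical, not the actual workload-baseline-script-cm.yml.j2 content):

# Hypothetical sketch, not the real template content.
# Reported behavior (bad): every pod records every remote host, so every
# node later pulls tool data from every other node.
for host in ${REMOTE_HOSTS}; do
    echo "${LABEL}" > "/var/lib/pbench-agent/tools-default/remote@${host}"
done

# Intended behavior (good): only the pbench controller records the remotes.
if [ "$(hostname)" = "${PBENCH_CONTROLLER}" ]; then
    for host in ${REMOTE_HOSTS}; do
        echo "${LABEL}" > "/var/lib/pbench-agent/tools-default/remote@${host}"
    done
fi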

ekuric commented 5 years ago

It only creates the label file:

# for m in $(oc get pods  | awk '{print $1}'  |grep -v NAME ); do oc exec $m -- hostname;  oc exec $m -- cat /var/lib/pbench-agent/tools-default/label; done 
ip-10-0-162-181
infraip-10-0-130-83
infraip-10-0-180-33
infraip-10-0-129-183
masterip-10-0-128-162
workerip-10-0-133-71
workerip-10-0-136-159
workload

For the fio test we have the organization below once the pbench tools are registered. I think it is not recursive. @akrzos @chaitanyaenr, opinions?

# for m in $(oc get pods  | awk '{print $1}'  |grep -v NAME ); do oc exec $m -- hostname;  oc exec $m -- ls -l /var/lib/pbench-agent/tools-default; done 
ip-10-0-162-181
total 0
lrwxrwxrwx. 1 root root 11 Aug  5 09:40 disk -> ..data/disk
lrwxrwxrwx. 1 root root 13 Aug  5 09:40 iostat -> ..data/iostat
lrwxrwxrwx. 1 root root 12 Aug  5 09:40 label -> ..data/label
lrwxrwxrwx. 1 root root 13 Aug  5 09:40 mpstat -> ..data/mpstat
lrwxrwxrwx. 1 root root 11 Aug  5 09:40 perf -> ..data/perf
lrwxrwxrwx. 1 root root 14 Aug  5 09:40 pidstat -> ..data/pidstat
lrwxrwxrwx. 1 root root 10 Aug  5 09:40 sar -> ..data/sar
ip-10-0-130-83
total 0
lrwxrwxrwx. 1 root root 11 Aug  5 09:40 disk -> ..data/disk
lrwxrwxrwx. 1 root root 13 Aug  5 09:40 iostat -> ..data/iostat
lrwxrwxrwx. 1 root root 12 Aug  5 09:40 label -> ..data/label
lrwxrwxrwx. 1 root root 13 Aug  5 09:40 mpstat -> ..data/mpstat
lrwxrwxrwx. 1 root root 11 Aug  5 09:40 perf -> ..data/perf
lrwxrwxrwx. 1 root root 14 Aug  5 09:40 pidstat -> ..data/pidstat
lrwxrwxrwx. 1 root root 10 Aug  5 09:40 sar -> ..data/sar
ip-10-0-180-33
total 0
lrwxrwxrwx. 1 root root 11 Aug  5 09:40 disk -> ..data/disk
lrwxrwxrwx. 1 root root 13 Aug  5 09:40 iostat -> ..data/iostat
lrwxrwxrwx. 1 root root 12 Aug  5 09:40 label -> ..data/label
lrwxrwxrwx. 1 root root 13 Aug  5 09:40 mpstat -> ..data/mpstat
lrwxrwxrwx. 1 root root 11 Aug  5 09:40 perf -> ..data/perf
lrwxrwxrwx. 1 root root 14 Aug  5 09:40 pidstat -> ..data/pidstat
lrwxrwxrwx. 1 root root 10 Aug  5 09:40 sar -> ..data/sar
ip-10-0-129-183
total 0
lrwxrwxrwx. 1 root root 11 Aug  5 09:40 disk -> ..data/disk
lrwxrwxrwx. 1 root root 13 Aug  5 09:40 iostat -> ..data/iostat
lrwxrwxrwx. 1 root root 12 Aug  5 09:40 label -> ..data/label
lrwxrwxrwx. 1 root root 13 Aug  5 09:40 mpstat -> ..data/mpstat
lrwxrwxrwx. 1 root root  9 Aug  5 09:40 oc -> ..data/oc
lrwxrwxrwx. 1 root root 11 Aug  5 09:40 perf -> ..data/perf
lrwxrwxrwx. 1 root root 14 Aug  5 09:40 pidstat -> ..data/pidstat
lrwxrwxrwx. 1 root root 10 Aug  5 09:40 sar -> ..data/sar
ip-10-0-128-162
total 0
lrwxrwxrwx. 1 root root 11 Aug  5 09:40 disk -> ..data/disk
lrwxrwxrwx. 1 root root 13 Aug  5 09:40 iostat -> ..data/iostat
lrwxrwxrwx. 1 root root 12 Aug  5 09:40 label -> ..data/label
lrwxrwxrwx. 1 root root 13 Aug  5 09:40 mpstat -> ..data/mpstat
lrwxrwxrwx. 1 root root 11 Aug  5 09:40 perf -> ..data/perf
lrwxrwxrwx. 1 root root 14 Aug  5 09:40 pidstat -> ..data/pidstat
lrwxrwxrwx. 1 root root 10 Aug  5 09:40 sar -> ..data/sar
ip-10-0-133-71
total 0
lrwxrwxrwx. 1 root root 11 Aug  5 09:40 disk -> ..data/disk
lrwxrwxrwx. 1 root root 13 Aug  5 09:40 iostat -> ..data/iostat
lrwxrwxrwx. 1 root root 12 Aug  5 09:40 label -> ..data/label
lrwxrwxrwx. 1 root root 13 Aug  5 09:40 mpstat -> ..data/mpstat
lrwxrwxrwx. 1 root root 11 Aug  5 09:40 perf -> ..data/perf
lrwxrwxrwx. 1 root root 14 Aug  5 09:40 pidstat -> ..data/pidstat
lrwxrwxrwx. 1 root root 10 Aug  5 09:40 sar -> ..data/sar
ip-10-0-136-159
total 48
-rw-r--r--. 1 root root 1 Aug  6 09:27 disk
-rw-r--r--. 1 root root 1 Aug  6 09:27 iostat
-rw-r--r--. 1 root root 9 Aug  6 09:27 label
-rw-r--r--. 1 root root 1 Aug  6 09:27 mpstat
-rw-r--r--. 1 root root 1 Aug  6 09:27 oc
-rw-r--r--. 1 root root 1 Aug  6 09:27 perf
-rw-r--r--. 1 root root 1 Aug  6 09:27 pidstat
-rw-r--r--. 1 root root 7 Aug  6 09:27 remote@ip-10-0-128-162.us-west-2.compute.internal
-rw-r--r--. 1 root root 7 Aug  6 09:27 remote@ip-10-0-129-183.us-west-2.compute.internal
-rw-r--r--. 1 root root 6 Aug  6 09:27 remote@ip-10-0-130-83.us-west-2.compute.internal
-rw-r--r--. 1 root root 7 Aug  6 09:27 remote@ip-10-0-133-71.us-west-2.compute.internal
-rw-r--r--. 1 root root 1 Aug  6 09:27 sar
akrzos commented 5 years ago

There is evidence to suggest that, because of lines 22 - 33 of master/workloads/templates/workload-baseline-script-cm.yml.j2, all nodes are copying tool data from all other nodes to their local file systems. This seems to result in the controller's stored tool data containing multiple unexpected sub-directories holding that copied data.

If lines 22-33 were executed only on the pbench controller, so that only the controller's tools-default directory listed the remote hosts, this would not happen.

This is resulting in unexpectedly large tarballs of pbench data being collected.

Hi @portante, can you provide the evidence so we can investigate ourselves?

My first impression is that this should not be occurring: logically, this code simply prints the hostnames of the nodes selected by label for starting/stopping/copying tool data, as @ekuric has pointed out in his example.
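
As a hedged sketch of what "printing hostnames of labeled nodes" amounts to (the role labels and jsonpath below are illustrative assumptions, not the template's actual code):

# Illustrative only: list the hostnames of nodes carrying each role label.
for role in master infra worker; do
    oc get nodes -l "node-role.kubernetes.io/${role}=" \
        -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}'
done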

portante commented 5 years ago

@akrzos, @ekuric, yes, the output shown above of the contents of the tools-default directory in all the pods looks healthy; that is to say, only one pod (which should be the controller) has all of the remote@* files in the directory.

As for evidence, I have an example from April 8th on our internal server (I can't share the URL here), but the output from a BAD tools directory looks like:

-bash-4.2$ pwd
/pbench/public_html/incoming/EC2::ip-10-0-13-31/uperf_node-to-node_UDP_2019.04.08T12.29.15/1-udp_stream-64B-1i/sample1/tools-default
-bash-4.2$ ls -l
total 12
drwxr-xr-x  3 pbench pbench  154 Apr  8 19:22 iostat
drwxr-xr-x 18 pbench pbench 4096 Apr  8 19:22 ip-10-0-13-31
drwxr-xr-x  3 pbench pbench  256 Apr  8 19:22 mpstat
drwxr-xr-x  2 pbench pbench  280 Apr  8 19:22 perf
drwxr-xr-x  4 pbench pbench 4096 Apr  8 19:22 pidstat
drwxr-xr-x  3 pbench pbench  302 Apr  8 19:22 proc-interrupts
drwxr-xr-x  3 pbench pbench  183 Apr  8 19:22 proc-vmstat
drwxr-xr-x  3 pbench pbench 4096 Apr  8 19:22 sar
drwxr-xr-x  8 pbench pbench  112 Apr  8 19:22 svt_infra_1:ip-10-0-129-28.us-west-2.compute.internal
drwxr-xr-x  8 pbench pbench  112 Apr  8 19:22 svt_infra_2:ip-10-0-159-137.us-west-2.compute.internal
drwxr-xr-x  8 pbench pbench  112 Apr  8 19:22 svt_infra_3:ip-10-0-174-109.us-west-2.compute.internal
drwxr-xr-x 11 pbench pbench  179 Apr  8 19:22 svt_master_1_etcd_1:ip-10-0-131-131.us-west-2.compute.internal
drwxr-xr-x 10 pbench pbench  149 Apr  8 19:22 svt_master_2_etcd_2:ip-10-0-159-74.us-west-2.compute.internal
drwxr-xr-x 10 pbench pbench  149 Apr  8 19:22 svt_master_3_etcd_3:ip-10-0-173-64.us-west-2.compute.internal
drwxr-xr-x  8 pbench pbench  112 Apr  8 19:22 svt_node_1:ip-10-0-136-141.us-west-2.compute.internal
drwxr-xr-x  8 pbench pbench  112 Apr  8 19:22 svt_node_2:ip-10-0-153-144.us-west-2.compute.internal
drwxr-xr-x  2 pbench pbench   99 Apr  8 19:22 turbostat
-bash-4.2$ ls -l ip-10-0-13-31 svt_*
ip-10-0-13-31:
total 8
drwxr-xr-x  3 pbench pbench  154 Apr  8 19:22 iostat
drwxr-xr-x  3 pbench pbench  256 Apr  8 19:22 mpstat
drwxr-xr-x  2 pbench pbench  280 Apr  8 19:22 perf
drwxr-xr-x  4 pbench pbench 4096 Apr  8 19:22 pidstat
drwxr-xr-x  3 pbench pbench  302 Apr  8 19:22 proc-interrupts
drwxr-xr-x  3 pbench pbench  183 Apr  8 19:22 proc-vmstat
drwxr-xr-x  3 pbench pbench 4096 Apr  8 19:22 sar
drwxr-xr-x  8 pbench pbench  112 Apr  8 19:22 svt_infra_1:ip-10-0-129-28.us-west-2.compute.internal
drwxr-xr-x  8 pbench pbench  112 Apr  8 19:22 svt_infra_2:ip-10-0-159-137.us-west-2.compute.internal
drwxr-xr-x  8 pbench pbench  112 Apr  8 19:22 svt_infra_3:ip-10-0-174-109.us-west-2.compute.internal
drwxr-xr-x 11 pbench pbench  179 Apr  8 19:22 svt_master_1_etcd_1:ip-10-0-131-131.us-west-2.compute.internal
drwxr-xr-x 10 pbench pbench  149 Apr  8 19:22 svt_master_2_etcd_2:ip-10-0-159-74.us-west-2.compute.internal
drwxr-xr-x 10 pbench pbench  149 Apr  8 19:22 svt_master_3_etcd_3:ip-10-0-173-64.us-west-2.compute.internal
drwxr-xr-x  8 pbench pbench  112 Apr  8 19:22 svt_node_1:ip-10-0-136-141.us-west-2.compute.internal
drwxr-xr-x  8 pbench pbench  112 Apr  8 19:22 svt_node_2:ip-10-0-153-144.us-west-2.compute.internal
drwxr-xr-x  2 pbench pbench   99 Apr  8 19:22 turbostat

svt_infra_1:ip-10-0-129-28.us-west-2.compute.internal:
total 8
drwxr-xr-x 3 pbench pbench  158 Apr  8 19:22 disk
drwxr-xr-x 3 pbench pbench  154 Apr  8 19:22 iostat
drwxr-xr-x 3 pbench pbench  256 Apr  8 19:22 mpstat
drwxr-xr-x 2 pbench pbench  280 Apr  8 19:22 perf
drwxr-xr-x 4 pbench pbench 4096 Apr  8 19:22 pidstat
drwxr-xr-x 3 pbench pbench 4096 Apr  8 19:22 sar

svt_infra_2:ip-10-0-159-137.us-west-2.compute.internal:
total 8
drwxr-xr-x 3 pbench pbench  158 Apr  8 19:22 disk
drwxr-xr-x 3 pbench pbench  154 Apr  8 19:22 iostat
drwxr-xr-x 3 pbench pbench  256 Apr  8 19:22 mpstat
drwxr-xr-x 2 pbench pbench  280 Apr  8 19:22 perf
drwxr-xr-x 4 pbench pbench 4096 Apr  8 19:22 pidstat
drwxr-xr-x 3 pbench pbench 4096 Apr  8 19:22 sar

svt_infra_3:ip-10-0-174-109.us-west-2.compute.internal:
total 8
drwxr-xr-x 3 pbench pbench  158 Apr  8 19:22 disk
drwxr-xr-x 3 pbench pbench  154 Apr  8 19:22 iostat
drwxr-xr-x 3 pbench pbench  256 Apr  8 19:22 mpstat
drwxr-xr-x 2 pbench pbench  280 Apr  8 19:22 perf
drwxr-xr-x 4 pbench pbench 4096 Apr  8 19:22 pidstat
drwxr-xr-x 3 pbench pbench 4096 Apr  8 19:22 sar

svt_master_1_etcd_1:ip-10-0-131-131.us-west-2.compute.internal:
total 12
drwxr-xr-x 3 pbench pbench  158 Apr  8 19:22 disk
drwxr-xr-x 5 pbench pbench  176 Apr  8 19:22 haproxy-ocp
drwxr-xr-x 3 pbench pbench  154 Apr  8 19:22 iostat
drwxr-xr-x 3 pbench pbench  256 Apr  8 19:22 mpstat
drwxr-xr-x 2 pbench pbench  179 Apr  8 19:22 oc
drwxr-xr-x 2 pbench pbench  280 Apr  8 19:22 perf
drwxr-xr-x 4 pbench pbench 4096 Apr  8 19:22 pidstat
drwxr-xr-x 3 pbench pbench 4096 Apr  8 19:22 prometheus-metrics
drwxr-xr-x 3 pbench pbench 4096 Apr  8 19:22 sar

svt_master_2_etcd_2:ip-10-0-159-74.us-west-2.compute.internal:
total 8
drwxr-xr-x 3 pbench pbench  158 Apr  8 19:22 disk
drwxr-xr-x 5 pbench pbench  176 Apr  8 19:22 haproxy-ocp
drwxr-xr-x 3 pbench pbench  154 Apr  8 19:22 iostat
drwxr-xr-x 3 pbench pbench  256 Apr  8 19:22 mpstat
drwxr-xr-x 2 pbench pbench  179 Apr  8 19:22 oc
drwxr-xr-x 2 pbench pbench  280 Apr  8 19:22 perf
drwxr-xr-x 4 pbench pbench 4096 Apr  8 19:22 pidstat
drwxr-xr-x 3 pbench pbench 4096 Apr  8 19:22 sar

svt_master_3_etcd_3:ip-10-0-173-64.us-west-2.compute.internal:
total 8
drwxr-xr-x 3 pbench pbench  158 Apr  8 19:22 disk
drwxr-xr-x 5 pbench pbench  176 Apr  8 19:22 haproxy-ocp
drwxr-xr-x 3 pbench pbench  154 Apr  8 19:22 iostat
drwxr-xr-x 3 pbench pbench  256 Apr  8 19:22 mpstat
drwxr-xr-x 2 pbench pbench  179 Apr  8 19:22 oc
drwxr-xr-x 2 pbench pbench  280 Apr  8 19:22 perf
drwxr-xr-x 4 pbench pbench 4096 Apr  8 19:22 pidstat
drwxr-xr-x 3 pbench pbench 4096 Apr  8 19:22 sar

svt_node_1:ip-10-0-136-141.us-west-2.compute.internal:
total 8
drwxr-xr-x 3 pbench pbench  158 Apr  8 19:22 disk
drwxr-xr-x 3 pbench pbench  154 Apr  8 19:22 iostat
drwxr-xr-x 3 pbench pbench  256 Apr  8 19:22 mpstat
drwxr-xr-x 2 pbench pbench  280 Apr  8 19:22 perf
drwxr-xr-x 4 pbench pbench 4096 Apr  8 19:22 pidstat
drwxr-xr-x 3 pbench pbench 4096 Apr  8 19:22 sar

svt_node_2:ip-10-0-153-144.us-west-2.compute.internal:
total 8
drwxr-xr-x 3 pbench pbench  158 Apr  8 19:22 disk
drwxr-xr-x 3 pbench pbench  154 Apr  8 19:22 iostat
drwxr-xr-x 3 pbench pbench  256 Apr  8 19:22 mpstat
drwxr-xr-x 2 pbench pbench  280 Apr  8 19:22 perf
drwxr-xr-x 4 pbench pbench 4096 Apr  8 19:22 pidstat
drwxr-xr-x 3 pbench pbench 4096 Apr  8 19:22 sar

It should look like:

-bash-4.2$ pwd
/pbench/public_html/incoming/EC2::ip-10-0-13-31/uperf_node-to-node_UDP_2019.04.08T12.29.15/1-udp_stream-64B-1i/sample1/tools-default
-bash-4.2$ ls -l
total 9
drwxr-xr-x 18 pbench pbench 4096 Apr  8 19:22 ip-10-0-13-31
drwxr-xr-x  8 pbench pbench  112 Apr  8 19:22 svt_infra_1:ip-10-0-129-28.us-west-2.compute.internal
drwxr-xr-x  8 pbench pbench  112 Apr  8 19:22 svt_infra_2:ip-10-0-159-137.us-west-2.compute.internal
drwxr-xr-x  8 pbench pbench  112 Apr  8 19:22 svt_infra_3:ip-10-0-174-109.us-west-2.compute.internal
drwxr-xr-x 11 pbench pbench  179 Apr  8 19:22 svt_master_1_etcd_1:ip-10-0-131-131.us-west-2.compute.internal
drwxr-xr-x 10 pbench pbench  149 Apr  8 19:22 svt_master_2_etcd_2:ip-10-0-159-74.us-west-2.compute.internal
drwxr-xr-x 10 pbench pbench  149 Apr  8 19:22 svt_master_3_etcd_3:ip-10-0-173-64.us-west-2.compute.internal
drwxr-xr-x  8 pbench pbench  112 Apr  8 19:22 svt_node_1:ip-10-0-136-141.us-west-2.compute.internal
drwxr-xr-x  8 pbench pbench  112 Apr  8 19:22 svt_node_2:ip-10-0-153-144.us-west-2.compute.internal
-bash-4.2$ ls -l ip-10-0-13-31 svt_*
ip-10-0-13-31:
total 8
drwxr-xr-x  3 pbench pbench  154 Apr  8 19:22 iostat
drwxr-xr-x  3 pbench pbench  256 Apr  8 19:22 mpstat
drwxr-xr-x  2 pbench pbench  280 Apr  8 19:22 perf
drwxr-xr-x  4 pbench pbench 4096 Apr  8 19:22 pidstat
drwxr-xr-x  3 pbench pbench  302 Apr  8 19:22 proc-interrupts
drwxr-xr-x  3 pbench pbench  183 Apr  8 19:22 proc-vmstat
drwxr-xr-x  3 pbench pbench 4096 Apr  8 19:22 sar
drwxr-xr-x  2 pbench pbench   99 Apr  8 19:22 turbostat

svt_infra_1:ip-10-0-129-28.us-west-2.compute.internal:
total 8
drwxr-xr-x 3 pbench pbench  158 Apr  8 19:22 disk
drwxr-xr-x 3 pbench pbench  154 Apr  8 19:22 iostat
drwxr-xr-x 3 pbench pbench  256 Apr  8 19:22 mpstat
drwxr-xr-x 2 pbench pbench  280 Apr  8 19:22 perf
drwxr-xr-x 4 pbench pbench 4096 Apr  8 19:22 pidstat
drwxr-xr-x 3 pbench pbench 4096 Apr  8 19:22 sar

svt_infra_2:ip-10-0-159-137.us-west-2.compute.internal:
total 8
drwxr-xr-x 3 pbench pbench  158 Apr  8 19:22 disk
drwxr-xr-x 3 pbench pbench  154 Apr  8 19:22 iostat
drwxr-xr-x 3 pbench pbench  256 Apr  8 19:22 mpstat
drwxr-xr-x 2 pbench pbench  280 Apr  8 19:22 perf
drwxr-xr-x 4 pbench pbench 4096 Apr  8 19:22 pidstat
drwxr-xr-x 3 pbench pbench 4096 Apr  8 19:22 sar

svt_infra_3:ip-10-0-174-109.us-west-2.compute.internal:
total 8
drwxr-xr-x 3 pbench pbench  158 Apr  8 19:22 disk
drwxr-xr-x 3 pbench pbench  154 Apr  8 19:22 iostat
drwxr-xr-x 3 pbench pbench  256 Apr  8 19:22 mpstat
drwxr-xr-x 2 pbench pbench  280 Apr  8 19:22 perf
drwxr-xr-x 4 pbench pbench 4096 Apr  8 19:22 pidstat
drwxr-xr-x 3 pbench pbench 4096 Apr  8 19:22 sar

svt_master_1_etcd_1:ip-10-0-131-131.us-west-2.compute.internal:
total 12
drwxr-xr-x 3 pbench pbench  158 Apr  8 19:22 disk
drwxr-xr-x 5 pbench pbench  176 Apr  8 19:22 haproxy-ocp
drwxr-xr-x 3 pbench pbench  154 Apr  8 19:22 iostat
drwxr-xr-x 3 pbench pbench  256 Apr  8 19:22 mpstat
drwxr-xr-x 2 pbench pbench  179 Apr  8 19:22 oc
drwxr-xr-x 2 pbench pbench  280 Apr  8 19:22 perf
drwxr-xr-x 4 pbench pbench 4096 Apr  8 19:22 pidstat
drwxr-xr-x 3 pbench pbench 4096 Apr  8 19:22 prometheus-metrics
drwxr-xr-x 3 pbench pbench 4096 Apr  8 19:22 sar

svt_master_2_etcd_2:ip-10-0-159-74.us-west-2.compute.internal:
total 8
drwxr-xr-x 3 pbench pbench  158 Apr  8 19:22 disk
drwxr-xr-x 5 pbench pbench  176 Apr  8 19:22 haproxy-ocp
drwxr-xr-x 3 pbench pbench  154 Apr  8 19:22 iostat
drwxr-xr-x 3 pbench pbench  256 Apr  8 19:22 mpstat
drwxr-xr-x 2 pbench pbench  179 Apr  8 19:22 oc
drwxr-xr-x 2 pbench pbench  280 Apr  8 19:22 perf
drwxr-xr-x 4 pbench pbench 4096 Apr  8 19:22 pidstat
drwxr-xr-x 3 pbench pbench 4096 Apr  8 19:22 sar

svt_master_3_etcd_3:ip-10-0-173-64.us-west-2.compute.internal:
total 8
drwxr-xr-x 3 pbench pbench  158 Apr  8 19:22 disk
drwxr-xr-x 5 pbench pbench  176 Apr  8 19:22 haproxy-ocp
drwxr-xr-x 3 pbench pbench  154 Apr  8 19:22 iostat
drwxr-xr-x 3 pbench pbench  256 Apr  8 19:22 mpstat
drwxr-xr-x 2 pbench pbench  179 Apr  8 19:22 oc
drwxr-xr-x 2 pbench pbench  280 Apr  8 19:22 perf
drwxr-xr-x 4 pbench pbench 4096 Apr  8 19:22 pidstat
drwxr-xr-x 3 pbench pbench 4096 Apr  8 19:22 sar

svt_node_1:ip-10-0-136-141.us-west-2.compute.internal:
total 8
drwxr-xr-x 3 pbench pbench  158 Apr  8 19:22 disk
drwxr-xr-x 3 pbench pbench  154 Apr  8 19:22 iostat
drwxr-xr-x 3 pbench pbench  256 Apr  8 19:22 mpstat
drwxr-xr-x 2 pbench pbench  280 Apr  8 19:22 perf
drwxr-xr-x 4 pbench pbench 4096 Apr  8 19:22 pidstat
drwxr-xr-x 3 pbench pbench 4096 Apr  8 19:22 sar

svt_node_2:ip-10-0-153-144.us-west-2.compute.internal:
total 8
drwxr-xr-x 3 pbench pbench  158 Apr  8 19:22 disk
drwxr-xr-x 3 pbench pbench  154 Apr  8 19:22 iostat
drwxr-xr-x 3 pbench pbench  256 Apr  8 19:22 mpstat
drwxr-xr-x 2 pbench pbench  280 Apr  8 19:22 perf
drwxr-xr-x 4 pbench pbench 4096 Apr  8 19:22 pidstat
drwxr-xr-x 3 pbench pbench 4096 Apr  8 19:22 sar

This results from having the "remote@*" files in the tool directories on all hosts, so pbench-postprocess-tools on each host ends up trying to copy the data from every other host to itself. Depending on the order in which tools execute and how much happens in parallel, this can result in N^2 data copies: with N tool hosts, each host pulls from the other N-1, which is on the order of N*(N-1) copies instead of the N copies the controller alone should make.

To prevent this: using pbench-register-tool[-set] only on the pbench "controller" (the node from which pbench-user-benchmark, pbench-fio, etc. is run) will prevent that behavior.

Since you can't rely on pbench-register-tool[-set] today, making sure that the remote@* files land only on the pbench "controller" would be sufficient to fix this.
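
A hedged sketch of the register-only-on-the-controller approach (host names are taken from the example output above, run-workload.sh is a placeholder, and exact option names may differ between pbench-agent versions):

# Run only on the pbench controller. Each remote registration drops a
# remote@<host> entry only in the controller's tools-default directory,
# so only the controller later collects tool data from the remotes.
for host in ip-10-0-128-162 ip-10-0-129-183 ip-10-0-130-83; do
    pbench-register-tool-set --remote="${host}"
done

# Then drive the benchmark from the controller as usual, e.g.:
pbench-user-benchmark -- ./run-workload.sh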

akrzos commented 5 years ago

So this must be from the old svt version of the tooling, because openshift-scale/workloads didn't exist until late May (see the first commit: https://github.com/openshift-scale/workloads/commit/fa15713e362cdd910dc5bcaf58be7a500f406103).

This is further illustrated by the labels beginning with "svt_" in the example. The labels from a workload run here will resemble:

[DIR]   master:ip-10-0-142-72.us-west-2.compute.internal/   2019-08-06 16:58    -    
[DIR]   master:ip-10-0-146-209.us-west-2.compute.internal/  2019-08-06 16:58    -    
[DIR]   master:ip-10-0-175-107.us-west-2.compute.internal/  2019-08-06 16:58    -    
[DIR]   worker:ip-10-0-134-97.us-west-2.compute.internal/   2019-08-06 16:58    -    
[DIR]   worker:ip-10-0-140-49.us-west-2.compute.internal/   2019-08-06 16:58    -    
[DIR]   workload:ip-10-0-14-211/    2019-08-06 16:58    -    

Lastly, mounted ConfigMaps are read-only:

root@ip-172-31-32-21: ~ # oc -n scale-ci-tooling rsh pbench-agent-master-8g47h
sh-4.2# cd /var/lib/pbench-agent/tools-default
sh-4.2# pwd
/var/lib/pbench-agent/tools-default
sh-4.2# touch test
touch: cannot touch 'test': Read-only file system

Thus this condition should never occur here, since the ConfigMap would override whatever the controller is writing.
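
A quick way to confirm that, using the same pod and namespace shown above, might be to inspect the pod spec for the ConfigMap-backed volume and its mount:

# ConfigMap volumes are mounted read-only, so nothing in the pod can add
# files such as remote@<host> to the mounted tools-default directory.
oc -n scale-ci-tooling get pod pbench-agent-master-8g47h -o yaml \
    | grep -B 2 -A 4 'configMap\|tools-default'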

portante commented 5 years ago

I see this on other more recent runs as well:

/pbench/public_html/incoming/master-0/uperf_pod-to-pod-LB_PODS_10_UDP_2019.06.19T20.43.04/1-udp_stream-64B-1i/sample1/tools-default/iostat
/pbench/public_html/incoming/master-0/uperf_pod-to-pod-LB_PODS_10_UDP_2019.06.19T20.43.04/1-udp_stream-64B-1i/sample1/tools-default/mpstat
/pbench/public_html/incoming/master-0/uperf_pod-to-pod-LB_PODS_10_UDP_2019.06.19T20.43.04/1-udp_stream-64B-1i/sample1/tools-default/perf
/pbench/public_html/incoming/master-0/uperf_pod-to-pod-LB_PODS_10_UDP_2019.06.19T20.43.04/1-udp_stream-64B-1i/sample1/tools-default/pidstat
/pbench/public_html/incoming/master-0/uperf_pod-to-pod-LB_PODS_10_UDP_2019.06.19T20.43.04/1-udp_stream-64B-1i/sample1/tools-default/pod-to-pod-LB_PODS_10:master-0
/pbench/public_html/incoming/master-0/uperf_pod-to-pod-LB_PODS_10_UDP_2019.06.19T20.43.04/1-udp_stream-64B-1i/sample1/tools-default/pod-to-pod-NN_PODS_1:app-node-20
/pbench/public_html/incoming/master-0/uperf_pod-to-pod-LB_PODS_10_UDP_2019.06.19T20.43.04/1-udp_stream-64B-1i/sample1/tools-default/pod-to-pod-NN_PODS_1:app-node-20.openshift.example.com
/pbench/public_html/incoming/master-0/uperf_pod-to-pod-LB_PODS_10_UDP_2019.06.19T20.43.04/1-udp_stream-64B-1i/sample1/tools-default/pod-to-pod-NN_PODS_1:app-node-6
/pbench/public_html/incoming/master-0/uperf_pod-to-pod-LB_PODS_10_UDP_2019.06.19T20.43.04/1-udp_stream-64B-1i/sample1/tools-default/pod-to-pod-NN_PODS_1:app-node-6.openshift.example.com
/pbench/public_html/incoming/master-0/uperf_pod-to-pod-LB_PODS_10_UDP_2019.06.19T20.43.04/1-udp_stream-64B-1i/sample1/tools-default/pod-to-pod-NN_PODS_1:app-node-7
/pbench/public_html/incoming/master-0/uperf_pod-to-pod-LB_PODS_10_UDP_2019.06.19T20.43.04/1-udp_stream-64B-1i/sample1/tools-default/pod-to-pod-NN_PODS_1:app-node-7.openshift.example.com
/pbench/public_html/incoming/master-0/uperf_pod-to-pod-LB_PODS_10_UDP_2019.06.19T20.43.04/1-udp_stream-64B-1i/sample1/tools-default/proc-interrupts
/pbench/public_html/incoming/master-0/uperf_pod-to-pod-LB_PODS_10_UDP_2019.06.19T20.43.04/1-udp_stream-64B-1i/sample1/tools-default/proc-vmstat
/pbench/public_html/incoming/master-0/uperf_pod-to-pod-LB_PODS_10_UDP_2019.06.19T20.43.04/1-udp_stream-64B-1i/sample1/tools-default/sar
/pbench/public_html/incoming/master-0/uperf_pod-to-pod-LB_PODS_10_UDP_2019.06.19T20.43.04/1-udp_stream-64B-1i/sample1/tools-default/turbostat
akrzos commented 5 years ago

I see this on other more recent runs as well:

/pbench/public_html/incoming/master-0/uperf_pod-to-pod-LB_PODS_10_UDP_2019.06.19T20.43.04/1-udp_stream-64B-1i/sample1/tools-default/iostat
/pbench/public_html/incoming/master-0/uperf_pod-to-pod-LB_PODS_10_UDP_2019.06.19T20.43.04/1-udp_stream-64B-1i/sample1/tools-default/mpstat
/pbench/public_html/incoming/master-0/uperf_pod-to-pod-LB_PODS_10_UDP_2019.06.19T20.43.04/1-udp_stream-64B-1i/sample1/tools-default/perf
/pbench/public_html/incoming/master-0/uperf_pod-to-pod-LB_PODS_10_UDP_2019.06.19T20.43.04/1-udp_stream-64B-1i/sample1/tools-default/pidstat
/pbench/public_html/incoming/master-0/uperf_pod-to-pod-LB_PODS_10_UDP_2019.06.19T20.43.04/1-udp_stream-64B-1i/sample1/tools-default/pod-to-pod-LB_PODS_10:master-0
/pbench/public_html/incoming/master-0/uperf_pod-to-pod-LB_PODS_10_UDP_2019.06.19T20.43.04/1-udp_stream-64B-1i/sample1/tools-default/pod-to-pod-NN_PODS_1:app-node-20
/pbench/public_html/incoming/master-0/uperf_pod-to-pod-LB_PODS_10_UDP_2019.06.19T20.43.04/1-udp_stream-64B-1i/sample1/tools-default/pod-to-pod-NN_PODS_1:app-node-20.openshift.example.com
/pbench/public_html/incoming/master-0/uperf_pod-to-pod-LB_PODS_10_UDP_2019.06.19T20.43.04/1-udp_stream-64B-1i/sample1/tools-default/pod-to-pod-NN_PODS_1:app-node-6
/pbench/public_html/incoming/master-0/uperf_pod-to-pod-LB_PODS_10_UDP_2019.06.19T20.43.04/1-udp_stream-64B-1i/sample1/tools-default/pod-to-pod-NN_PODS_1:app-node-6.openshift.example.com
/pbench/public_html/incoming/master-0/uperf_pod-to-pod-LB_PODS_10_UDP_2019.06.19T20.43.04/1-udp_stream-64B-1i/sample1/tools-default/pod-to-pod-NN_PODS_1:app-node-7
/pbench/public_html/incoming/master-0/uperf_pod-to-pod-LB_PODS_10_UDP_2019.06.19T20.43.04/1-udp_stream-64B-1i/sample1/tools-default/pod-to-pod-NN_PODS_1:app-node-7.openshift.example.com
/pbench/public_html/incoming/master-0/uperf_pod-to-pod-LB_PODS_10_UDP_2019.06.19T20.43.04/1-udp_stream-64B-1i/sample1/tools-default/proc-interrupts
/pbench/public_html/incoming/master-0/uperf_pod-to-pod-LB_PODS_10_UDP_2019.06.19T20.43.04/1-udp_stream-64B-1i/sample1/tools-default/proc-vmstat
/pbench/public_html/incoming/master-0/uperf_pod-to-pod-LB_PODS_10_UDP_2019.06.19T20.43.04/1-udp_stream-64B-1i/sample1/tools-default/sar
/pbench/public_html/incoming/master-0/uperf_pod-to-pod-LB_PODS_10_UDP_2019.06.19T20.43.04/1-udp_stream-64B-1i/sample1/tools-default/turbostat

This shows the tools proc-interrupts and proc-vmstat, neither of which we register with this tooling (nor ever have), but the svt tooling does. This result also looks to be from OpenShift on OpenStack, which no one on our specific team was running in that time period. Maybe @smalleni or someone from the shiftstack team was running it at that time. Can you try reproducing this with the actual tooling?