jhammond / xltop

continuous Lustre load monitor
GNU General Public License v2.0

difference to values in IB monitor #1

Open mkluge opened 11 years ago

mkluge commented 11 years ago

Hi John,

Do you still maintain xltop? I installed it on a Lustre 2.1.3 cluster and, in parallel, run a small script that queries the IB ports. I see a large difference between the throughput reported by the IB monitor and by "xltop u s". The sum of the throughput over all IB ports of all OSS servers matches the sum over all servers as reported by xltop; the numbers are just distributed differently. As the IB monitor only runs "perfquery -r" once a second, I trust its data more than xltop's. Do you have any idea how to debug this?

Regards, Michael
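
The IB monitor script itself is not shown in this thread. As a rough, hypothetical sketch of what such a monitor might do, the function below (the name `ib_rates` is made up for illustration) parses `perfquery`-style output from stdin and converts the data counters to MB. It assumes lines of the form `PortXmitData:....<value>` and that the data counters are in units of 4 octets; both should be checked against the local infiniband-diags version.

```shell
# Hypothetical helper, not the script used in this thread.
# Reads perfquery-style output on stdin and prints transmit/receive MB/s,
# assuming the counters were reset one second earlier (perfquery -r).
ib_rates() {
    awk -F'[:.]+' '
        # Assumption: data counters are reported in units of 4 octets.
        $1 == "PortXmitData" { xmit = $2 }
        $1 == "PortRcvData"  { rcv  = $2 }
        END {
            printf "xmit %d MB/s, rcv %d MB/s\n",
                   xmit * 4 / 1048576, rcv * 4 / 1048576
        }
    '
}
```

Under those assumptions it could be driven by something like `while sleep 1; do perfquery -r | ib_rates; done`, which yields one counter delta per second with no averaging.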

jhammond commented 11 years ago

Hi Michael,

Sorry for the delay. I thought I responded to you over the weekend but it must have never been sent.

I haven't had any reason to update xltop in some time so I haven't been actively maintaining it.

The difference you see may be explained by the difference in sampling intervals, or by the fact that xltop uses a moving average whereas perfquery -r will give you counter deltas. What values are you using for the tick, window, and interval in your xltop-master.conf?

Best,

John
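
To illustrate the distinction between counter deltas and a moving average, here is a minimal sketch. It is not xltop's actual algorithm: it uses a plain mean over the last `WINDOW` deltas, and the names `compare_rates` and `WINDOW` are hypothetical.

```shell
# Hypothetical illustration only; xltop's real averaging may differ.
# Number of most recent deltas included in the moving average.
WINDOW=5

# compare_rates COUNTER...
# Each argument is a cumulative counter sampled at a fixed interval.
# Prints one line per sample after the first: "<delta> <moving average>".
compare_rates() {
    local prev=$1; shift
    local -a recent=()
    local cur delta sum d
    for cur in "$@"; do
        delta=$((cur - prev))
        prev=$cur
        recent+=("$delta")
        # Drop the oldest delta once more than WINDOW are stored.
        if [ "${#recent[@]}" -gt "$WINDOW" ]; then
            recent=("${recent[@]:1}")
        fi
        sum=0
        for d in "${recent[@]}"; do
            sum=$((sum + d))
        done
        # Integer mean of the stored deltas.
        echo "$delta $((sum / ${#recent[@]}))"
    done
}
```

For example, `compare_rates 0 0 0 10 20 30` models a counter that starts growing by 10 per sample after two idle samples: the delta column jumps to 10 immediately, while the averaged column lags behind it until the window fills with non-zero deltas.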

mkluge commented 11 years ago

Hi John,

tick = 2, window = 5, interval = 5

The benchmark runs for quite a long time (> 5 min) and shows the same values throughout. I have 4 servers, showing 7, 5, 5, and 3 GB/s, while the live IB monitor and the live OST monitor (little scripts I wrote) both show 5 GB/s on all servers all the time.

Regards, Michael

Dr.-Ing. Michael Kluge

Technische Universität Dresden Center for Information Services and High Performance Computing (ZIH) D-01062 Dresden Germany

Contact: Willersbau, Room WIL A 208 Phone: (+49) 351 463-34217 Fax: (+49) 351 463-37773 e-mail: michael.kluge@tu-dresden.de WWW: http://www.tu-dresden.de/zih

jhammond commented 11 years ago

Thanks Michael. I'll take a look at the code. In the meantime, could you try again with (tick, window, interval) = (1, 5, 5) and (5, 5, 5)?

mkluge commented 11 years ago

Hi John,

I did that and took a couple of screenshots (attached). The upper part of each image shows "xltop u s", the middle part a 1-second interval sum of the values in /proc/fs/lustre/obdfilter/scratch-*/stats on taurusoss2, and the lower part a 1-second interval dump of both IB interfaces of taurusoss2.

The oss2_write_phase* screenshots are very interesting. They were taken about 30 seconds after the benchmark started writing, at intervals of about 15-30 seconds. The values that xltop shows for taurusoss2 are somehow never more than a factor of 2 away from the real value. I'm attaching the OSS monitor script, just to make sure ...

Regards, Michael

--- 8< -----------------------------------------------
# OST paths are used as keys, so OLD must be an associative array (bash 4+).
declare -A OLD
for OST in /proc/fs/lustre/obdfilter/scratch-* ; do
    OLD[$OST]=0
done

while true ; do
    sleep 1
    SUM=0
    for OST in /proc/fs/lustre/obdfilter/scratch-* ; do
        NAME=$(echo "$OST" | cut -d / -f 6)
        # Field 7 of the write_bytes line is the cumulative byte count.
        VAL=$(grep write_bytes "$OST/stats" | awk '{print $7}')
        OV=${OLD[$OST]}
        # Bytes written since the previous sample, converted to MB.
        DIFF=$(( (VAL - OV) / 1024 / 1024 ))
        echo "$NAME: $DIFF MB/s"
        OLD[$OST]=$VAL
        SUM=$((SUM + DIFF))
    done
    echo "SUM      : $SUM MB/s"
done
--- 8< -----------------------------------------------
