jhammond / xltop

continuous Lustre load monitor
GNU General Public License v2.0

xltop is a continuous Lustre monitor with batch scheduler integration.

xltop consists of a single master process (xltop-master), an ncurses-based client (xltop), one or more processes that push batch scheduler job mappings (xltop-clusd), and, on each MDS/OSS, a daemon (xltop-servd) that monitors Lustre export stats via /proc. The organization is shown below.

                 xltop-master
                 /    |    \
                /     |     \
               /      |      \
          xltop       |      xltop-servd
    (ncurses client)  |  (MDS/OSS monitor daemon)
                      |
                 xltop-clusd
            (job mapping daemon)

xltop-master maintains a hierarchy of flows between consumers (hosts, jobs, clusters, the universe) and providers (targets (TODO), servers, filesystems, and the universe).
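As a rough illustration of what such a hierarchy implies, the sketch below rolls per-(host, server) traffic up to per-(job, filesystem) traffic. It is not taken from the xltop sources: the function name `roll_up`, the mapping tables, and the host and server names are all hypothetical, and each flow is assumed to be a simple (bytes written, bytes read, requests) tuple.

```python
from collections import defaultdict

# Hypothetical mappings for illustration only. In a real deployment the
# host -> job mapping would come from the batch scheduler and the
# server -> filesystem mapping from the site configuration.
HOST_TO_JOB = {
    "hostA": "1001@ranger",
    "hostB": "1001@ranger",
    "hostC": "1002@ranger",
}
SERV_TO_FS = {"oss1": "ranger-scratch", "oss2": "ranger-scratch", "oss3": "ranger-work"}


def roll_up(host_serv_flows, host_to_job, serv_to_fs):
    """Aggregate (host, server) flows into (job, filesystem) flows.

    Each flow is a tuple (bytes_written, bytes_read, requests). Because a
    host is assumed to run at most one job, all of its traffic is
    attributed to that single job.
    """
    totals = defaultdict(lambda: [0, 0, 0])
    for (host, serv), stats in host_serv_flows.items():
        key = (host_to_job[host], serv_to_fs[serv])
        for i, value in enumerate(stats):
            totals[key][i] += value
    return {key: tuple(values) for key, values in totals.items()}


FLOWS = {
    ("hostA", "oss1"): (10, 1, 5),
    ("hostB", "oss2"): (20, 2, 7),
    ("hostC", "oss3"): (3, 0, 1),
}
JOB_FS = roll_up(FLOWS, HOST_TO_JOB, SERV_TO_FS)
```

The same roll-up could be repeated one level higher (jobs to clusters, filesystems to the universe) by swapping in the next pair of mappings.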

The design of xltop assumes that:
    0) hosts never run more than one job at a time,
    1) file systems may be mounted by multiple clusters,
    2) hosts may belong to multiple lnets,
    3) two servers may use different NIDs for the same host,
    4) the same lnet name (for example o2ib) may be used multiple times.
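One consequence of assumptions 2 to 4 is that a NID string on its own does not identify a host; it has to be interpreted relative to the server that reported it. The sketch below shows one way to model that, with a lookup keyed on the (server, NID) pair. The table, the server and host names, and `resolve_host` are hypothetical and are not xltop's actual data structures.

```python
# Hypothetical lookup table: (server, NID) -> host.
# Keying on the server as well as the NID means the same NID string may map
# to different hosts on different servers, and one host may appear under
# several NIDs.
NID_TABLE = {
    ("oss1", "10.0.0.5@o2ib"): "hostA",
    ("mds1", "10.1.0.5@o2ib"): "hostA",  # same host, different NID
    ("oss9", "10.0.0.5@o2ib"): "hostZ",  # same NID string, unrelated o2ib network
}


def resolve_host(serv, nid, table=NID_TABLE):
    """Return the host that `serv` knows as `nid`, or None if unknown."""
    return table.get((serv, nid))
```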

Client Usage:
    xltop [OPTIONS]... [EXPRESSION...]

Expression may be one of:
    owner=OWNER, clus[=CLUS], job[=JOB], host[=HOST], fs[=FS], serv[=SERV]

Types may be given by their first character; use 'u' and 'v' for the universes.
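The expression grammar is small enough to sketch. The parser below is an illustration of the rules as stated above, not the client's real parsing code: `parse_expression` and `TYPES_BY_INITIAL` are hypothetical names, the type is resolved from the first character only (a simplification), and since the text does not say which universe 'u' and 'v' each denote, they are kept as opaque labels.

```python
# Hypothetical table mapping the first character of a type to its full name.
TYPES_BY_INITIAL = {
    "o": "owner",
    "c": "clus",
    "j": "job",
    "h": "host",
    "f": "fs",
    "s": "serv",
    "u": "u",  # a universe; which one is not specified in the usage text
    "v": "v",  # the other universe
}


def parse_expression(token):
    """Split TYPE[=VALUE] into (type, value); value is None when omitted."""
    name, _, value = token.partition("=")
    if not name or name[0] not in TYPES_BY_INITIAL:
        raise ValueError(f"unrecognized expression: {token!r}")
    kind = TYPES_BY_INITIAL[name[0]]
    value = value or None
    # The usage text writes owner=OWNER without brackets, so a value is required.
    if kind == "owner" and value is None:
        raise ValueError("owner requires a value (owner=OWNER)")
    return kind, value
```

With this sketch, `parse_expression("j=2411369@ranger")` yields `("job", "2411369@ranger")` and `parse_expression("h")` yields `("host", None)`.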

OPTIONS:
    -c, --conf-dir=DIR          read configuration from DIR
    -f, --full-names            show full host, job, and server names
    -h, --help                  display this help and exit
    -i, --interval=SECONDS      update every SECONDS seconds
    -k, --key=KEY1[,KEY2...]    sort results by KEY1,...
    -l, --limit=NUM             limit responses to NUM results
    -p, --remote-port=PORT      connect to master at PORT
    -r, --remote-host=HOST      connect to master on HOST
    -s, --show-sum              show sums rather than rates
    -u, --ubuntu                look snazzy on my terminal (terrible on xterms)

SORTING: Sort keys are case insensitive. Keys containing 'W' select bytes written (as a rate or sum according to use of -s, --show-sum). Similarly keys containing 'R' and 'Q' are interpreted as bytes read and requests.
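A sketch of how such keys could be interpreted follows. It is not xltop's implementation. Two points are assumptions: the letters are checked in the order W, Q, R, chosen so that the column headers WR_MB/S, RD_MB/S, and REQS/S each select their own column (WR_MB/S contains both 'W' and 'R', and REQS/S contains both 'R' and 'Q'); and rows are sorted in descending order, matching the screenshot. The names `sort_field` and `sort_rows` and the field names are hypothetical; the sample values are copied from three rows of the screenshot.

```python
def sort_field(key):
    """Map a user-supplied sort key to a field name, ignoring case."""
    upper = key.upper()
    # Precedence W, then Q, then R is an assumption (see the note above).
    for letter, field in (("W", "wr"), ("Q", "reqs"), ("R", "rd")):
        if letter in upper:
            return field
    raise ValueError(f"unrecognized sort key: {key!r}")


def sort_rows(rows, keys):
    """Sort rows by the fields named in `keys`, largest values first."""
    fields = [sort_field(k) for k in keys]
    return sorted(rows, key=lambda row: tuple(row[f] for f in fields), reverse=True)


ROWS = [
    {"job": "2530142", "wr": 7.066, "rd": 8.501, "reqs": 410.608},
    {"job": "login4", "wr": 38.489, "rd": 55.054, "reqs": 469.943},
    {"job": "2526717", "wr": 321.557, "rd": 5.994, "reqs": 3556.133},
]
BY_READ = [row["job"] for row in sort_rows(ROWS, ["r"])]
BY_WRITE = [row["job"] for row in sort_rows(ROWS, ["W"])]
```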

EXAMPLES:
    xltop job serv (or xltop j s)             # Show traffic between jobs and servers
    xltop h=i101-101.ranger.tacc.utexas.edu   # Show i101-101 to each filesystem
    xltop o=kbdcat h s                        # Show hosts in kbdcat's jobs to /scratch servers
    xltop h fs=ranger-scratch s               # All hosts to /scratch servers
    xltop h s=oss23.ranger.tacc.utexas.edu    # All hosts to oss23
    xltop j=2411369@ranger s                  # Job 2411369 to all servers

Client screenshot:

    FILESYSTEM      MDS/T  LOAD1  LOAD5  LOAD15  TASKS   OSS/T  LOAD1  LOAD5  LOAD15  TASKS  NIDS
    ranger-work       1/1   1.52   3.48    4.41    609   14/84   2.74   2.08    2.09   1347  4212
    ranger-scratch    1/1   0.13   0.20    0.54    584  50/300   2.52   1.94    1.52   1348  4213
    ranger-share      1/1   0.93   1.20    1.72    544    6/36   3.55   1.37    0.90   1203  3960

    JOB      FS              WR_MB/S  RD_MB/S    REQS/S  OWNER     NAME        HOSTS
    2526717  ranger-scratch  321.557    5.994  3556.133  tg803155  NST3.28-r0     20
    login4   ranger-scratch   38.489   55.054   469.943  NONE      NONE            1
    2530927  ranger-scratch   16.526    0.000    39.942  dkcira    Parametric      1
    2529449  ranger-work      11.754    0.000    24.088  bealing   PE-OH           4
    2530975  ranger-work      11.108    0.007    23.620  vishnam2  batch          16
    2529064  ranger-scratch   10.823    0.091    30.231  jc64248   runwrf         12
    2530142  ranger-scratch    9.418    0.948    49.386  subin     hnbo           16
    2530928  ranger-scratch    7.991    0.000    18.092  dkcira    Parametric      1
    2530142  ranger-work       7.066    8.501   410.608  subin     hnbo           16
    2528099  ranger-scratch    6.775    0.000    69.664  dvt5076   wrfnolan       32
    2529489  ranger-work       6.166    0.000    13.467  yhe       tenWcross      32
    2529788  ranger-work       6.116    0.000    13.332  yhe       tenWcross_     32
    2528098  ranger-scratch    6.022    0.000    67.693  dvt5076   wrf_nolan      32
    2530176  ranger-scratch    5.972    0.000    33.241  tg458470  d2o_mol32       8
    2530055  ranger-scratch    5.215    0.000    18.108  yping     bivkp           9
    2529790  ranger-work       5.125    0.000    12.169  yhe       tenWcross      32
    2529920  ranger-work       5.085    0.009    12.057  hli       iascpl         16
    2531057  ranger-work       3.835    0.001     8.347  tibor     ttthermo       60
    2530526  ranger-scratch    2.572    0.000    20.855  pchen     mguadtpc       61
    2530179  ranger-scratch    2.554    0.000    24.010  tg458470  d2o_mol_32      8
    2526200  ranger-scratch    2.128    0.000     5.035  tg803418  ca0.1v0.75      1
    2529517  ranger-scratch    1.839    0.000     3.773  lianyuan  usf_fvcom      32
    2526222  ranger-scratch    1.751    0.000     4.738  tg803418  ca0.6v0.75      1
    2528401  ranger-scratch    1.638    0.000     3.397  gnelson   equil_outs      8
    2531086  ranger-work       1.626    0.340    13.370  cliu0001  schmIin_de     15
    2530273  ranger-work       1.563    0.000     5.624  amorris   Thrust4.00      3
    2528400  ranger-scratch    1.528    0.000     3.151  gnelson   equil_midd      8
    2522312  ranger-work       1.089    0.366     3.846  frederic  romsnep6       16
    2528755  ranger-scratch    1.003    0.000     2.530  tg458315  h1.5dy0.35      1
    2529262  ranger-scratch    0.992    0.000     2.532  tg459103  pip3            4
    2526210  ranger-scratch    0.973    0.006     2.319  tg803418  ca0.3v0.75      1
    2530934  ranger-scratch    0.913    0.000     5.108  pedram    Shear19        24
    2529294  ranger-scratch    0.895    0.000     2.798  tgaborek  wat-PLL-56      5
    2530294  ranger-share      0.848    0.000    10.359  psr53     CMS21st         4
    2529399  ranger-scratch    0.834    0.000     2.173  tg459103  pip2            4
    2531092  ranger-work       0.755    0.000     7.542  avm       aka2n_74        4
    2527596  ranger-scratch    0.721    0.000    36.401  gmcivor   uli_256        16
    1-37 out of 2730                                   xltop - Wed Apr 25 08:01:15 2012

Build requires ncurses, libcurl, libev-4.4, and libconfuse-2.7.