hoffmangroup / segway

Application for semi-automated genomic annotation.
http://segway.hoffmanlab.org/
GNU General Public License v2.0

LSF_UNIT_FOR_LIMITS not used for resource requests #2

Closed: EricR86 closed 10 years ago

EricR86 commented 10 years ago

Original report (BitBucket issue) by Jay Hesselberth (Bitbucket: jayhesselberth, GitHub: jayhesselberth).


Our LSF system is set up with GB as the LSF_UNIT_FOR_LIMITS:

$ badmin showconf mbd | grep UNIT
    LSF_UNIT_FOR_LIMITS = GB

And segway.cluster.lsf parses this correctly as the DIVISOR_FOR_LIMITS from our lsf.conf file.
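
For concreteness, parsing roughly along these lines would turn that setting into a byte divisor. This is just a sketch, not the actual segway source; the helper name and the assumption that DIVISOR_FOR_LIMITS is a byte count are mine:

    # Sketch: map LSF_UNIT_FOR_LIMITS from lsf.conf to a byte divisor.
    # Hypothetical helper, not the actual segway.cluster.lsf parsing code.
    import re

    UNIT_DIVISORS = {"KB": 1 << 10, "MB": 1 << 20, "GB": 1 << 30, "TB": 1 << 40}

    def read_divisor_for_limits(lsf_conf_path, default_unit="MB"):
        """Return the byte size of the configured unit (e.g. 2**30 for GB)."""
        unit = default_unit
        with open(lsf_conf_path) as conf:
            for line in conf:
                match = re.match(r"\s*LSF_UNIT_FOR_LIMITS\s*=\s*(\w+)", line)
                if match:
                    unit = match.group(1).upper()
        return UNIT_DIVISORS.get(unit, UNIT_DIVISORS[default_unit])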

However, for the resource request, DIVISOR_FOR_LIMITS is ignored and the values are specified in MB, regardless of the configured unit:

select[mem>2148 && tmp>48] rusage[mem=2148, tmp=48]

So on our system, segway is asking for 2148 GB, which is awesome, but exceeds the RAM on our compute blades. The jobs stay in PEND status and never run.
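
To make the unit mismatch concrete, here is a small sketch with illustrative numbers (the comparison against node RAM is only indicative):

    # Sketch of the unit mismatch; the numbers are illustrative.
    MB = 1 << 20
    GB = 1 << 30

    requested = 2148                  # the number segway writes into rusage[mem=...]
    intended_bytes = requested * MB   # what segway means: about 2.1 GiB
    applied_bytes = requested * GB    # what LSF applies with LSF_UNIT_FOR_LIMITS=GB

    print(intended_bytes / GB)        # ~2.1
    print(applied_bytes / GB)         # 2148.0, far more RAM than any of our blades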

This comment in lsf.py::make_res_req() suggests that resources are always specified in MB:

    # always specified in MB, unaffected by LSF_UNIT_FOR_LIMITS
    # see Administering Platform LSF: Working with Resources:
    # Understanding Resources

I looked at that document, and don't see the logic for always specifying resources in MB.

So:

  1. Updating to the following in segway.cluster.lsf lets my jobs run (a fuller sketch follows this list):

    mem_usage_mb = ceildiv(mem_usage, DIVISOR_FOR_LIMITS)
    tmp_usage_mb = ceildiv(tmp_usage, DIVISOR_FOR_LIMITS)
  2. Does the comment suggest that we are somehow inappropriately managing LSF? I can't change LSF_UNIT_FOR_LIMITS; it would affect many users.
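
For reference, here is a fuller sketch of the kind of change item 1 describes. It is not the actual segway.cluster.lsf source; ceildiv, DIVISOR_FOR_LIMITS, and the assumption that mem_usage and tmp_usage are byte counts are inferred from the snippet above:

    # Sketch only; the real segway.cluster.lsf make_res_req differs.
    # Assumes mem_usage/tmp_usage are byte counts and that divisor is the byte
    # size of the configured LSF_UNIT_FOR_LIMITS (e.g. 2**30 for GB).

    def ceildiv(numerator, denominator):
        """Integer division, rounding up."""
        return -(-numerator // denominator)

    def make_res_req(mem_usage, tmp_usage, divisor):
        """Build the LSF -R string with amounts in the configured unit."""
        mem = ceildiv(mem_usage, divisor)
        tmp = ceildiv(tmp_usage, divisor)
        return "select[mem>%d && tmp>%d] rusage[mem=%d, tmp=%d]" % (mem, tmp, mem, tmp)

    # With LSF_UNIT_FOR_LIMITS=GB, about 2.1 GiB of memory becomes mem=3 rather
    # than mem=2148:
    print(make_res_req(2148 << 20, 48 << 20, 1 << 30))
    # select[mem>3 && tmp>1] rusage[mem=3, tmp=1]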

EricR86 commented 10 years ago

Original comment by Michael Hoffman (Bitbucket: hoffman, GitHub: michaelmhoffman).


This used to be the case. It seems Platform/IBM have changed it in recent versions.

What version of LSF are you using? What's the output of lsinfo?

EricR86 commented 10 years ago

Original comment by Jay Hesselberth (Bitbucket: jayhesselberth, GitHub: jayhesselberth).


Here is my LSF info:

$ lsinfo -V
    Platform LSF 7.0.6.134609, Sep 04 2009
    Copyright 1992-2009 Platform Computing Corporation

         binary type: linux2.6-glibc2.3-x86_64

I guess it would be possible to set these values based on the LSF version. I can fix it and submit a pull request if you send your version info. It would be nice to know in which version of LSF this behavior changed, if that's the case.

EricR86 commented 10 years ago

Original comment by Michael Hoffman (Bitbucket: hoffman, GitHub: michaelmhoffman).


Can you send it without -V so I can see what the resource information looks like? I am wary of distinguishing based on version rather than something functional.

I no longer have access to an LSF system so I can't test this myself anymore.

EricR86 commented 10 years ago

Original comment by Jay Hesselberth (Bitbucket: jayhesselberth, GitHub: jayhesselberth).


Full output of lsinfo:

#!bash

jhessel@amc-tesla ~
$ lsinfo 
RESOURCE_NAME   TYPE   ORDER  DESCRIPTION
r15s          Numeric   Inc   15-second CPU run queue length
r1m           Numeric   Inc   1-minute CPU run queue length (alias: cpu)
r15m          Numeric   Inc   15-minute CPU run queue length
ut            Numeric   Inc   1-minute CPU utilization (0.0 to 1.0)
pg            Numeric   Inc   Paging rate (pages/second)
io            Numeric   Inc   Disk IO rate (Kbytes/second)
ls            Numeric   Inc   Number of login sessions (alias: login)
it            Numeric   Dec   Idle time (minutes) (alias: idle)
tmp           Numeric   Dec   Disk space in /tmp (Mbytes)
swp           Numeric   Dec   Available swap space (Mbytes) (alias: swap)
mem           Numeric   Dec   Available memory (Mbytes)
root          Numeric   Dec   HPC ELIM
maxroot       Numeric   Dec   HPC ELIM
processes     Numeric   Dec   HPC ELIM
clockskew     Numeric   Dec   HPC ELIM
ncpus         Numeric   Dec   Number of CPUs
ndisks        Numeric   Dec   Number of local disks
maxmem        Numeric   Dec   Maximum memory (Mbytes)
maxswp        Numeric   Dec   Maximum swap space (Mbytes)
maxtmp        Numeric   Dec   Maximum /tmp space (Mbytes)
cpuf          Numeric   Dec   CPU factor
rexpri        Numeric   N/A   Remote execution priority
nprocs        Numeric   Dec   Number of physical processors
ncores        Numeric   Dec   Number of cores per physical processor
nthreads      Numeric   Dec   Number of threads per processor core
server        Boolean   N/A   LSF server host
LSF_Base      Boolean   N/A   Base product
lsf_base      Boolean   N/A   Base product
LSF_Manager   Boolean   N/A   LSF Manager product
lsf_manager   Boolean   N/A   LSF Manager product
LSF_JobSchedu Boolean   N/A   JobScheduler product
lsf_js        Boolean   N/A   JobScheduler product
LSF_Make      Boolean   N/A   Make product
lsf_make      Boolean   N/A   Make product
LSF_Parallel  Boolean   N/A   Parallel product
lsf_parallel  Boolean   N/A   Parallel product
LSF_Analyzer  Boolean   N/A   Analyzer product
lsf_analyzer  Boolean   N/A   Analyzer product
mips          Boolean   N/A   MIPS architecture
sparc         Boolean   N/A   SUN SPARC
hpux          Boolean   N/A   HP-UX UNIX
aix           Boolean   N/A   AIX UNIX
irix          Boolean   N/A   IRIX UNIX
rms           Boolean   N/A   RMS
pset          Boolean   N/A   PSET
dist          Boolean   N/A   DIST
slurm         Boolean   N/A   SLURM
cpuset        Boolean   N/A   CPUSET
solaris       Boolean   N/A   SUN SOLARIS
fs            Boolean   N/A   File server
cs            Boolean   N/A   Compute server
frame         Boolean   N/A   Hosts with FrameMaker licence
bigmem        Boolean   N/A   Hosts with very big memory
diskless      Boolean   N/A   Diskless hosts
alpha         Boolean   N/A   DEC alpha
linux         Boolean   N/A   LINUX UNIX
nt            Boolean   N/A   Windows NT
mpich_gm      Boolean   N/A   MPICH GM MPI
lammpi        Boolean   N/A   LAM MPI
mpichp4       Boolean   N/A   MPICH P4 MPI
mvapich       Boolean   N/A   Infiniband MPI
sca_mpimon    Boolean   N/A   SCALI MPI
ibmmpi        Boolean   N/A   IBM POE MPI
hpmpi         Boolean   N/A   HP MPI
sgimpi        Boolean   N/A   SGI MPI
intelmpi      Boolean   N/A   Intel MPI
crayxt3       Boolean   N/A   Cray XT3 MPI
crayx1        Boolean   N/A   Cray X1 MPI
mpich_mx      Boolean   N/A   MPICH MX MPI
mpichsharemem Boolean   N/A   MPICH Shared Memory
mpich2        Boolean   N/A   MPICH2
mg            Boolean   N/A   Management hosts
openmpi       Boolean   N/A   OPENMPI
bluegene      Boolean   N/A   BLUEGENE
define_ncpus_ Boolean   N/A   ncpus := procs
define_ncpus_ Boolean   N/A   ncpus := cores
define_ncpus_ Boolean   N/A   ncpus := threads
Platform_HPC  Boolean   N/A   platform hpc license
platform_hpc  Boolean   N/A   platform hpc license
fluent        Boolean   N/A   fluent availability
ls_dyna       Boolean   N/A   ls_dyna availability
nastran       Boolean   N/A   nastran availability
pvm           Boolean   N/A   pvm availability
openmp        Boolean   N/A   openmp availability
ansys         Boolean   N/A   ansys availability
blast         Boolean   N/A   blast availability
gaussian      Boolean   N/A   gaussian availability
lion          Boolean   N/A   lion availability
scitegic      Boolean   N/A   scitegic availability
schroedinger  Boolean   N/A   schroedinger availability
hmmer         Boolean   N/A   hmmer availability
type           String   N/A   Host type
model          String   N/A   Host model
status         String   N/A   Host status
hname          String   N/A   Host name

TYPE_NAME
UNKNOWN_AUTO_DETECT
DEFAULT
DEFAULT
CRAYX1
DigitalUNIX
ALPHA5
ALPHASC
HPPA
IBMAIX532
IBMAIX564
LINUX
LINUX2
LINUXAXP
LINUX86
LINUXPPC
LINUX64
DLINUX
DLINUX64
DLINUXAXP
SLINUX
SLINUX64
NECSX6
NECSX8
NTX86
NTX64
NTIA64
SGI6
SUNSOL
SOL732
SOL64
SGI64
SGI65
SGI658
SOLX86
SOLX8664
HPPA11
HPUXIA64
MACOSX
LINUXPPC64
LINUX_ARM
X86_64
SX86_64
IA64
DIA64
SIA64

MODEL_NAME      CPU_FACTOR      ARCHITECTURE
Intel_EM64T          60.00      x15_6789_IntelRXeon
EricR86 commented 10 years ago

Original comment by Michael Hoffman (Bitbucket: hoffman, GitHub: michaelmhoffman).


This says mem is Available memory (Mbytes). But on your system it is actually Gbytes? So I can't get the information by parsing lsinfo.

The change in documentation happened around LSF 9. The Platform LSF Configuration Reference section on lsf.conf includes this for LSF_UNIT_FOR_LIMITS:

This parameter alters the meaning of all numeric values in lsb.resources to match the unit set, whether gpool, limits, hostexport, etc. It also controls the resource rusage attached to the job and the memory amount that defines the size of a package in GSLA.

There's nothing about resources in the LSF 7.0.6 version of that document.

Similarly, Administering Platform LSF 9.1.1's section on Load indices includes several new mentions of LSF_UNIT_FOR_LIMITS within guillemets, and change lines within the PDF version, which I presume is IBM's way of saying this section has changed.

I don't know whether the change was due to different behavior in LSF 9, or what is effectively a doc bug in LSF 7. How do you know that select[mem>2148 && tmp>48] asks for >2148 GB on your system?

EricR86 commented 10 years ago

Original comment by Jay Hesselberth (Bitbucket: jayhesselberth, GitHub: jayhesselberth).


On our system the select statement uses LSF_UNIT_FOR_LIMITS, which in our case is GB; I guess this is what the doc you cite refers to.

I don't know per se what it's requesting, but for everything else I use GB-sized requests (e.g., select[mem>8] identifies nodes with 8 GB available), and segway jobs won't run with the setting at 2148; changing to GB in cluster.lsf.py lets them run.


EricR86 commented 10 years ago

Original comment by Michael Hoffman (Bitbucket: hoffman, GitHub: michaelmhoffman).


Can you get me the result of bjobs -l for a stalled job due to requesting select[mem>2148]?

EricR86 commented 10 years ago

Original comment by Jay Hesselberth (Bitbucket: jayhesselberth, GitHub: jayhesselberth).


Here's the bjobs -l output for a test script whose job stays in PEND state:

#!bash

jhessel@amc-tesla ~/devel/segway
$ bjobs -l

Job <510660>, Job Name <test>, User <jhessel>, Project <default>, Status <PEND>
                     , Queue <normal>, Job Priority <50>, Command <#! /usr/bin/
                     env bash; #BSUB -J test;#BSUB -R "select[mem>2148] rusage[
                     mem=2148]"; sleep 100>
Fri Mar 28 14:36:05: Submitted from host <amc-tesla>, CWD <$HOME/devel/segway>,
                      Requested Resources <select[mem>2148] rusage[mem=2148]>;
 PENDING REASONS:
 Not specified in job submission: 3 hosts;
 Job's resource requirements not satisfied: 14 hosts;
 Closed by LSF administrator: 1 host;

 SCHEDULING PARAMETERS:
           r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
 loadSched   -    1.5   1.5    -       -     -    -     -     -      -      -  
 loadStop    -    3.0   3.0    -       -     -    -     -     -      -      -  

             root maxroot processes clockskew 
 loadSched     -       -         -         -  
 loadStop      -       -         -         -  
EricR86 commented 10 years ago

Original comment by Michael Hoffman (Bitbucket: hoffman, GitHub: michaelmhoffman).


The reference for bsub in LSF 7.0 (2007) says it should honor LSF_UNIT_FOR_LIMITS for -R. LSF 6.0 doesn't seem to have LSF_UNIT_FOR_LIMITS. I am going to assume I got this one wrong.

EricR86 commented 10 years ago

Original comment by Michael Hoffman (Bitbucket: hoffman, GitHub: michaelmhoffman).


I've checked in a fix to the issue2 branch, ac94de1674933292780ff5fd5251626a3bbc8328. If this fixes it for you, I will merge to default.

EricR86 commented 10 years ago

Original comment by Jay Hesselberth (Bitbucket: jayhesselberth, GitHub: jayhesselberth).


Fixed, LSF jobs run.

EricR86 commented 10 years ago

Original comment by Michael Hoffman (Bitbucket: hoffman, GitHub: michaelmhoffman).


close issue2 (issue #2)

→ <<cset d2932c2127a9ef8141ddd9a463e4079ccc6325fa>>