lava-nc / lava

A Software Framework for Neuromorphic Computing
https://lava-nc.org

set_partition function in lava/utils/slurm.py does not set the partition if the first board listed by sinfo has status down #754

Open furlong-cmu opened 11 months ago

furlong-cmu commented 11 months ago

Describe the bug

When a partition is passed to the use_slurm_host function and that partition contains more than one board, the set_partition function (line 72 of lava/utils/slurm.py) raises a ValueError saying that the partition is not found or is down whenever the first board(s) returned by sinfo have status down.
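The suspected failure mode can be illustrated with a small sketch. It does not reproduce the code in slurm.py: the row layout, the sample values, the board names, and the helper first_row_for are all assumptions made for illustration. It only shows how a lookup that keeps the first sinfo row for a partition would report "down" even though another board in the same partition is idle.

```python
from typing import List, NamedTuple, Optional


class Row(NamedTuple):
    """One hypothetical sinfo row: a partition, a node state, and the boards in it."""
    partition: str
    state: str
    nodelist: str


# Made-up rows: the same partition appears twice, once per node state.
rows: List[Row] = [
    Row("oheogluch", "down", "board-01"),
    Row("oheogluch", "idle", "board-02"),
]


def first_row_for(rows: List[Row], partition: str) -> Optional[Row]:
    """Return only the first row that matches the partition (hypothetical helper)."""
    for row in rows:
        if row.partition == partition:
            return row
    return None


info = first_row_for(rows, "oheogluch")
# Same shape of check as the one shown in the traceback for set_partition.
rejected = info is None or "down" in info.state
print(rejected)
```

With these sample rows the check rejects the partition, because the idle board in the second row is never inspected.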

To reproduce current behavior

After applying my own fix for the bug in https://github.com/lava-nc/lava/issues/753, run the following code:

from lava.utils import loihi

loihi.use_slurm_host(partition='partition-name', loihi_gen=loihi.ChipGeneration.N3B3)

I get the error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[3], line 3
      1 from lava.utils import loihi
----> 3 loihi.use_slurm_host(partition='oheogluch', loihi_gen=loihi.ChipGeneration.N3B3)
      4 use_loihi2 = loihi.is_installed()
      6 # if use_loihi2:

File ~/lava_env/lib/python3.8/site-packages/lava/utils/loihi.py:57, in use_slurm_host(partition, board, loihi_gen)
     54 os.environ["LOIHI_GEN"] = loihi_gen.value
     56 slurm.set_board(board, partition)
---> 57 slurm.set_partition(partition)
     59 global host
     60 host = "SLURM"

File ~/lava_env/lib/python3.8/site-packages/lava/utils/slurm.py:89, in set_partition(partition)
     87 print(partition_info)
     88 if partition_info is None or "down" in partition_info.state:
---> 89     raise ValueError(
     90         f"Attempting to use SLURM for Loihi but partition {partition} "
     91         f"is not found or is down. Run sinfo to check available "
     92         f"partitions.")
     94 os.environ["PARTITION"] = partition

ValueError: Attempting to use SLURM for Loihi but partition oheogluch is not found or is down. Run sinfo to check available partitions.

Expected behavior

The expected behaviour is to update the os.environ['PARTITION'] variable to reflect the selected partition.

Additional Context

I temporarily fixed this by changing line 88 of lava/utils/slurm.py to ignore the "down" partition state. Judging by the output of sinfo, the error seems to occur when the first board listed for the partition has status "down", even though other boards have status idle.
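A less drastic alternative to ignoring the "down" state altogether might be to accept the partition as long as at least one of its boards is not down. The following is a minimal sketch of that idea, not the actual lava implementation. It assumes sinfo's default column layout (PARTITION AVAIL TIMELIMIT NODES STATE NODELIST); the functions parse_sinfo and partition_is_usable are hypothetical names, and the sample output and board names are made up.

```python
from typing import Dict, List

# Made-up sinfo output in the default column layout (an assumption).
SAMPLE_SINFO = """\
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
oheogluch up infinite 1 down board-01
oheogluch up infinite 2 idle board-[02-03]
other up infinite 1 down* board-04
"""


def parse_sinfo(text: str) -> List[Dict[str, str]]:
    """Turn whitespace-separated sinfo output into one dict per row."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].lower().split()
    return [dict(zip(header, ln.split())) for ln in lines[1:]]


def partition_is_usable(rows: List[Dict[str, str]], partition: str) -> bool:
    """True if any row of the partition has a state that is not 'down'."""
    # sinfo marks the default partition with a trailing '*', so strip it.
    states = [r["state"] for r in rows
              if r["partition"].rstrip("*") == partition]
    # any() over an empty list is False, so unknown partitions are rejected.
    return any("down" not in state for state in states)


rows = parse_sinfo(SAMPLE_SINFO)
print(partition_is_usable(rows, "oheogluch"))
print(partition_is_usable(rows, "other"))
```

In this sketch a partition is rejected only when it is missing or every one of its rows is down. Other states that make a board unusable, such as drained nodes, are not handled here and would need their own check.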

Is there possibly a symmetric problem when setting boards (set_board)?