radical-cybertools / radical.pilot

RADICAL-Pilot
http://radical-cybertools.github.io/radical-pilot/index.html

RP does not terminate when the assertion alloc_nodes >= rm_info.requested_nodes fails #3085

AymenFJA closed this issue 10 months ago

AymenFJA commented 10 months ago

When the assertion in the traceback below fails, RP hangs indefinitely instead of terminating, and it keeps holding the allocated resources without doing any work.

Traceback (most recent call last):
  File "/cache/home/afa64/ve/facts_3.9/lib/python3.9/site-packages/radical/pilot/utils/component.py", line 250, in _work_loop
    self._initialize()
  File "/cache/home/afa64/ve/facts_3.9/lib/python3.9/site-packages/radical/pilot/utils/component.py", line 545, in _initialize
    self.initialize()
  File "/cache/home/afa64/ve/facts_3.9/lib/python3.9/site-packages/radical/pilot/agent/executing/popen.py", line 63, in initialize
    AgentExecutingComponent.initialize(self)
  File "/cache/home/afa64/ve/facts_3.9/lib/python3.9/site-packages/radical/pilot/agent/executing/base.py", line 77, in initialize
    self._rm = rpa.ResourceManager.create(rm_name,
  File "/cache/home/afa64/ve/facts_3.9/lib/python3.9/site-packages/radical/pilot/agent/resource_manager/base.py", line 400, in create
    return impl[name](cfg, rcfg, log, prof)
  File "/cache/home/afa64/ve/facts_3.9/lib/python3.9/site-packages/radical/pilot/agent/resource_manager/base.py", line 150, in __init__
    rm_info = self.init_from_scratch()
  File "/cache/home/afa64/ve/facts_3.9/lib/python3.9/site-packages/radical/pilot/agent/resource_manager/base.py", line 269, in init_from_scratch
    assert alloc_nodes                          >= rm_info.requested_nodes
AssertionError

In CI, for example, the entire test hangs until the timeout is reached, and only then can the logs be fetched to debug the error.
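
The following is a minimal sketch of one way a failed initialization can be reported to the parent instead of leaving it waiting. It is not RP's code: `Component`, `start_component`, and the `ready`/`failed` events are hypothetical names, and the only part taken from the traceback is the `alloc_nodes >= requested_nodes` check.

```python
import threading


class Component(threading.Thread):
    """Hypothetical stand-in for an agent component (not RP's class)."""

    def __init__(self, alloc_nodes, requested_nodes):
        super().__init__(daemon=True)
        self.alloc_nodes = alloc_nodes
        self.requested_nodes = requested_nodes
        self.ready = threading.Event()    # set when initialization succeeds
        self.failed = threading.Event()   # set when initialization raises
        self.error = None

    def initialize(self):
        # Same kind of check as the assertion shown in the traceback.
        assert self.alloc_nodes >= self.requested_nodes

    def run(self):
        try:
            self.initialize()
        except Exception as exc:
            # Record the failure so the parent can react to it,
            # instead of letting the thread die unnoticed.
            self.error = exc
            self.failed.set()
            return
        self.ready.set()


def start_component(alloc_nodes, requested_nodes, timeout=5.0):
    """Start the component; return True only if it initialized."""
    comp = Component(alloc_nodes, requested_nodes)
    comp.start()
    comp.join(timeout)  # in this sketch, run() ends right after initialization
    return comp.ready.is_set() and not comp.failed.is_set()
```

With this pattern the parent sees `False` for an undersized allocation and can shut down and release the resources right away, rather than waiting for an external timeout.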

mtitov commented 10 months ago

@AymenFJA please try the devel branch: the RM is now initialized in Agent_0 first, before any component starts.

P.S. The changes from fix/rm_init_2 were just pushed into devel.
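
The ordering described above can be sketched roughly as follows. This only illustrates the fail-fast idea and is not the code on the devel branch: `AllocationError`, `init_resource_manager`, and `run_agent` are hypothetical names, and the return codes are assumptions.

```python
class AllocationError(RuntimeError):
    """Hypothetical error for an allocation smaller than requested."""


def init_resource_manager(alloc_nodes, requested_nodes):
    # Explicit check instead of a bare assert, so the error carries a message
    # and still fires when Python runs with -O.
    if alloc_nodes < requested_nodes:
        raise AllocationError(
            f"allocated {alloc_nodes} nodes, requested {requested_nodes}"
        )
    return {"alloc_nodes": alloc_nodes, "requested_nodes": requested_nodes}


def run_agent(alloc_nodes, requested_nodes, component_factories):
    """Initialize the resource manager first, then start components.

    Returns (exit_code, started_components).
    """
    try:
        rm_info = init_resource_manager(alloc_nodes, requested_nodes)
    except AllocationError:
        # Nothing has been started yet, so there is nothing left to hang.
        return 1, []

    started = [factory(rm_info) for factory in component_factories]
    return 0, started
```

Because the check runs before any component exists, a bad allocation ends the agent with a non-zero exit code instead of leaving half-started components behind.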

AymenFJA commented 10 months ago

The PR is merged; closing.