radical-cybertools / radical.repex.at

This is the GitHub location for RepEx, developed by the RADICAL team in conjunction with the York Lab.

TUU use case: 1728-replica run fails on Stampede #35

Closed antonst closed 8 years ago

antonst commented 9 years ago

The run fails with:

Traceback (most recent call last):
  File "launch_simulation_pattern_b_3d_tuu.py", line 53, in <module>
    pilot_kernel.run_simulation( replicas, pilot_object, session, md_kernel )
  File "/home/treikalis/opt333/local/lib/python2.7/site-packages/RepEx-0.2_feature_tuu_opt3_5f582e7_-py2.7.egg/pilot_kernels/pilot_kernel_pattern_b_multi_d.py", line 210, in run_simulation
    unit_manager.wait_units()
  File "/home/treikalis/opt333/local/lib/python2.7/site-packages/radical/pilot/unit_manager.py", line 649, in wait_units
    units  = self.get_units (unit_ids)
  File "/home/treikalis/opt333/local/lib/python2.7/site-packages/radical/pilot/unit_manager.py", line 590, in get_units
    units = ComputeUnit._get(unit_ids=unit_ids, unit_manager_obj=self)
  File "/home/treikalis/opt333/local/lib/python2.7/site-packages/radical/pilot/compute_unit.py", line 133, in _get
    unit_ids=unit_ids
  File "/home/treikalis/opt333/local/lib/python2.7/site-packages/radical/pilot/db/database.py", line 497, in get_compute_units
    for obj in cursor:
  File "/home/treikalis/opt333/local/lib/python2.7/site-packages/pymongo/cursor.py", line 1076, in next
    if len(self.__data) or self._refresh():
  File "/home/treikalis/opt333/local/lib/python2.7/site-packages/pymongo/cursor.py", line 1037, in _refresh
    limit, self.__id))
  File "/home/treikalis/opt333/local/lib/python2.7/site-packages/pymongo/cursor.py", line 958, in __send_message
    self.__compile_re)
  File "/home/treikalis/opt333/local/lib/python2.7/site-packages/pymongo/helpers.py", line 101, in _unpack_response
    cursor_id)
pymongo.errors.CursorNotFound: cursor id '259254515946571' not valid at server

This is actually related to RP issue #90

andre-merzky commented 9 years ago

Antons mentioned that this happened after a rather long queuing time (14 hrs), so it might be another manifestation of the MongoDB reconnect error:

https://github.com/radical-cybertools/radical.pilot/issues/442
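
For illustration only, here is a minimal sketch of one generic way client code can tolerate a lost server-side cursor: re-issue the query and re-read the results from scratch. It is not the fix applied in RADICAL-Pilot. The names `fetch_with_retry`, `make_flaky_query` and `FakeCursorNotFound` are hypothetical; `FakeCursorNotFound` stands in for `pymongo.errors.CursorNotFound` so that the sketch runs without a MongoDB server.

```python
import time


class FakeCursorNotFound(Exception):
    """Hypothetical stand-in for pymongo.errors.CursorNotFound."""


def fetch_with_retry(run_query, transient_errors, max_attempts=3, delay_s=0.0):
    """Run the query and read all results, re-issuing it if the cursor is lost.

    run_query        -- callable returning a fresh iterable (e.g. a new cursor)
    transient_errors -- tuple of exception classes that trigger a retry
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            # Materialize inside the try block: the error surfaces while
            # iterating the cursor, not when the query is created.
            return list(run_query())
        except transient_errors as exc:
            last_error = exc
            if attempt < max_attempts:
                time.sleep(delay_s)
    raise last_error


def make_flaky_query(failures):
    """Build a fake query whose cursor dies mid-iteration `failures` times."""
    state = {"calls": 0}

    def run_query():
        state["calls"] += 1
        fail = state["calls"] <= failures

        def cursor():
            yield {"_id": "unit.0000"}
            if fail:
                raise FakeCursorNotFound("cursor id not valid at server")
            yield {"_id": "unit.0001"}

        return cursor()

    return run_query, state


if __name__ == "__main__":
    run_query, state = make_flaky_query(failures=2)
    units = fetch_with_retry(run_query, (FakeCursorNotFound,), max_attempts=3)
    print([u["_id"] for u in units], "after", state["calls"], "attempts")
```

With a real deployment one would presumably pass `(pymongo.errors.CursorNotFound,)` as `transient_errors`. A retry like this only hides the symptom. If the connection itself is stale after a long wait in the queue, the reconnect problem still has to be addressed.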

antonst commented 8 years ago

Closing, since the corresponding RP ticket is closed.