This seems to be non-fatal. My guess is that a MongoClient object is being passed between processes in RP. @andre-merzky Do you want an RP ticket for this?
Thanks for pointing this out, Eugen!
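For context, PyMongo's warning asks for the client to be created after the fork rather than inherited from the parent. A minimal sketch of that pattern (hypothetical, not RP's actual code; the connection URL is a placeholder):

import multiprocessing
from pymongo import MongoClient

def child():
    # The client is created *after* the fork, so it owns its own sockets
    # and monitoring threads instead of inheriting the parent's.
    client = MongoClient('mongodb://localhost:27017/')   # placeholder URL
    client.admin.command('ping')
    client.close()

if __name__ == '__main__':
    p = multiprocessing.Process(target=child)
    p.start()
    p.join()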
Is it fatal in this case?
2018-08-22 15:31:06,561: radical.entk.appmanager.0000: MainProcess : MainThread : INFO : Restarting heartbeat thread
2018-08-22 15:31:06,561: radical.entk.task_manager.0000: MainProcess : MainThread : INFO : Starting heartbeat thread
2018-08-22 15:31:06,665: radical.entk.task_manager.0000: task-manager : MainThread : INFO : Task Manager process started
2018-08-22 15:31:07,256: radical.entk.task_manager.0000: MainProcess : heartbeat : INFO : Sent heartbeat request
/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/pymongo/topology.py:149: UserWarning: MongoClient opened before fork. Create MongoClient only after forking. See PyMongo's documentation for details: http://api.mongodb.org/python/current/faq.html#is-pymongo-fork-safe
"MongoClient opened before fork. Create MongoClient only "
Process task-manager:
Traceback (most recent call last):
File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/multiprocessing/process.py", line 267, in _bootstrap
self.run()
File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/radical/entk/execman/rp/task_manager.py", line 271, in _tmgr
raise EnTKError(text=ex)
TypeError: __init__() got an unexpected keyword argument 'text'
Traceback (most recent call last):
File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/radical/entk/execman/rp/task_manager.py", line 177, in _tmgr
umgr = rp.UnitManager(session=rmgr._session)
File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/radical/pilot/unit_manager.py", line 148, in __init__
self._session._register_umgr(self)
File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/radical/pilot/session.py", line 781, in _register_umgr
self._dbs.insert_umgr(umgr.as_dict())
File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/radical/pilot/db/database.py", line 354, in insert_umgr
result = self._c.insert(umgr_doc)
File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/pymongo/collection.py", line 3161, in insert
check_keys, manipulate, write_concern)
File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/pymongo/collection.py", line 607, in _insert
bypass_doc_val, session)
File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/pymongo/collection.py", line 595, in _insert_one
acknowledged, _insert_command, session)
File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/pymongo/mongo_client.py", line 1243, in _retryable_write
return self._retry_with_session(retryable, func, s, None)
File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/pymongo/mongo_client.py", line 1196, in _retry_with_session
return func(session, sock_info, retryable)
File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/pymongo/collection.py", line 592, in _insert_command
_check_write_command_response(result)
File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/pymongo/helpers.py", line 217, in _check_write_command_response
_raise_last_write_error(write_errors)
File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/pymongo/helpers.py", line 198, in _raise_last_write_error
raise DuplicateKeyError(error.get("errmsg"), 11000, error)
DuplicateKeyError: E11000 duplicate key error index: clementigroup.rp.session.leonardo.rice.edu.eh22.017765.0068.$_id_ dup key: { : "umgr.0000" }
2018-08-22 15:31:37,669: radical.entk.appmanager.0000: MainProcess : MainThread : INFO : Terminating heartbeat thread
2018-08-22 15:31:37,669: radical.entk.task_manager.0000: MainProcess : MainThread : INFO : Hearbeat thread terminated
2018-08-22 15:31:38,949: radical.entk.appmanager.0000: MainProcess : MainThread : INFO : Terminating tmgr process
2018-08-22 15:31:38,949: radical.entk.task_manager.0000: MainProcess : MainThread : INFO : Task manager process closed
The last error is fatal - it indicates that an RP session ID has been used twice, and RP gets confused by existing DB entries.
I am uncertain about the first error (Mongo before fork) - we never observed this before, and we do not use the MongoDB handle across forked processes AFAICT. Is that problem reproducible, and do you happen to have the full stack trace for that one? Thanks!
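For reference, the E11000 above is what MongoDB returns when a document with an already existing _id is inserted again, which is what happens when a session/manager ID is reused. A minimal sketch (the URL, database, collection, and ID names are only illustrative, not the actual RP schema):

from pymongo import MongoClient
from pymongo.errors import DuplicateKeyError

client = MongoClient('mongodb://localhost:27017/')    # placeholder URL
coll = client['demo_db']['demo_session']              # illustrative names

coll.insert_one({'_id': 'umgr.0000'})                 # first insert succeeds
try:
    coll.insert_one({'_id': 'umgr.0000'})             # same _id again
except DuplicateKeyError as err:
    print('E11000 duplicate key:', err)               # what the traceback shows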
Mongo before fork
has happened before, but I can't reproduce it. It usually doesn't happen at the beginning of the run, but somewhere in the middle.
Can you attach the full verbose EnTK logs from stdout and the rp logs as well?
rp.session.leonardo.rice.edu.eh22.017765.0068.zip entk-ntl9-2-error.log
I don't have the RP log.
I think your connection to the system might be unstable.
2018-08-22 16:40:19,051: pmgr.0000 : MainProcess : pmgr.0000.idler._state_pull_cb: DEBUG : pilot.0000 calls cb <bound method ComputePilot._default_state_cb of ['pilot.0000', 'ncsa.bw_aprun', 'CANCELED']>
2018-08-22 16:40:19,051: pmgr.0000 : MainProcess : pmgr.0000.idler._state_pull_cb: INFO : [Callback]: pilot pilot.0000 state: CANCELED.
2018-08-22 16:40:19,051: pmgr.0000 : MainProcess : pmgr.0000.idler._state_pull_cb: DEBUG : pmgr calls cb pilot.0000 for <function _pilot_state_cb at 0x7f838abbf7d0>
2018-08-22 16:40:19,051: pmgr.0000 : MainProcess : pmgr.0000.idler._state_pull_cb: ERROR : abort: IOError(32, 'Broken pipe')
Traceback (most recent call last):
File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/radical/utils/threads.py", line 375, in _run
if not self.work_cb():
File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/radical/pilot/utils/component.py", line 954, in work_cb
ret = self._cb()
File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/radical/pilot/pilot_manager.py", line 264, in _state_pull_cb
if not self._update_pilot(pilot_dict, publish=True):
File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/radical/pilot/pilot_manager.py", line 337, in _update_pilot
self._pilots[pid]._update(pilot_dict)
File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/radical/pilot/compute_pilot.py", line 188, in _update
self._pmgr._call_pilot_callbacks(self, self.state)
File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/radical/pilot/pilot_manager.py", line 364, in _call_pilot_callbacks
else : cb(pilot_obj, state)
File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/radical/entk/execman/rp/resource_manager.py", line 111, in _pilot_state_cb
self._logger.info('Pilot %s state: %s' % (pilot.uid, state))
File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/logging/__init__.py", line 1167, in info
self._log(INFO, msg, args, **kwargs)
File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/logging/__init__.py", line 1286, in _log
self.handle(record)
File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/logging/__init__.py", line 1296, in handle
self.callHandlers(record)
File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/logging/__init__.py", line 1336, in callHandlers
hdlr.handle(record)
File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/logging/__init__.py", line 759, in handle
self.emit(record)
File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/radical/utils/logger.py", line 159, in emit
self.stream.write(self.format(record) + self._term)
IOError: [Errno 32] Broken pipe
Hey @andre-merzky, can you remind me what the state of the pilot is when you run out of walltime? FAILED or CANCELED?
@vivek-bala: the pilot goes to FAILED when it gets killed by the batch system at the end of walltime.
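To make that distinction concrete, here is a sketch of a pilot state callback separating the two terminal states. The callback signature cb(pilot, state) is taken from the traceback above; the registration call name is an assumption and may differ between RP versions.

import radical.pilot as rp

def pilot_state_cb(pilot, state):
    # invoked by the PilotManager on every pilot state change
    if state == rp.FAILED:
        # e.g. the pilot was killed by the batch system at the end of walltime
        print('%s reached FAILED' % pilot.uid)
    elif state == rp.CANCELED:
        # the pilot was canceled on the client side
        print('%s reached CANCELED' % pilot.uid)

# registration is assumed to look roughly like:
# pmgr.register_callback(pilot_state_cb)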
Thanks Andre! We can rule out low walltime in that case. Eugen, this does look like a connectivity issue as far as I can see.
Weird, that server is connected via LAN, so the network connection is always there. It is definitely not a WiFi issue. Maybe some software-level connection issue with RabbitMQ?
I don't think so, the error is not at the pika layer (the module that interacts with the RabbitMQ server). It's at the radical.utils layer, so it's likely between the client and the HPC system, or within the client itself.
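For what it's worth, the IOError(32, 'Broken pipe') in the traceback is the generic "writer keeps writing after the reader went away" failure. A minimal, self-contained sketch of the mechanism (unrelated to the actual RP/EnTK code paths):

import os

r, w = os.pipe()
os.close(r)                           # the reading side disappears
try:
    os.write(w, b'one more log line\n')
except OSError as err:                # IOError(32, 'Broken pipe') on Python 2
    print('errno %d: %s' % (err.errno, err.strerror))
os.close(w)

Any log handler whose output stream is a pipe to a process that has already exited will hit exactly this on its next write.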
@andre-merzky I noticed the issue comes from the logger in radical.utils, which I haven't seen before. I have seen the broken pipe error from the PTY layer. Do you have any idea what this might be?
What does this mean?