radical-collaboration / extasy-grlsd

Repository to hold the input data and scripts for the ExTASY gromacs-lsdmap work
1 stars 1 forks source link

MongoClient opened before fork. #74

Closed euhruska closed 5 years ago

euhruska commented 6 years ago

what does this mean?

2018-08-22 00:46:14,974: radical.entk.wfprocessor.0000: wfprocessor                     : enqueue-thread : DEBUG   : stage.0000 with state SCHEDULED synced with AppManager
2018-08-22 00:46:14,974: radical.entk.wfprocessor.0000: wfprocessor                     : enqueue-thread : INFO    : Transition of stage.0000 to new state SCHEDULED successful
2018-08-22 00:46:14,974: radical.entk.wfprocessor.0000: wfprocessor                     : enqueue-thread : INFO    : Enqueue thread terminated
2018-08-22 00:46:15,253: radical.entk.wfprocessor.0000: wfprocessor                     : MainThread     : INFO    : Terminating dequeue-thread                                                                                              2018-08-22 00:46:15,366: radical.entk.wfprocessor.0000: wfprocessor                     : dequeue-thread : INFO    : Terminated dequeue thread
2018-08-22 00:46:15,660: radical.entk.wfprocessor.0000: MainProcess                     : MainThread     : DEBUG   : WFprocessor process terminated
2018-08-22 00:46:15,662: radical.entk.appmanager.0000: MainProcess                     : MainThread     : INFO    : Terminating synchronizer thread
2018-08-22 00:46:15,995: radical.entk.appmanager.0000: MainProcess                     : MainThread     : INFO    : Synchronizer thread terminated
2018-08-22 00:46:15,995: radical.entk.appmanager.0000: MainProcess                     : MainThread     : INFO    : Terminating task manager process
/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/pymongo/topology.py:149: UserWarning: MongoClient opened before fork. Create MongoClient only after forking. See PyMongo's documentation for details: http://api.mongodb.org/python/current/faq.html#is-pymongo-fork-safe
  "MongoClient opened before fork. Create MongoClient only "
2018-08-22 00:46:35,942: radical.entk.task_manager.0000: MainProcess                     : MainThread     : INFO    : Task manager process closed
2018-08-22 00:46:44,162: radical.entk.task_manager.0000: MainProcess                     : MainThread     : INFO    : Hearbeat thread terminated
vivek-bala commented 6 years ago

This seems to be non-fatal. My guess is a MongoClient object is being passed between processes in RP. @andre-merzky Do you want a RP ticket for this?

vivek-bala commented 6 years ago

Thanks for pointing this out Eugen!

euhruska commented 6 years ago

in this case fatal?

2018-08-22 15:31:06,561: radical.entk.appmanager.0000: MainProcess                     : MainThread     : INFO    : Restarting heartbeat thread
2018-08-22 15:31:06,561: radical.entk.task_manager.0000: MainProcess                     : MainThread     : INFO    : Starting heartbeat thread
2018-08-22 15:31:06,665: radical.entk.task_manager.0000: task-manager                    : MainThread     : INFO    : Task Manager process started
2018-08-22 15:31:07,256: radical.entk.task_manager.0000: MainProcess                     : heartbeat      : INFO    : Sent heartbeat request
/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/pymongo/topology.py:149: UserWarning: MongoClient opened before fork. Create MongoClient only after forking. See PyMongo's documentation for details: http://api.mongodb.org/python/current/faq.html#is-pymongo-fork-safe
  "MongoClient opened before fork. Create MongoClient only "
Process task-manager:
Traceback (most recent call last):
  File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/multiprocessing/process.py", line 267, in _bootstrap
    self.run()
  File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/radical/entk/execman/rp/task_manager.py", line 271, in _tmgr
    raise EnTKError(text=ex)
TypeError: __init__() got an unexpected keyword argument 'text'
Traceback (most recent call last):
  File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/radical/entk/execman/rp/task_manager.py", line 177, in _tmgr
    umgr = rp.UnitManager(session=rmgr._session)
  File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/radical/pilot/unit_manager.py", line 148, in __init__
    self._session._register_umgr(self)
  File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/radical/pilot/session.py", line 781, in _register_umgr
    self._dbs.insert_umgr(umgr.as_dict())
  File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/radical/pilot/db/database.py", line 354, in insert_umgr
    result = self._c.insert(umgr_doc)
  File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/pymongo/collection.py", line 3161, in insert
    check_keys, manipulate, write_concern)
  File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/pymongo/collection.py", line 607, in _insert
    bypass_doc_val, session)
  File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/pymongo/collection.py", line 595, in _insert_one
    acknowledged, _insert_command, session)
  File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/pymongo/mongo_client.py", line 1243, in _retryable_write
    return self._retry_with_session(retryable, func, s, None)
  File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/pymongo/mongo_client.py", line 1196, in _retry_with_session
    return func(session, sock_info, retryable)
  File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/pymongo/collection.py", line 592, in _insert_command
    _check_write_command_response(result)
  File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/pymongo/helpers.py", line 217, in _check_write_command_response
    _raise_last_write_error(write_errors)
  File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/pymongo/helpers.py", line 198, in _raise_last_write_error
    raise DuplicateKeyError(error.get("errmsg"), 11000, error)
DuplicateKeyError: E11000 duplicate key error index: clementigroup.rp.session.leonardo.rice.edu.eh22.017765.0068.$_id_ dup key: { : "umgr.0000" }

2018-08-22 15:31:37,669: radical.entk.appmanager.0000: MainProcess                     : MainThread     : INFO    : Terminating heartbeat thread
2018-08-22 15:31:37,669: radical.entk.task_manager.0000: MainProcess                     : MainThread     : INFO    : Hearbeat thread terminated
2018-08-22 15:31:38,949: radical.entk.appmanager.0000: MainProcess                     : MainThread     : INFO    : Terminating tmgr process
2018-08-22 15:31:38,949: radical.entk.task_manager.0000: MainProcess                     : MainThread     : INFO    : Task manager process closed
andre-merzky commented 6 years ago

The last error is fatal - it indicates that an RP session ID has been used twice, and RP gets confused by existing DB entries.

I am uncertain about the first error (Mongo before fork) - we never observed this before, and we do not use the MongoDB handle across forked processes AFAICT. Is that problem reproducible, and do you happen to have the full stack trace for that one? Thanks!

euhruska commented 6 years ago

Mongo before fork has happened before, but I can't reproduce it. It usually doesn't happen at the begin of the run, but somewhere in the middle

vivek-bala commented 6 years ago

Can you attach the full verbose EnTK logs from stdout and the rp logs as well?

euhruska commented 6 years ago

rp.session.leonardo.rice.edu.eh22.017765.0068.zip entk-ntl9-2-error.log

I don't have the rp log

vivek-bala commented 6 years ago

I think your connection to the system might be unstable.

2018-08-22 16:40:19,051: pmgr.0000           : MainProcess                     : pmgr.0000.idler._state_pull_cb: DEBUG   : pilot.0000 calls cb <bound method ComputePilot._default_state_cb of ['pilot.0000', 'ncsa.bw_aprun', 'CANCELED']>
2018-08-22 16:40:19,051: pmgr.0000           : MainProcess                     : pmgr.0000.idler._state_pull_cb: INFO    : [Callback]: pilot pilot.0000 state: CANCELED.
2018-08-22 16:40:19,051: pmgr.0000           : MainProcess                     : pmgr.0000.idler._state_pull_cb: DEBUG   : pmgr calls cb pilot.0000 for <function _pilot_state_cb at 0x7f838abbf7d0>
2018-08-22 16:40:19,051: pmgr.0000           : MainProcess                     : pmgr.0000.idler._state_pull_cb: ERROR   : abort: IOError(32, 'Broken pipe')
Traceback (most recent call last):
  File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/radical/utils/threads.py", line 375, in _run
    if not self.work_cb():
  File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/radical/pilot/utils/component.py", line 954, in work_cb
    ret = self._cb()
  File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/radical/pilot/pilot_manager.py", line 264, in _state_pull_cb
    if not self._update_pilot(pilot_dict, publish=True):
  File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/radical/pilot/pilot_manager.py", line 337, in _update_pilot
    self._pilots[pid]._update(pilot_dict)
  File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/radical/pilot/compute_pilot.py", line 188, in _update
    self._pmgr._call_pilot_callbacks(self, self.state)
  File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/radical/pilot/pilot_manager.py", line 364, in _call_pilot_callbacks
    else      : cb(pilot_obj, state)
  File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/radical/entk/execman/rp/resource_manager.py", line 111, in _pilot_state_cb
    self._logger.info('Pilot %s state: %s' % (pilot.uid, state))
  File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/logging/__init__.py", line 1167, in info
    self._log(INFO, msg, args, **kwargs)
  File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/logging/__init__.py", line 1286, in _log
    self.handle(record)
  File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/logging/__init__.py", line 1296, in handle
    self.callHandlers(record)
  File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/logging/__init__.py", line 1336, in callHandlers
    hdlr.handle(record)
  File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/logging/__init__.py", line 759, in handle
    self.emit(record)
  File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/radical/utils/logger.py", line 159, in emit
    self.stream.write(self.format(record) + self._term)
IOError: [Errno 32] Broken pipe

Hey @andre-merzky, can you remind me what the state of the pilot is when you run out of walltime? FAILED or CANCELED?

andre-merzky commented 6 years ago

@vivek-bala: the pilot goes to FAILED when it gets killed by the batch system at end of walltime.

vivek-bala commented 6 years ago

Thanks Andre! We can eliminate low walltime in that case. Eugen, this does look like connectivity issue as far as I can see.

euhruska commented 6 years ago

weird, that server is connected by LAN, so the internet is always there. Definitely no wifi connection issue. Maybe some software connection issues with rabbitmq?

vivek-bala commented 6 years ago

I don't think so, the error is not at the pika layer (the module that interacts with the rabbitmq server). Its at the radical utils layer so its likely between the client and the HPC system or within the client itself.

@andre-merzky I noticed the issue is from the logger in utils, I haven't seen this before. I have noticed the broken pipe error from the pty layer. Do you have any idea what this might be?