radical-collaboration / extasy-grlsd

Repository to hold the input data and scripts for the ExTASY gromacs-lsdmap work

persistent duplicate error #88

Closed euhruska closed 5 years ago

euhruska commented 5 years ago

This error kept happening for a while, until I restarted my whole tmux environment; then it went away. What causes it? @vivek-bala

/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/pymongo/topology.py:149: UserWarning: MongoClient opened before fork. Create MongoClient only after forking. See PyMongo's documentation for details: http://api.mongodb.org/python/current/faq.html#is-pymongo-fork-safe
  "MongoClient opened before fork. Create MongoClient only "
Process task-manager:
Traceback (most recent call last):
  File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/multiprocessing/process.py", line 267, in _bootstrap
    self.run()
  File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/radical/entk/execman/rp/task_manager.py", line 271, in _tmgr
    raise EnTKError(text=ex)
TypeError: __init__() got an unexpected keyword argument 'text'
Traceback (most recent call last):
  File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/radical/entk/execman/rp/task_manager.py", line 177, in _tmgr
    umgr = rp.UnitManager(session=rmgr._session)
  File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/radical/pilot/unit_manager.py", line 148, in __init__
    self._session._register_umgr(self)
  File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/radical/pilot/session.py", line 781, in _register_umgr
    self._dbs.insert_umgr(umgr.as_dict())
  File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/radical/pilot/db/database.py", line 354, in insert_umgr
    result = self._c.insert(umgr_doc)
  File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/pymongo/collection.py", line 3161, in insert
    check_keys, manipulate, write_concern)
  File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/pymongo/collection.py", line 607, in _insert
    bypass_doc_val, session)
  File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/pymongo/collection.py", line 595, in _insert_one
    acknowledged, _insert_command, session)
  File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/pymongo/mongo_client.py", line 1243, in _retryable_write
    return self._retry_with_session(retryable, func, s, None)
  File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/pymongo/mongo_client.py", line 1196, in _retry_with_session
    return func(session, sock_info, retryable)
  File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/pymongo/collection.py", line 592, in _insert_command
    _check_write_command_response(result)
  File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/pymongo/helpers.py", line 217, in _check_write_command_response
    _raise_last_write_error(write_errors)
  File "/scratch1/eh22/conda/envs/extasy11/lib/python2.7/site-packages/pymongo/helpers.py", line 198, in _raise_last_write_error
    raise DuplicateKeyError(error.get("errmsg"), 11000, error)
DuplicateKeyError: E11000 duplicate key error index: clementigroup.rp.session.leonardo.rice.edu.eh22.017788.0044.$_id_ dup key: { : "umgr.0000" }
vivek-bala commented 5 years ago

It looks like there might be some stray processes from earlier runs which did not terminate correctly. Did you have any such runs? Killing the old tmux session probably cleaned up those processes.

euhruska commented 5 years ago

I try to kill earlier processes with Ctrl+C, and if that's not fast enough, with Ctrl+Z. Unfortunately, when I check with ps I see dozens of zombie processes. Killing all my processes always fixes it.

andre-merzky commented 5 years ago

Issuing a Ctrl-C tells RP to terminate all processes and also to connect to all pilots to shut them down. The latter in particular can be time consuming, and it might take a while until the whole stack is torn down. If you press Ctrl-C again, or suspend the process and kill it (Ctrl-Z; kill %1), then the teardown procedure itself is interrupted, and you are likely to end up with stale processes. Worse, you might end up with stale and unusable pilots wasting your allocation...
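One way to make the wait visible instead of reaching for Ctrl-Z is to poll for remaining processes after a single Ctrl-C. A hypothetical helper (the 'radical' pattern is an assumption about what the stale processes are called on your system):

```shell
# Poll until no processes of the current user match the given pattern.
# Intended use after ONE Ctrl-C:  wait_for_teardown 'radical'
wait_for_teardown() {
    pattern="$1"
    while pgrep -u "$(id -un)" -f "$pattern" > /dev/null; do
        echo "teardown still running, waiting..."
        sleep 10
    done
    echo "no matching processes left"
}
```

Only once the loop exits is it reasonably safe to assume the stack has finished tearing itself down.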

euhruska commented 5 years ago

Is there a way to have the current EnTK log show how all processes are shut down? I would be less inclined to press Ctrl-Z if I could see that something is happening (the shutdown).

vivek-bala commented 5 years ago

I am not sure I understand the 'how' part. Do share that; we are open to feedback on usability. Maybe we can print a message such as "Clean termination may take up to X mins, please wait". Is that what you are suggesting?

Let me mention that EnTK, by design, does not know the architecture or implementation of RP, and we do not inspect its internal components. We interact through the exposed API, which tells us whether an RP session is closed or active. We do not know what percentage of the termination event is complete, so it is hard (maybe even overkill) to say "X% of termination done". But I believe that if you enable the RP logs (RADICAL_PILOT_VERBOSE=DEBUG), you will see that information (written to a file, not stdout).
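For reference, enabling those debug logs is a one-line environment setting before launching the run (the workflow script name below is hypothetical):

```shell
# Verbose RADICAL-Pilot logging; the output goes to per-component log
# files in the session directory, not to stdout.
export RADICAL_PILOT_VERBOSE=DEBUG
echo "$RADICAL_PILOT_VERBOSE"    # prints: DEBUG
# then launch the workflow as usual, e.g.:
#   python extasy_workflow.py    # hypothetical script name
```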

Does that help?

euhruska commented 5 years ago

I tried waiting for Ctrl+C to shut down everything properly, but it's been 30 minutes and it still hasn't finished.

andre-merzky commented 5 years ago

Uh oh, that's way too long... :/ Sorry, then I need to ask for some more details:

- What OS are you on?
- What's the RCT stack you are running right now?
- Can you create and pack a set of logfiles for this run? (Please grab them before you issue a kill command to terminate the processes manually.)

Thanks!

euhruska commented 5 years ago

@vivek-bala yes, messages like "starting termination step 1", "step 1 finished", "waiting for step 2", ..., "all termination steps finished" would be useful.

@andre-merzky OS: Red Hat Enterprise Linux 7

radical-stack output:

  python               : 2.7.14
  pythonpath           :
  virtualenv           : extasy11

  radical.analytics    : v0.45.2-102-gaec2e1d@devel
  radical.entk         : 0.7.4-0.7.4@HEAD-detached-at-af273638
  radical.pilot        : 0.50.8-v0.50.8-6-g6e36785f@devel
  radical.utils        : 0.50.1-v0.50.1-1-gce60836@devel
  saga                 : 0.50.0-v0.50.0@devel

Logs: re.session.leonardo.rice.edu.eh22.017795.0016.zip, rp.session.leonardo.rice.edu.eh22.017795.0018.zip, log_extasy_tica3_villin_small.log

andre-merzky commented 5 years ago

Thanks!

@vivek-bala, can you please have a look at the errors in log_extasy_tica3_villin_small.log? Not sure if they interfere with RP's termination sequence.

@euhruska: It seems like the pilot termination gets interrupted, but I am not sure why, yet. Could you please also attach the bootstrap* files from the pilot sandbox? Thanks!

euhruska commented 5 years ago

My main issue is that this persistent duplicate error always happens for the villin system, so I can't run that system at all. I tried cleaning up the MongoDB and restarting everything, but it didn't help. Some other systems, like BBA, work at the same time. Maybe it's because of the interrupted pilot terminations. Here are the bootstrap files: bootstrap-files.zip

euhruska commented 5 years ago

Update: since I moved from 'two.radical-project.org' to my own RabbitMQ instance, this issue hasn't happened again.
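For anyone hitting the same problem: running a private RabbitMQ is straightforward with Docker. This is a deployment sketch only; the container name is arbitrary, and how the hostname/port reach EnTK depends on your workflow script:

```shell
# Run a private RabbitMQ on the default AMQP port (deployment sketch).
docker run -d --name entk-rabbitmq -p 5672:5672 rabbitmq:3
# Then point the EnTK AppManager at it instead of two.radical-project.org,
# e.g. AppManager(hostname='localhost', port=5672) in the workflow script.
```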

vivek-bala commented 5 years ago

Thanks. I think the issue might be similar to #93; please see my comment there. The RabbitMQ change makes sense too, since it minimizes communication delays.