radical-cybertools / radical.repex.at

This is the github location for RepEx developed by the RADICAL team in conjunction with the York Lab.

job crashed but still appears as "R" when qstat #7

Closed haoyuanchen closed 10 years ago

haoyuanchen commented 10 years ago

Hi,

I'm trying to run the code on Trestles (I followed the installation instructions step by step) using the asynchronous scheme 3. After one cycle it exits with this error:

Traceback (most recent call last):
  File "launch_simulation_scheme_3_amber.py", line 63, in <module>
    pilot_kernel.run_simulation( replicas, pilot_object, session, md_kernel )
  File "/home/chen1990/RepEx/src/radical/repex/pilot_kernels/pilot_kernel_scheme_3.py", line 141, in run_simulation
    runtime = (check_point - sim_start).total_seconds()
AttributeError: 'datetime.timedelta' object has no attribute 'total_seconds'
Exception in thread Thread-1 (most likely raised during interpreter shutdown)

However, when I run qstat I can still see the job with the status "R". This is the output of "qstat -u chen1990" several minutes after the crash:

trestles-fe1.local:
                                                                       Req'd  Req'd      Elap
Job ID                 Username Queue  Jobname          SessID NDS TSK Memory Time     S Time
2211587.trestles-fe1.l chen1990 normal SAGA-Python-PBSJ 35523  1   32  --     00:20:00 R 00:12:53

Just wondering what causes the error and why the PBS job still keeps running.

Thanks! Haoyuan

antonst commented 10 years ago

Thanks for trying it out, Haoyuan!

I am assuming you are using Python 2.6.x? timedelta.total_seconds() is a new feature in Python 2.7, see: http://docs.python.org/2/library/datetime.html#datetime.timedelta.total_seconds

It is possible to create a workaround for this, but there are a number of other features in RepEx and RP which will require Python 2.7 anyway. Since Python 2.7 was released more than four years ago, I think it is reasonable to expect users to have it installed.
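For reference, a minimal sketch of what such a workaround could look like on Python 2.6. The helper name total_seconds_compat is hypothetical and not part of RepEx; the formula is the one the Python documentation gives as equivalent to total_seconds(), and the two timestamps are stand-ins for the sim_start and check_point values named in the traceback.

```python
from datetime import datetime, timedelta


def total_seconds_compat(delta):
    """Return the length of a timedelta in seconds as a float.

    Hypothetical helper (not part of RepEx); works on Python 2.6 and later.
    """
    # Equivalent formula from the Python docs for timedelta.total_seconds().
    micros = delta.microseconds + (delta.seconds + delta.days * 24 * 3600) * 10**6
    return micros / float(10**6)


# Stand-in values for the sim_start / check_point names seen in the traceback.
sim_start = datetime(2014, 1, 1, 12, 0, 0)
check_point = sim_start + timedelta(minutes=12, seconds=53)
runtime = total_seconds_compat(check_point - sim_start)
print("runtime in seconds: %s" % runtime)
```

This only addresses the total_seconds() call; it does not cover the other Python 2.7 features mentioned above.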

To answer your question about the PBS job: even if your application has crashed, the Pilot will continue to run on the resource and wait for CUs to execute for as long as the resources are allocated, since the error was not "on the pilot end". So it is a good idea to check your running jobs and cancel them manually if this happens.
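As a rough illustration of that manual check, the sketch below scans "qstat -u" output for jobs in state "R" and prints a qdel command for each. The function name running_job_ids is hypothetical, the column layout is assumed to match the output quoted earlier in this thread (11 columns, state in the tenth), and passing only the numeric part of the job ID to qdel is an assumption about the local PBS setup.

```python
def running_job_ids(qstat_text):
    """Return the numeric IDs of jobs whose state column is "R".

    Assumes the `qstat -u` layout quoted in this thread: 11 columns per job
    row, with the one-letter state in the tenth column.
    """
    job_ids = []
    for line in qstat_text.splitlines():
        fields = line.split()
        # Job rows start with a numeric ID; header lines do not.
        if len(fields) == 11 and fields[0][:1].isdigit():
            if fields[9] == "R":
                # qstat truncates the server name, so keep only the number.
                job_ids.append(fields[0].split(".")[0])
    return job_ids


sample = """\
trestles-fe1.local:
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
2211587.trestles-fe1.l chen1990 normal SAGA-Python-PBSJ 35523 1 32 -- 00:20:00 R 00:12:53
"""

for job_id in running_job_ids(sample):
    # Print the command rather than running it, so nothing is cancelled by accident.
    print("qdel " + job_id)
```

Printing the commands first makes it easy to review which jobs would be cancelled before running anything.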

Thanks, Antons

haoyuanchen commented 10 years ago

Thanks Antons! I tried doing a "module load python/2.7.5" and then running the same thing, but it crashed immediately with this error:

Error: Couldn't create new session: None
Traceback (most recent call last):
  File "launch_simulation_scheme_3_amber.py", line 63, in <module>
    pilot_kernel.run_simulation( replicas, pilot_object, session, md_kernel )
  File "/home/chen1990/RepEx/src/radical/repex/pilot_kernels/pilot_kernel_scheme_3.py", line 68, in run_simulation
    unit_manager = radical.pilot.UnitManager(session=session, scheduler=radical.pilot.SCHED_ROUND_ROBIN)
  File "/home/chen1990/.local/lib/python2.6/site-packages/radical.pilot-0.18.RC2-py2.6.egg/radical/pilot/unit_manager.py", line 112, in __init__
    db_connection=session._dbs,
AttributeError: 'NoneType' object has no attribute '_dbs'

I tried to use RADICAL_PILOT_VERBOSE=info and I saw that my radical pilot seems to be installed under python2.6/site-packages. Do I need to reinstall radical pilot and saga-python with the python2.7 module loaded?

Thanks a lot! Haoyuan

antonst commented 10 years ago

Haoyuan, there is no need to do module load python2.7.x on the cluster; this is already handled by RP. What you need to do is install Python 2.7 on your laptop (the machine you are running the RepEx code from) and then reinstall everything. If you type python -V in your virtual environment right now, it will say Python 2.6.x, right? It should say Python 2.7.x. If there are any problems getting this to work, please let me know. For example, when you have multiple Python versions on your machine, you need to use the -p flag to point to the specific Python version to be used with your virtual environment:

$ virtualenv -p /usr/bin/python2.7 $HOME/exenv

and then you can call:

$ source $HOME/exenv/bin/activate
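Once the environment is activated, a quick check along these lines can confirm which interpreter is running and where the packages are imported from. This is only a sketch: the helper name module_location is hypothetical, and the import names radical.pilot and saga are assumed to be the ones used by RP and saga-python.

```python
import sys


def module_location(name):
    """Return the file a module is imported from, or None if it cannot be imported."""
    try:
        # A non-empty fromlist makes __import__ return the leaf module.
        module = __import__(name, globals(), locals(), ["__file__"])
    except ImportError:
        return None
    return getattr(module, "__file__", None)


print("interpreter: %s" % sys.executable)
print("version:     %d.%d.%d" % sys.version_info[:3])
# Assumed import names for RP and saga-python.
for name in ("radical.pilot", "saga"):
    print("%s -> %s" % (name, module_location(name)))
```

If a reported path contains python2.6/site-packages while the interpreter is 2.7, the package is being picked up from an older installation.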

Thanks, Antons

haoyuanchen commented 10 years ago

Hi Antons,

I installed python2.7 on trestles (I'm running repex from there) and reinstalled everything (RP, saga-python and RepEx) following your instructions. However, it gave me the same error. "python -V" says 2.7 and "radicalpilot-version" says 0.18.RC2.

I'm not sure why this happens, but I'm guessing it's because I installed py2.7 locally, using the command "make altinstall prefix=~ exec-prefix=~". I did this since I don't (and can't) have root/sudo permission there.

Thanks! Haoyuan