Open kevloui opened 5 years ago
Hey @Keverne, thanks for the logs and the pointer to the error. Can you confirm that you created a static virtual environment on Blue Waters? The steps to do so are documented in https://github.com/radical-collaboration/extasy-bpti/blob/feature/entk-0.7/gmxcoco-bpti/instructions.md#instructions-to-setup-radical-pilot-gromacs-and-coco-on-blue-waters.
@Keverne : this looks indeed like a virtualenv setup problem: the bwpy module on BW has recently been updated. If the problems persist while or after recreating the VE with the instructions given by Vivek, please update this ticket. Thanks!
Thanks @andre-merzky and @vivek-bala. We hadn't done the additional step of setting up a virtual environment on Blue Waters. Will try again and keep you posted!
Hi @andre-merzky and @vivek-bala. I have followed the steps linked above but am still getting a similar error message:
python installation (/mnt/c/scratch/sciteam/suess/radical.pilot.sandbox/ve.ncsa.bw_aprun.0.50.21/bin/python) is not usable - abort
Any idea what could be the problem? What log files would be useful for you to troubleshoot?
Please add all files that match bootstrap_*.* to this ticket. They should give us an idea about why it fails.
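One way to gather those files in a single step is sketched below. It is only an illustration: the function name `collect_bootstrap_logs` and the archive name are made up here, and it assumes GNU `find` and `tar` are available on the login node.

```shell
# Sketch: bundle every bootstrap_*.* file found under a sandbox directory
# into one compressed archive that can be attached to the ticket.
# Assumes GNU find/tar; the function and archive names are hypothetical.
collect_bootstrap_logs() {
    sandbox="${1:-.}"
    # Use an absolute archive path: a relative one would be resolved
    # inside the sandbox because of the cd below.
    archive="${2:-$PWD/pilot_bootstrap_logs.tar.gz}"
    ( cd "$sandbox" && find . -type f -name 'bootstrap_*.*' -print0 \
        | tar -czf "$archive" --null -T - )
}

# Example call (the path is the sandbox used in this thread):
# collect_bootstrap_logs /scratch/sciteam/$USER/radical.pilot.sandbox
```

Running `tar -tzf` on the resulting archive lists what was collected before uploading it.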
This is what I get, hope it makes some sense to you!
/mnt/c/scratch/sciteam/suess/radical.pilot.sandbox/ve.ncsa.bw_aprun.0.50.21/bin/python: error while loading shared libraries: libpython2.7.so.1.0: cannot open shared object file: No such file or directory
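An error like this can usually be narrowed down with `ldd`, which lists the shared libraries a binary needs and marks unresolved ones as "not found". The helper below is a minimal sketch, not part of the RADICAL tooling; the name `missing_libs` is hypothetical.

```shell
# Sketch: print the names of shared libraries that the dynamic loader
# cannot resolve for a given binary. Prints nothing if all are found.
missing_libs() {
    # ldd emits lines such as "libpython2.7.so.1.0 => not found"
    ldd "$1" 2>/dev/null | awk '/not found/ {print $1}'
}

# Example call against the interpreter from the error message:
# missing_libs /scratch/sciteam/$USER/radical.pilot.sandbox/ve.ncsa.bw_aprun.0.50.21/bin/python
```

If the library shows up as missing only when the Python module environment is not loaded, that points at the environment setup rather than at the virtualenv itself.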
It seems similar to the original error in the ticket even after recompiling using the instructions above. I'm not too sure about the source of the error right now. @andre-merzky do you see something incorrect with the procedure/instructions?
Hey Chris - there still seems to be something wrong with your virtualenv. I am sorry to put you through this, I know it's annoying, but can you please start over again? Please make sure that the ve you see there is indeed ve.ncsa.bw_aprun.0.50.21.
$ cd /scratch/sciteam/$USER/radical.pilot.sandbox
$ rm -rf ve.ncsa.bw_aprun.0.50.21
$ wget https://raw.githubusercontent.com/radical-cybertools/radical.pilot/devel/bin/radical-pilot-create-static-ve
$ sh ./radical-pilot-create-static-ve ve.ncsa.bw_aprun.0.50.21 bw
Please note the bw argument at the end. Alas, the BW Python installation is very 'special', and our setup script tries to handle that special case if that argument is present.
Please capture all output and post it here. Once the ve exists, please try the following to check if it is viable:
$ module load python
$ bwp-environment
$ source ve.ncsa.bw_aprun.0.50.21/bin/activate
$ which python
$ python -V
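The same viability check can be scripted so that it reports a clear result. This is a rough sketch under the assumption that the Python module environment has already been loaded in the current shell; the function name `check_ve` is hypothetical.

```shell
# Sketch: verify that the interpreter inside a virtualenv exists and runs.
# Returns 0 if usable, 1 if the interpreter is missing, 2 if it fails to
# start (for example because libpython cannot be loaded).
check_ve() {
    ve="$1"
    py="$ve/bin/python"
    if [ ! -x "$py" ]; then
        echo "no executable interpreter at $py" >&2
        return 1
    fi
    if ! "$py" -V >/dev/null 2>&1; then
        echo "$py exists but does not run" >&2
        return 2
    fi
    echo "$py is usable"
}

# Example call, run from the sandbox directory:
# check_ve ve.ncsa.bw_aprun.0.50.21
```

A return code of 2 corresponds to the "python installation is not usable" situation: the file is there, but it cannot start.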
Hi @andre-merzky, your BW magic tricks have seemed to work. I will add these to the setup scripts!
Quick note: I did have to run bwpy-environ instead of bwp-environment; maybe this has been updated.
Do I have to have an active connection to blue waters running with the VE running all the time for this to work?
Hi @andre-merzky, your BW magic tricks have seemed to work. I will add these to the setup scripts!
Great!
Quick note: I did have to run bwpy-environ instead of bwp-environment; maybe this has been updated.
Ah, I may have typed this incorrectly from memory, apologies.
Do I have to have an active connection to blue waters running with the VE running all the time for this to work?
This somewhat depends on the use case, and @vivek-bala may have more insight into that - but our stack is in general not able to disconnect and reconnect while tasks are running. So you either need a gsissh setup on BW to stay connected, or need to run your application on a headnode.
In either case though you can use screen or tmux to disconnect your terminal session, or run the application as shell background process. The network connectivity between application and BW needs to be stable in all cases.
PS.: we are currently planning how to implement disconnect / reconnect, but that feature is unlikely to arrive soon.
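For the background-process option mentioned above, a minimal sketch could look like the following. It assumes `nohup` is available; the helper name `run_detached` and the script name in the example call are placeholders, not part of the actual workflow.

```shell
# Sketch: start a command so that it keeps running after the terminal
# session ends, with all output captured in a log file.
# Prints the PID of the background process.
run_detached() {
    log="$1"
    shift
    nohup "$@" >"$log" 2>&1 &
    echo $!
}

# Example call (my_workflow.py is a placeholder name):
# run_detached workflow.log python my_workflow.py
```

This only detaches the terminal. The machine running the application still needs a stable network connection to BW for the whole run, as noted above.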
During the first attempt at running the extasy-bpti workflow on bluewaters, we encountered an error where the pilot job stalls. Looking at the bootstrap_0.out file on bluewaters we noticed lines such as:
I have attached the whole of the bootstrap_0.out file here.
bootstrap_0.out.log
Thank you, hope you can help.