radical-collaboration / extasy-bpti

Libpython shared libraries error #16

Open kevloui opened 5 years ago

kevloui commented 5 years ago

During our first attempt at running the extasy-bpti workflow on Blue Waters, we encountered an error where the pilot job stalls. Looking at the bootstrap_0.out file on Blue Waters, we noticed lines such as:

 /mnt/c/scratch/sciteam/louison/radical.pilot.sandbox/ve.ncsa.bw_aprun.0.50.21/bin/python2.7: error while loading shared libraries: libpython2.7.so.1.0: cannot open shared object file: No such file or directory

I have attached the whole of the bootstrap_0.out file here.

bootstrap_0.out.log

Thank you, hope you can help.
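An error of this kind means the dynamic linker cannot find a library the interpreter was linked against. The following is a minimal sketch, not part of the original thread, of how one might confirm that with `ldd`; the function name `check_interpreter` is hypothetical, and the path passed to it is whatever interpreter the virtualenv contains.

```shell
# Hypothetical helper: report whether the dynamic linker can resolve
# every shared library that an interpreter binary depends on.
check_interpreter() {
    py="$1"
    if [ ! -x "$py" ]; then
        echo "not executable: $py"
        return 2
    fi
    # ldd prints "libfoo.so => not found" for each unresolved dependency
    if ldd "$py" 2>/dev/null | grep -q 'not found'; then
        echo "unresolved libraries:"
        ldd "$py" 2>/dev/null | grep 'not found'
        return 1
    fi
    echo "all shared libraries resolved"
}
```

For example, `check_interpreter ve.ncsa.bw_aprun.0.50.21/bin/python2.7`, run from the pilot sandbox, would be expected to list libpython2.7.so.1.0 as unresolved if the virtualenv is broken in the way the log suggests.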

vivek-bala commented 5 years ago

Hey @Keverne, thanks for the logs and the pointer to the error. Can you confirm that you created a static virtual environment on Blue Waters? The steps to do so are documented in https://github.com/radical-collaboration/extasy-bpti/blob/feature/entk-0.7/gmxcoco-bpti/instructions.md#instructions-to-setup-radical-pilot-gromacs-and-coco-on-blue-waters.

andre-merzky commented 5 years ago

@Keverne: this does indeed look like a virtualenv setup problem: the bwpy module on BW has recently been updated. If the problems persist while or after recreating the VE with the instructions given by Vivek, please update this ticket. Thanks!

ChrisSuess commented 5 years ago

Thanks @andre-merzky and @vivek-bala. We hadn't done the additional step of setting up a virtual environment on Blue Waters. Will try again and keep you posted!

ChrisSuess commented 5 years ago

Hi @andre-merzky and @vivek-bala. I have followed the steps linked above but am still getting a similar error message:

python installation (/mnt/c/scratch/sciteam/suess/radical.pilot.sandbox/ve.ncsa.bw_aprun.0.50.21/bin/python) is not usable - abort

Any idea what could be the problem? What log files would be useful for you to troubleshoot?

vivek-bala commented 5 years ago

Please attach all files that match bootstrap_*.*. They should give us an idea of why it fails.
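One way to gather those files into a single attachment is sketched below. This is an illustration only: the function name `collect_bootstrap_logs` is hypothetical, and the directory holding the bootstrap files is assumed to be the pilot's sandbox directory, whose exact path is not given in this thread.

```shell
# Hypothetical helper: bundle every file matching bootstrap_*.* from one
# directory into a gzip-compressed tar archive.
collect_bootstrap_logs() {
    sandbox="$1"   # directory that contains the bootstrap_* files (assumption)
    archive="$2"   # output archive, relative to the caller's working directory
    # cd in a subshell so the archive stores bare file names; the redirect
    # is resolved by the calling shell, so a relative archive path works
    ( cd "$sandbox" && tar -czf - bootstrap_*.* ) > "$archive"
}
```

If no file matches, the glob is passed to tar unexpanded, tar reports an error, and the resulting archive should be discarded.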

ChrisSuess commented 5 years ago

This is what I get; I hope it makes some sense to you!

bootstrap_0.out.log

vivek-bala commented 5 years ago

/mnt/c/scratch/sciteam/suess/radical.pilot.sandbox/ve.ncsa.bw_aprun.0.50.21/bin/python: error while loading shared libraries: libpython2.7.so.1.0: cannot open shared object file: No such file or directory

It seems similar to the original error in the ticket even after recompiling using the instructions above. I'm not too sure about the source of the error right now. @andre-merzky do you see something incorrect with the procedure/instructions?

andre-merzky commented 5 years ago

Hey Chris - there still seems to be something wrong with your virtualenv. I am sorry to put you through this, I know it's annoying, but can you please start over again? Please make sure that the ve you see there is indeed ve.ncsa.bw_aprun.0.50.21.

$ cd /scratch/sciteam/$USER/radical.pilot.sandbox
$ rm -rf ve.ncsa.bw_aprun.0.50.21
$ wget https://raw.githubusercontent.com/radical-cybertools/radical.pilot/devel/bin/radical-pilot-create-static-ve
$ sh ./radical-pilot-create-static-ve ve.ncsa.bw_aprun.0.50.21 bw

Please note the bw argument at the end. Alas, the BW Python installation is very 'special', and our setup script tries to handle that special case if that argument is present.

Please capture all output and post it here. Once the ve exists, please try the following to check if it is viable:

$ module load python 
$ bwp-environment
$ source ve.ncsa.bw_aprun.0.50.21/bin/activate
$ which python
$ python -V
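
The manual check above could be wrapped in a small function along the following lines. This is a sketch under assumptions: the name `ve_is_viable` is hypothetical, the module environment from the preceding commands is assumed to be in place already, and the activate script is assumed to export VIRTUAL_ENV, as virtualenv's standard activate script does.

```shell
# Hypothetical helper: activate a virtualenv in a subshell and confirm that
# `python` resolves to a binary inside it and actually starts.
ve_is_viable() {
    ve="$1"
    if [ ! -f "$ve/bin/activate" ]; then
        echo "no activate script in $ve"
        return 1
    fi
    (
        # the subshell keeps the activation from leaking into the caller
        . "$ve/bin/activate"
        case "$(command -v python)" in
            "$VIRTUAL_ENV"/*) python -V ;;
            *) echo "python does not resolve inside $ve"; exit 1 ;;
        esac
    )
}
```

A non-zero return status, or a shared-library error printed by `python -V`, would indicate that the ve is still not usable.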

ChrisSuess commented 5 years ago

Hi @andre-merzky, your BW magic tricks have seemed to work. I will add these to the setup scripts!

Quick note: I did have to run bwpy-environ instead of bwp-environment; maybe this has been updated.

Do I have to keep an active connection to Blue Waters, with the VE running all the time, for this to work?

andre-merzky commented 5 years ago

> Hi @andre-merzky, your BW magic tricks have seemed to work. I will add these to the setup scripts!

Great!

> Quick note: I did have to run bwpy-environ instead of bwp-environment; maybe this has been updated.

Ah, I may have typed this incorrectly from memory, apologies.

> Do I have to keep an active connection to Blue Waters, with the VE running all the time, for this to work?

This depends somewhat on the use case, and @vivek-bala may have more insight into that, but in general our stack is not able to disconnect and reconnect while tasks are running. So you either need a gsissh setup on BW to stay connected, or need to run your application on a head node.

In either case, though, you can use screen or tmux to disconnect your terminal session, or run the application as a shell background process. The network connectivity between the application and BW needs to be stable in all cases.
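
As an illustration of the background-process option, a minimal sketch follows. The function name `run_detached` and the script name in the usage note are hypothetical; the only tools assumed are the standard `nohup` utility and POSIX shell job control.

```shell
# Hypothetical helper: start a command that survives the terminal closing.
# Output goes to a log file, and the PID is printed so the process can be
# checked or stopped later.
run_detached() {
    log="$1"
    shift
    # nohup ignores the hangup signal sent when the terminal goes away;
    # stdin is detached so the process never waits on the terminal
    nohup "$@" > "$log" 2>&1 < /dev/null &
    echo $!
}
```

For example, `pid=$(run_detached workflow.log python my_workflow.py)` would start a (hypothetical) workflow script in the background. With tmux, the equivalent is to start a session with `tmux new -s extasy`, launch the application inside it, detach with Ctrl-b d, and return later with `tmux attach -t extasy`.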

PS: we are currently planning how to implement disconnect / reconnect, but that feature is unlikely to arrive soon.