Closed by mturilli 4 years ago
My friend is Dr. Davide Del Vento at ddvento@ucar.edumailto:ddvento@ucar.edu
You can try asking him directly.
Guido
Thanks Guido, I will do that.
I got the required information. CISL supports virtualenv in its custom wrapper. Applying this to our stack needs some work, but should be doable. Alas, Cheyenne went up and down last week, so I did not make progress toward implementing that...
Large PSU allocation: 1 to 2K core hours per experiment, several experiments, and a need to scale. This will be required by the end of October.
This is now escalated. This might be required to update a presentation coming up in 2 months. We should see whether fixing Cheyenne is reasonably fast; if not, we should move to a different resource, to be determined.
We got feedback from the sysadmins and we are now testing a new virtual environment. We should know whether it works by the beginning of next week.
See radical-cybertools/radical.pilot/pull/1999
Hooray! I'm going to give it a try as soon as possible! Thank you!
@Weiming-Hu : Excellent, thank you! Please note that you need the devel branches for radical.utils and radical.saga, and remember that this is Python 3 :-) My module set is:
(ve) cheyenne2 amerzky ~/radical/radical.pilot [fix/cheyenne] $ module list
Currently Loaded Modules:
1) python/3.6.8 2) ncarenv/1.3 3) gnu/8.3.0 4) netcdf/4.6.3 5) ncarcompilers/0.5.0 6) mpt/2.19
I did not yet try the Intel compilers or OpenMPI - please let me know if you need those, then we would add some configuration options.
That sounds good. Thanks.
I ported my code to Python 3. There seems to be a small issue with MongoDB (mlab).
Setting up RabbitMQ system n/a
new session: [re.session.cheyenne5.wuh20.018285.0003] \
database : [mongodb://wuh20:example123@ds137271.mlab.com:37271/entk] err
All components terminated
Traceback (most recent call last):
  File "/glade/u/home/wuh20/venv/lib/python3.7/site-packages/pymongo/mongo_client.py", line 1384, in _retry_with_session
    return func(session, sock_info, retryable)
  File "/glade/u/home/wuh20/venv/lib/python3.7/site-packages/pymongo/collection.py", line 595, in _insert_command
    retryable_write=retryable_write)
  File "/glade/u/home/wuh20/venv/lib/python3.7/site-packages/pymongo/pool.py", line 613, in command
    user_fields=user_fields)
  File "/glade/u/home/wuh20/venv/lib/python3.7/site-packages/pymongo/network.py", line 167, in command
    parse_write_concern_error=parse_write_concern_error)
  File "/glade/u/home/wuh20/venv/lib/python3.7/site-packages/pymongo/helpers.py", line 159, in _check_command_response
    raise OperationFailure(msg % errmsg, code, response)
pymongo.errors.OperationFailure: Transaction numbers are only allowed on storage engines that support document-level locking

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/glade/u/home/wuh20/venv/lib/python3.7/site-packages/radical/pilot/session.py", line 173, in _initialize_primary
    cfg=self._cfg, log=self._log)
  File "/glade/u/home/wuh20/venv/lib/python3.7/site-packages/radical/pilot/db/database.py", line 86, in __init__
    'connected' : self._connected})
  File "/glade/u/home/wuh20/venv/lib/python3.7/site-packages/pymongo/collection.py", line 3182, in insert
    check_keys, manipulate, write_concern)
  File "/glade/u/home/wuh20/venv/lib/python3.7/site-packages/pymongo/collection.py", line 612, in _insert
    bypass_doc_val, session)
  File "/glade/u/home/wuh20/venv/lib/python3.7/site-packages/pymongo/collection.py", line 600, in _insert_one
    acknowledged, _insert_command, session)
  File "/glade/u/home/wuh20/venv/lib/python3.7/site-packages/pymongo/mongo_client.py", line 1491, in _retryable_write
    return self._retry_with_session(retryable, func, s, None)
  File "/glade/u/home/wuh20/venv/lib/python3.7/site-packages/pymongo/mongo_client.py", line 1425, in _retry_with_session
    raise OperationFailure(errmsg, exc.code, exc.details)
pymongo.errors.OperationFailure: This MongoDB deployment does not support retryable writes. Please add retryWrites=false to your connection string.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./runme.py", line 181, in <module>
    amgr.run()
  File "/glade/u/home/wuh20/venv/lib/python3.7/site-packages/radical/entk/appman/appmanager.py", line 414, in run
    self._rmgr._submit_resource_request()
  File "/glade/u/home/wuh20/venv/lib/python3.7/site-packages/radical/entk/execman/rp/resource_manager.py", line 148, in _submit_resource_request
    self._session = rp.Session(uid=self._sid)
  File "/glade/u/home/wuh20/venv/lib/python3.7/site-packages/radical/pilot/session.py", line 147, in __init__
    self._initialize_primary(dburl)
  File "/glade/u/home/wuh20/venv/lib/python3.7/site-packages/radical/pilot/session.py", line 186, in _initialize_primary
    raise RuntimeError ('session create failed [%s]' % dburl)
RuntimeError: session create failed [mongodb://wuh20:example123@ds137271.mlab.com:37271/entk]
@Weiming-Hu, as discussed in our meeting, this is probably due to using mlab, which is no longer supported by RCT after the MongoDB acquisition. Please ping me on Slack and I will give you an alternative MongoDB endpoint.
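In the meantime, the error message itself points at a possible stop-gap: appending `retryWrites=false` to the connection string disables retryable writes, which this deployment does not support. A minimal sketch (the endpoint below is just the one from the log; whether mlab then behaves is untested):

```python
def disable_retryable_writes(dburl):
    """Append retryWrites=false to a MongoDB connection string,
    preserving any query parameters that are already present."""
    sep = '&' if '?' in dburl else '?'
    return dburl + sep + 'retryWrites=false'

# the endpoint from the traceback above
print(disable_retryable_writes(
    'mongodb://wuh20:example123@ds137271.mlab.com:37271/entk'))
# -> mongodb://wuh20:example123@ds137271.mlab.com:37271/entk?retryWrites=false
```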
@andre-merzky , @Weiming-Hu and I tested his script on Cheyenne. We get the following error:
verify python viability: /glade/scratch/amerzky/radical.pilot.sandbox/ve.rp.cheyenne.2019.11.15/bin/python ... ok
verify module viability: radical.pilot ...Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/glade/scratch/wuh20/radical.pilot.sandbox/re.session.cheyenne3.wuh20.018311.0011/pilot.0000/rp_install/lib/python3.6/site-packages/radical/pilot/__init__.py", line 10, in <module>
    import radical.utils as _ru
  File "/glade/scratch/wuh20/radical.pilot.sandbox/re.session.cheyenne3.wuh20.018311.0011/pilot.0000/rp_install/lib/python3.6/site-packages/radical/utils/__init__.py", line 14, in <module>
    from .plugin_manager import PluginManager
  File "/glade/scratch/wuh20/radical.pilot.sandbox/re.session.cheyenne3.wuh20.018311.0011/pilot.0000/rp_install/lib/python3.6/site-packages/radical/utils/plugin_manager.py", line 15, in <module>
    from .logger import Logger
  File "/glade/scratch/wuh20/radical.pilot.sandbox/re.session.cheyenne3.wuh20.018311.0011/pilot.0000/rp_install/lib/python3.6/site-packages/radical/utils/logger.py", line 46, in <module>
    from .config import DefaultConfig
  File "/glade/scratch/wuh20/radical.pilot.sandbox/re.session.cheyenne3.wuh20.018311.0011/pilot.0000/rp_install/lib/python3.6/site-packages/radical/utils/config.py", line 125, in <module>
    import munch
ModuleNotFoundError: No module named 'munch'
failed
python installation cannot load module radical.pilot - abort
We use a static VE on Cheyenne, and that was created before the RCT stack introduced munch as a new dependency. This can be fixed by installing the module in the static VE, like so:
$ source /glade/scratch/amerzky/radical.pilot.sandbox/ve.rp.cheyenne.2019.11.15/bin/activate
$ pip install munch
$ deactivate
Let me do that (you may not have write permissions).
So, in a sad turn of events, it seems that we are back to square one with our Cheyenne support. We now face the exact same problem we originally had with virtualenv, but now when using ncar_pylib. As a recap: last time we did not get any useful feedback from support, so Guido pointed us to a contact who suggested using ncar_pylib instead of virtualenv. That got us running - but now we see the exact same problem with ncar_pylib.
The NCAR documentation is here. Following that (and that's what we did in the past), this should work:
$ ncar_pylib -c 20190627 `pwd`/ve.rp
$ ncar_pylib ve.rp
(ve.rp) $ pip install setproctitle
(ve.rp) $ python -c 'import setproctitle'
but that now results in:
$ python -c 'import setproctitle'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ImportError: /glade/u/home/amerzky/radical/radical.pilot/ve.rp/lib/python3.6/site-packages/setproctitle.cpython-36m-x86_64-linux-gnu.so: undefined symbol: __intel_sse2_strdup
So, now we can experiment around a bit and open another ticket - but we are back to where we were: no working Python on Cheyenne. I'll open a new ticket for this.
Hello,
At this point, can we use a different system other than Cheyenne? We are really in a rush to produce results, and if there is a way to get it to run elsewhere, then that is a perfectly fine solution.
Guido
Hi Guido @Cervone, that's probably sensible. The latest iteration with Cheyenne support was useless again - I'll keep that channel open of course, but ...
Matteo and I discussed this, and we should try to get you running on Stampede2 meanwhile, just to get unstuck. Can you remind me, please: do you have an account / allocation on Stampede2?
Best, Andre.
Weiming, what do you think?
I don't think it's a problem for me. I need to move the dataset. As soon as I'm assigned an account and allocation, I can start running.
@mturilli: can you advise on account creation, please?
@mturilli I might already have one. I can log onto the XSEDE portal and Stampede2. Does that mean I can start transferring directly? I'm not sure where I should put my data though. Could you help me with that? Thank you.
@Weiming-Hu do you see your allocation on Stampede2 from your xsede account? How much data do you have to transfer?
I can see my allocation status: 2714.0 / 95896.0 SUs. I suppose I can use this allocation?
For data storage, I need to transfer about 10 TB.
Yes, that means you have access to our allocation. 10 TB is a lot of data in XSEDE terms. We will have to see how to get you that space, but meanwhile you could use $SCRATCH on Stampede2. It has no space limit, but files are subject to a purge policy once their access time is more than 10 days old. I know this is suboptimal, but I also know you are in a hurry.
From https://portal.xsede.org/tacc-stampede2#table3: "The operating system updates a file's access time when that file is modified on a login or compute node. Reading or executing a file/script on a login node does not update the access time, but reading or executing on a compute node does update the access time. This approach helps us distinguish between routine management tasks (e.g. tar, scp) and production use. Use the command "ls -ul" to view access times."
Sounds good. I'll start doing this right away. Thank you.
Moved to Stampede2, avoiding Cheyenne for the foreseeable future.
If we ever tackle this again, we may want to try our own Python deployment.
The discussions with NCAR's support staff are somewhat ridiculous, to be blunt, and I don't see this being resolved anytime soon. Do we have any other line of support for Cheyenne apart from the official support channels? @mturilli : any idea who we could contact?