radical-collaboration / hpc-workflows

NSF16514 EarthCube Project - Award Number: 1639694

EnTK is broken on Cheyenne #97

Closed mturilli closed 4 years ago

andre-merzky commented 4 years ago

The discussions with NCAR's support staff are somewhat ridiculous, to be blunt, and I don't see this being resolved anytime soon. Do we have any other line of support for Cheyenne apart from the official support channels? @mturilli : any idea who we could contact?

cervone commented 4 years ago

My friend is Dr. Davide Del Vento at ddvento@ucar.edu

You can try asking him directly.

Guido

andre-merzky commented 4 years ago

Thanks Guido, I will do that.

andre-merzky commented 4 years ago

I got the required information: CISL supports virtualenv in its custom wrapper. Applying this to our stack needs some work, but should be doable. Alas, Cheyenne went up and down last week, so I did not make progress toward implementing that...

mturilli commented 4 years ago

Large PSU allocation: 1 to 2K core hours per experiment, several experiments, and a need for scale. This will be required by the end of October.

mturilli commented 4 years ago

This is now escalated: it might be required to update a presentation coming up in 2 months. We should see whether fixing Cheyenne is reasonably fast; if not, we should move to a different resource, to be determined.

mturilli commented 4 years ago

We got feedback from the sysadmins and we are now testing a new virtual environment. We should know whether it works by the beginning of next week.

andre-merzky commented 4 years ago

See radical-cybertools/radical.pilot/pull/1999

Weiming-Hu commented 4 years ago

Hooray! I'm going to give it a try as soon as possible! Thank you!


andre-merzky commented 4 years ago

@Weiming-Hu : Excellent, thank you! Please note that you need the devel branches for radical.utils and radical.saga, and remember that this is Python 3 :-) My module set is:

(ve)  cheyenne2  amerzky  ~/radical/radical.pilot  [fix/cheyenne] $ module list

Currently Loaded Modules:
  1) python/3.6.8   2) ncarenv/1.3   3) gnu/8.3.0   4) netcdf/4.6.3   5) ncarcompilers/0.5.0   6) mpt/2.19

I have not yet tried the Intel compilers or OpenMPI - please let me know if you need those, and we will add some configuration options.

Weiming-Hu commented 4 years ago

That sounds good. Thanks.


Weiming-Hu commented 4 years ago

I ported my code to Python 3. There seems to be a small issue with MongoDB (mlab).


Setting up RabbitMQ system                                                   n/a
new session: [re.session.cheyenne5.wuh20.018285.0003]                          \
database   : [mongodb://wuh20:example123@ds137271.mlab.com:37271/entk]       err
All components terminated
Traceback (most recent call last):
  File "/glade/u/home/wuh20/venv/lib/python3.7/site-packages/pymongo/mongo_client.py", line 1384, in _retry_with_session
    return func(session, sock_info, retryable)
  File "/glade/u/home/wuh20/venv/lib/python3.7/site-packages/pymongo/collection.py", line 595, in _insert_command
    retryable_write=retryable_write)
  File "/glade/u/home/wuh20/venv/lib/python3.7/site-packages/pymongo/pool.py", line 613, in command
    user_fields=user_fields)
  File "/glade/u/home/wuh20/venv/lib/python3.7/site-packages/pymongo/network.py", line 167, in command
    parse_write_concern_error=parse_write_concern_error)
  File "/glade/u/home/wuh20/venv/lib/python3.7/site-packages/pymongo/helpers.py", line 159, in _check_command_response
    raise OperationFailure(msg % errmsg, code, response)
pymongo.errors.OperationFailure: Transaction numbers are only allowed on storage engines that support document-level locking

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/glade/u/home/wuh20/venv/lib/python3.7/site-packages/radical/pilot/session.py", line 173, in _initialize_primary
    cfg=self._cfg, log=self._log)
  File "/glade/u/home/wuh20/venv/lib/python3.7/site-packages/radical/pilot/db/database.py", line 86, in __init__
    'connected' : self._connected})
  File "/glade/u/home/wuh20/venv/lib/python3.7/site-packages/pymongo/collection.py", line 3182, in insert
    check_keys, manipulate, write_concern)
  File "/glade/u/home/wuh20/venv/lib/python3.7/site-packages/pymongo/collection.py", line 612, in _insert
    bypass_doc_val, session)
  File "/glade/u/home/wuh20/venv/lib/python3.7/site-packages/pymongo/collection.py", line 600, in _insert_one
    acknowledged, _insert_command, session)
  File "/glade/u/home/wuh20/venv/lib/python3.7/site-packages/pymongo/mongo_client.py", line 1491, in _retryable_write
    return self._retry_with_session(retryable, func, s, None)
  File "/glade/u/home/wuh20/venv/lib/python3.7/site-packages/pymongo/mongo_client.py", line 1425, in _retry_with_session
    raise OperationFailure(errmsg, exc.code, exc.details)
pymongo.errors.OperationFailure: This MongoDB deployment does not support retryable writes. Please add retryWrites=false to your connection string.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./runme.py", line 181, in <module>
    amgr.run()
  File "/glade/u/home/wuh20/venv/lib/python3.7/site-packages/radical/entk/appman/appmanager.py", line 414, in run
    self._rmgr._submit_resource_request()
  File "/glade/u/home/wuh20/venv/lib/python3.7/site-packages/radical/entk/execman/rp/resource_manager.py", line 148, in _submit_resource_request
    self._session = rp.Session(uid=self._sid)
  File "/glade/u/home/wuh20/venv/lib/python3.7/site-packages/radical/pilot/session.py", line 147, in __init__
    self._initialize_primary(dburl)
  File "/glade/u/home/wuh20/venv/lib/python3.7/site-packages/radical/pilot/session.py", line 186, in _initialize_primary
    raise RuntimeError ('session create failed [%s]' % dburl)
RuntimeError: session create failed [mongodb://wuh20:example123@ds137271.mlab.com:37271/entk]
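The second traceback spells out the fix: the mlab deployment does not support retryable writes, so `retryWrites=false` has to be appended to the connection string. A minimal sketch of that, with a hypothetical helper name and placeholder credentials (neither is part of RCT):

```python
# Hypothetical helper: append retryWrites=false to a MongoDB connection
# string, as the OperationFailure message above suggests. The URI below
# is a placeholder, not a real endpoint.
def disable_retryable_writes(uri):
    # Respect an existing query string, if any.
    sep = "&" if "?" in uri else "?"
    return uri + sep + "retryWrites=false"

print(disable_retryable_writes("mongodb://user:secret@host.example.com:37271/entk"))
# mongodb://user:secret@host.example.com:37271/entk?retryWrites=false
```

The same option can also be passed to pymongo's `MongoClient` as the keyword argument `retryWrites=False` instead of editing the URI.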

mturilli commented 4 years ago

@Weiming-Hu, as discussed in our meeting, this is probably due to using mlab which, after its acquisition by MongoDB, is no longer supported by RCT. Please ping me on Slack and I will give you an alternative MongoDB endpoint.

mturilli commented 4 years ago

@andre-merzky , @Weiming-Hu and I tested his script on Cheyenne. We get the following error:

verify python viability: /glade/scratch/amerzky/radical.pilot.sandbox/ve.rp.cheyenne.2019.11.15/bin/python ... ok
verify module viability: radical.pilot   ...Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/glade/scratch/wuh20/radical.pilot.sandbox/re.session.cheyenne3.wuh20.018311.0011/pilot.0000/rp_install/lib/python3.6/site-packages/radical/pilot/__init__.py", line 10, in <module>
    import radical.utils as _ru
  File "/glade/scratch/wuh20/radical.pilot.sandbox/re.session.cheyenne3.wuh20.018311.0011/pilot.0000/rp_install/lib/python3.6/site-packages/radical/utils/__init__.py", line 14, in <module>
    from .plugin_manager import PluginManager
  File "/glade/scratch/wuh20/radical.pilot.sandbox/re.session.cheyenne3.wuh20.018311.0011/pilot.0000/rp_install/lib/python3.6/site-packages/radical/utils/plugin_manager.py", line 15, in <module>
    from .logger    import Logger
  File "/glade/scratch/wuh20/radical.pilot.sandbox/re.session.cheyenne3.wuh20.018311.0011/pilot.0000/rp_install/lib/python3.6/site-packages/radical/utils/logger.py", line 46, in <module>
    from   .config    import DefaultConfig
  File "/glade/scratch/wuh20/radical.pilot.sandbox/re.session.cheyenne3.wuh20.018311.0011/pilot.0000/rp_install/lib/python3.6/site-packages/radical/utils/config.py", line 125, in <module>
    import munch
ModuleNotFoundError: No module named 'munch'
 failed
python installation cannot load module radical.pilot - abort

andre-merzky commented 4 years ago

We use a static VE on Cheyenne, and it was created before the RCT stack introduced munch as a new dependency. This can be fixed by installing the module into the static VE, like:

$ source /glade/scratch/amerzky/radical.pilot.sandbox/ve.rp.cheyenne.2019.11.15/bin/activate
$ pip install munch
$ deactivate
andre-merzky commented 4 years ago

Let me do that (you may not have write permissions)

andre-merzky commented 4 years ago

So, in a sad turn of events, it seems we are back to square one with our Cheyenne support. We now face the exact same problem we originally had with virtualenv when using ncar_pylib. As a recap: last time we did not get any useful feedback from support, so Guido pointed us to a contact who suggested using ncar_pylib instead of virtualenv. That got us running - but now we see the exact same problem with ncar_pylib.

The NCAR documentation is here. Following that (and that's what we did in the past), this should work:

$ ncar_pylib -c 20190627 `pwd`/ve.rp
$ ncar_pylib ve.rp
(ve.rp) $ pip install setproctitle
(ve.rp) $ python -c 'import setproctitle'

but that now results in

$ python -c 'import setproctitle'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ImportError: /glade/u/home/amerzky/radical/radical.pilot/ve.rp/lib/python3.6/site-packages/setproctitle.cpython-36m-x86_64-linux-gnu.so: undefined symbol: __intel_sse2_strdup

So now we can experiment around a bit and open another ticket - but we are back to where we were: no working Python on Cheyenne. I'll open a new ticket for this.
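An undefined symbol like `__intel_sse2_strdup` suggests the extension was built against the Intel runtime but loaded without it. A quick way to reproduce such a failure outside the pilot bootstrap is to attempt the import in a clean subprocess and capture the error; the helper below is a hypothetical diagnostic sketch, not part of RCT:

```python
# Hypothetical diagnostic: try importing a module in a fresh interpreter
# and return the final error line (e.g. an undefined-symbol ImportError),
# or None if the import succeeds.
import subprocess
import sys

def import_error(module):
    proc = subprocess.run(
        [sys.executable, "-c", "import " + module],
        capture_output=True, text=True)
    if proc.returncode == 0:
        return None
    # Last stderr line carries the ImportError / undefined symbol message.
    return proc.stderr.strip().splitlines()[-1]
```

Running this against `setproctitle` inside the `ve.rp` environment would surface the `__intel_sse2_strdup` message directly, without going through a full pilot launch.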

cervone commented 4 years ago

Hello,

At this point, can we use a different system other than Cheyenne? We are really in a rush to produce results, and if there is a way to run elsewhere, that is a perfectly fine solution.

Guido

andre-merzky commented 4 years ago

Hi Guido @cervone, that's probably sensible. The first iteration with Cheyenne support was useless again - I'll keep that channel open of course, but ...

Matteo and I discussed this, and we should try to get you running on Stampede2 in the meantime, just to get unstuck. Can you remind me please: do you have accounts / an allocation on Stampede2?

Best, Andre.

cervone commented 4 years ago

Weiming, what do you think?

Weiming-Hu commented 4 years ago

I don't think it's a problem for me. I need to move the dataset. As soon as I'm assigned an account and allocation, I can start running.

andre-merzky commented 4 years ago

@mturilli: can you advise on account creation, please?

Weiming-Hu commented 4 years ago

@mturilli I might already have one. I can log onto the XSEDE portal and Stampede2. Does that mean I can start transferring directly? I'm not sure where I should put my data though. Could you help me with that? Thank you.

mturilli commented 4 years ago

@Weiming-Hu do you see your allocation on Stampede2 from your xsede account? How much data do you have to transfer?

Weiming-Hu commented 4 years ago

I can see the allocation status as 2714.0 / 95896.0 SUs. I suppose I can use this allocation?

For data storage, I need to transfer about 10 TB.


mturilli commented 4 years ago

Yes, that means you have access to our allocation. 10 TB in XSEDE terms is a lot of data. We will have to see how to get you that space but, meanwhile, you could use $SCRATCH on Stampede2. It has no space limit, but files are purged when their access time is more than 10 days old. I know this is suboptimal, but I also know you are in a hurry.

From https://portal.xsede.org/tacc-stampede2#table3: "The operating system updates a file's access time when that file is modified on a login or compute node. Reading or executing a file/script on a login node does not update the access time, but reading or executing on a compute node does update the access time. This approach helps us distinguish between routine management tasks (e.g. tar, scp) and production use. Use the command `ls -ul` to view access times."
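Since the purge is keyed on access time, a small script can flag files that are close to the 10-day window before they disappear. A sketch under the policy quoted above; the function name and purge constant are illustrative, not an official TACC tool:

```python
# Hypothetical helper: walk a directory tree and list files whose access
# time is older than the purge window, so they can be re-staged in time.
import os
import time

PURGE_DAYS = 10  # Stampede2 $SCRATCH purge window, per the docs quoted above

def files_at_risk(root, days=PURGE_DAYS):
    cutoff = time.time() - days * 86400
    at_risk = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.stat(path).st_atime < cutoff:
                at_risk.append(path)
    return at_risk
```

Note that, per the quoted policy, reading files on a login node does not update their access time, so running a scan like this from a login node will not accidentally "protect" stale files.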

Weiming-Hu commented 4 years ago

Sounds good. I'll start doing this right away. Thank you.


mturilli commented 4 years ago

Moved to Stampede2, avoiding Cheyenne for the foreseeable future.

andre-merzky commented 4 years ago

If we ever tackle this again, we may want to try our own Python deployment.