radical-cybertools / radical.saga

A Light-Weight Access Layer for Distributed Computing Infrastructure and Reference Implementation of the SAGA Python Language Bindings.
http://radical-cybertools.github.io/saga-python/

pbspro support for anselm #629

Closed DanilaOleynik closed 7 years ago

DanilaOleynik commented 7 years ago

Hi Andre,

It looks like the PBS Pro notation changed a little again in version 13: "#PBS -l nodes=4:ppn=16" is gone, and the request now looks like "#PBS -l select=1:ncpus=3:mem=1gb" (http://www.pbsworks.com/pdfs/PBSUserGuide13.0.pdf). Of course, we only found this out last Friday, when production on the Anselm supercomputer broke (https://docs.it4i.cz/anselm/job-submission-and-execution/).
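The difference between the two request styles can be sketched as below. This is only an illustration, not the adaptor's actual code: the function name `resource_directive` and its parameters are hypothetical, and mapping one node to one chunk is an assumption about how such a request might be translated.

```python
def resource_directive(nodes, cores_per_node, mem=None, use_select=True):
    """Render a '#PBS -l' line in either the select or the nodes/ppn style.

    Hypothetical helper for illustration; not part of SAGA.
    """
    if nodes < 1 or cores_per_node < 1:
        raise ValueError("nodes and cores_per_node must be positive")

    if use_select:
        # Style reported for PBS Pro 13: select=<chunks>:ncpus=<n>[:mem=<size>]
        spec = "select=%d:ncpus=%d" % (nodes, cores_per_node)
        if mem:
            spec += ":mem=%s" % mem
    else:
        # Older style quoted in this report; mem is not rendered in this sketch
        spec = "nodes=%d:ppn=%d" % (nodes, cores_per_node)

    return "#PBS -l " + spec


if __name__ == "__main__":
    print(resource_directive(4, 16, use_select=False))
    print(resource_directive(1, 3, mem="1gb"))
```

With `use_select=False` the helper reproduces the old line from this report, and with `mem="1gb"` it reproduces the new one.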

May I ask you for a patch, so we can get back to production?

Also, on Titan we get the following on a login node:

qstat --version
Version: 5.1.0.h1
Commit: c276f6e02a272546ebb4a0c7e744db9c7226697c

and on DTN nodes (recently upgraded):

qstat --version
Version: 6.0.2
Commit: d9a34839a0f975d5c487bbfcf5dcb10b6a8f1e79

So the fork for Titan should be tuned a bit as well.

Cheers, Danila

marksantcroos commented 7 years ago

Hi Danila,

> It looks like the PBS Pro notation changed a little again in version 13: "#PBS -l nodes=4:ppn=16" is gone, and the request now looks like "#PBS -l select=1:ncpus=3:mem=1gb" (http://www.pbsworks.com/pdfs/PBSUserGuide13.0.pdf). Of course, we only found this out last Friday, when production on the Anselm supercomputer broke (https://docs.it4i.cz/anselm/job-submission-and-execution/).

These changes are generally not version based, but configuration based and site specific.

Is Anselm a TORQUE or PBS Pro installation? And did you mean that you have been using it in the past successfully with SAGA?

> Also, on Titan we got on login node:
>
> qstat --version
> Version: 5.1.0.h1
> Commit: c276f6e02a272546ebb4a0c7e744db9c7226697c
>
> and on DTN nodes (recently upgraded):
>
> qstat --version
> Version: 6.0.2
> Commit: d9a34839a0f975d5c487bbfcf5dcb10b6a8f1e79

Titan is a TORQUE installation, not PBS Pro; I believe the same goes for the DTNs.

What do the commits refer to?

Thanks

Mark

DanilaOleynik commented 7 years ago

Hi Mark,

Well, it would be great to know how to provide the configuration in this case :-)

Anselm is a PBS Pro installation. The current pbspro adaptor worked well before the recent infrastructure upgrade at IT4I.

OLCF (Titan etc.) uses TORQUE, but it looks like a site-specific build. They didn't change the PBS dialect when they changed versions. Meanwhile, the environment on the Titan login nodes allows them to be identified as Cray, whereas the DTNs are a slightly different infrastructure (without the Cray specifics) that can still submit jobs to Titan.

As far as I know, the SAGA pbspro adaptor works well for both PBS Pro and TORQUE.

I don't really get the question about the commits.

Cheers, Danila

marksantcroos commented 7 years ago

> Well, it would be great to know how to provide the configuration in this case :-)

What I meant is that the version alone doesn't dictate the configuration, and that the configuration is hard to discover. That's why we ended up making hardcoded entries for some sites.

> Anselm is a PBS Pro installation. The current pbspro adaptor worked well before the recent infrastructure upgrade at IT4I.

Ok, so we would need to look into creating a specific entry for Anselm too then. (I changed the subject accordingly)

> As far as I know, the SAGA pbspro adaptor works well for both PBS Pro and TORQUE.

We used to have one PBS adaptor (pbs://), but that one is deprecated. We now have a TORQUE (torque://) and PBS Pro (pbspro://) adaptor. We use the TORQUE adaptor for Titan.
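As a rough illustration of how the URL scheme selects the adaptor, here is a minimal sketch. The `ADAPTORS` table and the `adaptor_for` function are hypothetical and are not SAGA's actual lookup code; only the three schemes themselves come from this thread.

```python
from urllib.parse import urlparse

# Schemes named in this thread; the descriptions are illustrative only.
ADAPTORS = {
    "torque": "TORQUE adaptor",
    "pbspro": "PBS Pro adaptor",
    "pbs": "deprecated PBS adaptor",
}


def adaptor_for(url):
    """Return a description of the adaptor that a job service URL selects."""
    scheme = urlparse(url).scheme
    try:
        return ADAPTORS[scheme]
    except KeyError:
        raise ValueError("no adaptor known for scheme %r" % scheme)


if __name__ == "__main__":
    print(adaptor_for("pbspro://localhost"))
```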

> I don't really get the question about the commits.

There were commit IDs in your initial message, but by now I understand where they come from :-)

DanilaOleynik commented 7 years ago

On 10 Apr 2017, at 20:54, Mark Santcroos notifications@github.com wrote:

>> Well, it would be great to know how to provide the configuration in this case :-)

> What I meant is that the version alone doesn't dictate the configuration, and that the configuration is hard to discover. That's why we ended up making hardcoded entries for some sites.

Oh dear. It's one of the first things I check after a crash that follows an infrastructure update :-)

>> Anselm is a PBS Pro installation. The current pbspro adaptor worked well before the recent infrastructure upgrade at IT4I.

> Ok, so we would need to look into creating a specific entry for Anselm too then. (I changed the subject accordingly)

Any name will work for me. Meanwhile, here is the version reported on Anselm:

[doleynik@login2.anselm anselm]$ qstat --version
pbs_version = PBSPro_13.1.1.162303
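The two `qstat --version` formats quoted in this thread can be told apart with a small parser. This is a sketch under stated assumptions: the function name `parse_qstat_version` is hypothetical, and treating a bare "Version:" line as TORQUE rests only on the Titan output shown above. As noted earlier in the thread, the version alone does not determine the site configuration.

```python
import re


def parse_qstat_version(output):
    """Guess (flavor, version) from the text printed by `qstat --version`."""
    # PBS Pro format seen on Anselm: "pbs_version = PBSPro_13.1.1.162303"
    match = re.search(r"pbs_version\s*=\s*PBSPro_(\S+)", output)
    if match:
        return ("pbspro", match.group(1))

    # Format seen on Titan (a TORQUE installation): "Version: 5.1.0.h1"
    match = re.search(r"^Version:\s*(\S+)", output, re.MULTILINE)
    if match:
        return ("torque", match.group(1))

    return ("unknown", None)


if __name__ == "__main__":
    print(parse_qstat_version("pbs_version = PBSPro_13.1.1.162303"))
```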

>> As far as I know, the SAGA pbspro adaptor works well for both PBS Pro and TORQUE.

> We used to have one PBS adaptor (pbs://), but that one is deprecated. We now have a TORQUE (torque://) and PBS Pro (pbspro://) adaptor. We use the TORQUE adaptor for Titan.

Ok, I will keep that in mind. Thanks for pointing it out.

Thanks, Danila

marksantcroos commented 7 years ago

> Meanwhile, here is the version reported on Anselm:
>
> [doleynik@login2.anselm anselm]$ qstat --version
> pbs_version = PBSPro_13.1.1.162303

What is the job service url that you use? (To be sure that we match it)

DanilaOleynik commented 7 years ago

On 11 Apr 2017, at 8:48, Mark Santcroos notifications@github.com wrote:

>> Meanwhile, here is the version reported on Anselm:
>>
>> [doleynik@login2.anselm anselm]$ qstat --version
>> pbs_version = PBSPro_13.1.1.162303

> What is the job service url that you use? (To be sure that we match it)

Here is what it looks like on Anselm: saga.job.Service("pbspro://localhost")

Cheers, Danila

marksantcroos commented 7 years ago

So we need something along the lines of: if 'anselm' in url.host or 'anselm' in os.uname()[1]:
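A self-contained version of that check might look like the sketch below. The helper name `is_site` and its `hostname` parameter are hypothetical and added only so the check can be exercised without being on the machine; this is not the code that went into the adaptor.

```python
import os
from urllib.parse import urlparse


def is_site(name, service_url, hostname=None):
    """True if `name` appears in the URL host or in the local host name."""
    if hostname is None:
        hostname = os.uname()[1]  # node name of the local machine
    url_host = urlparse(service_url).hostname or ""
    return name in url_host.lower() or name in hostname.lower()


if __name__ == "__main__":
    # With a localhost URL, only the local host name can match.
    print(is_site("anselm", "pbspro://localhost", hostname="login2.anselm"))
```

Since the job service URL used on Anselm is pbspro://localhost, the URL host alone would not identify the site; the local host name (login2.anselm in the prompt shown earlier) is what would match.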

DanilaOleynik commented 7 years ago

According to the PBSPro 13 documentation (http://www.pbsworks.com/pdfs/PBSUserGuide13.0.pdf), these changes will work for any regular PBSPro installation, so I don't think a restriction by domain will be needed.

Cheers, Danila

On 11 Apr 2017, at 11:15, Mark Santcroos notifications@github.com wrote:

> So we need something along the lines of: if 'anselm' in url.host or 'anselm' in os.uname()[1]:

ibethune commented 7 years ago

Mark, do you have the bandwidth to implement the new configuration? Otherwise I think we have to leave it as a lower priority, behind the ongoing stability work in 0.46.

andre-merzky commented 7 years ago

Mark and I discussed this briefly, and agreed that the ticket should really be on my table. I just did not get around to it yet :/

ibethune commented 7 years ago

OK, so fix when you have time, it may or may not make it into 0.46.

DanilaOleynik commented 7 years ago

Hello,

Is it possible to get an estimate for these fixes? This is really blocking our deployment.

Cheers, Danila

andre-merzky commented 7 years ago

Dear Danila,

I'll get back to you with an estimate later today.

Best, Andre.

andre-merzky commented 7 years ago

Danila, could you please give the branch fix/issue_629 a try? Let's hope it is as simple as that... :)

DanilaOleynik commented 7 years ago

Hi Andre, sorry for the late response. The fix works well on Anselm.

andre-merzky commented 7 years ago

Great - this goes into the next release then!