Closed: DanilaOleynik closed this issue 7 years ago
Hi Danila,
It looks like the PBS Pro notation changed a little again in version 13: "#PBS -l nodes=4:ppn=16" is gone, and the request now looks like "#PBS -l select=1:ncpus=3:mem=1gb" ( http://www.pbsworks.com/pdfs/PBSUserGuide13.0.pdf ). Of course, we found this out last Friday, when production on the Anselm supercomputer broke (https://docs.it4i.cz/anselm/job-submission-and-execution/).
These changes are generally not version based, but configuration based and site specific.
Is Anselm a TORQUE or PBS Pro installation? And did you mean that you have been using it in the past successfully with SAGA?
Also, on Titan we got on the login node:
qstat --version
Version: 5.1.0.h1 Commit: c276f6e02a272546ebb4a0c7e744db9c7226697c
and on the DTN nodes (recently upgraded):
qstat --version
Version: 6.0.2 Commit: d9a34839a0f975d5c487bbfcf5dcb10b6a8f1e79
Titan is a TORQUE installation, not PBS Pro, and I believe the same goes for the DTNs.
What do the commits refer to?
Thanks
Mark
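As an illustration of the two notations discussed above, the following Python sketch formats a resource request in both styles. The function names are hypothetical and not part of SAGA. Treating `nodes` as the `select` chunk count and `ppn` as `ncpus` is an assumption based only on the two examples quoted in this thread, so it should be checked against the site documentation.

```python
def nodes_ppn_directive(nodes, ppn):
    """Format the older 'nodes=N:ppn=M' request quoted in the report."""
    return "#PBS -l nodes=%d:ppn=%d" % (nodes, ppn)


def select_directive(chunks, ncpus, mem=None):
    """Format a 'select=N:ncpus=M[:mem=X]' request as in the PBS Pro 13 example.

    Assumption: one 'chunk' plays the role of one node and 'ncpus' the role
    of 'ppn'. Sites may configure this differently.
    """
    spec = "select=%d:ncpus=%d" % (chunks, ncpus)
    if mem:
        spec += ":mem=%s" % mem
    return "#PBS -l " + spec


if __name__ == "__main__":
    print(nodes_ppn_directive(4, 16))
    print(select_directive(1, 3, mem="1gb"))
```

Which of the two forms a given site accepts depends on its configuration, as noted above, so neither function should be chosen from the version number alone.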
Hi Mark,
Well, it would be great to know how to provide the configuration in this case :-)
Anselm is a PBS Pro installation. The current pbspro adaptor worked well before the recent infrastructure upgrade at it4i.
OLCF (Titan etc.) uses TORQUE, but it looks like a site-specific build. They didn't change the PBS dialect when the versions changed. Meanwhile, the environment on the Titan login nodes allows it to be identified as Cray, whereas the DTNs are a slightly different infrastructure (without the Cray specifics) that still allows jobs to be sent to Titan.
As far as I know, the SAGA pbspro adaptor works well for both PBS Pro and TORQUE.
I don't really get the question about commits.
Cheers, Danila
Well, it would be great to know how to provide the configuration in this case :-)
What I meant is that the version alone doesn't dictate the configuration, and that the configuration is hard to discover. That's why we ended up making hardcoded entries for some sites.
Anselm is a PBS Pro installation. The current pbspro adaptor worked well before the recent infrastructure upgrade at it4i.
Ok, so we would need to look into creating a specific entry for Anselm too then. (I changed the subject accordingly)
As far as I know, the SAGA pbspro adaptor works well for both PBS Pro and TORQUE.
We used to have one PBS adaptor (pbs://), but that one is deprecated. We now have a TORQUE (torque://) and PBS Pro (pbspro://) adaptor. We use the TORQUE adaptor for Titan.
I don't really get the question about commits.
There were commit IDs in your initial message, but by now I understand where they come from :-)
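The adaptor split described above (a deprecated pbs:// scheme, plus torque:// and pbspro://) can be pictured with a small lookup. This is only a sketch: the function and table names are hypothetical, it is not how SAGA selects adaptors internally, and it handles only the three schemes named in this thread.

```python
import warnings
from urllib.parse import urlparse

# Schemes named in the thread; the labels are for illustration only.
CURRENT_SCHEMES = {"torque": "TORQUE", "pbspro": "PBS Pro"}
DEPRECATED_SCHEMES = {"pbs"}


def batch_system_for(service_url):
    """Return a label for the batch system implied by the URL scheme.

    Returns None (after a DeprecationWarning) for the deprecated scheme and
    raises ValueError for anything this sketch does not know about.
    """
    scheme = urlparse(service_url).scheme
    if scheme in DEPRECATED_SCHEMES:
        warnings.warn(
            "%s:// is deprecated; use torque:// or pbspro://" % scheme,
            DeprecationWarning,
        )
        return None
    if scheme not in CURRENT_SCHEMES:
        raise ValueError("unknown scheme: %r" % scheme)
    return CURRENT_SCHEMES[scheme]
```

For example, `batch_system_for("torque://localhost")` gives the label for the adaptor used on Titan, while a `pbs://` URL only triggers the warning.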
On 10 Apr 2017, at 20:54, Mark Santcroos notifications@github.com wrote:
Well, it would be great to know how to provide the configuration in this case :-)
What I meant is that the version alone doesn't dictate the configuration, and that the configuration is hard to discover. That's why we ended up making hardcoded entries for some sites.
Oh dear. It's one of the first things I check after crashes that follow any infrastructure update :-)
Anselm is a PBS Pro installation. The current pbspro adaptor worked well before the recent infrastructure upgrade at it4i.
Ok, so we would need to look into creating a specific entry for Anselm too then. (I changed the subject accordingly)
Any name will work for me. Meanwhile, here is the version reported on Anselm:
[doleynik@login2.anselm anselm]$ qstat --version
pbs_version = PBSPro_13.1.1.162303
As far as I know, the SAGA pbspro adaptor works well for both PBS Pro and TORQUE.
We used to have one PBS adaptor (pbs://), but that one is deprecated. We now have a TORQUE (torque://) and PBS Pro (pbspro://) adaptor. We use the TORQUE adaptor for Titan
Ok, I will keep it in mind. Thanks for pointing that out.
Thanks, Danila
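The two `qstat --version` output shapes quoted in this thread can be told apart with a small parser. The sketch below is hypothetical and not part of SAGA. It assumes that a `pbs_version = PBSPro_...` line means PBS Pro and that a `Version: ...` line means TORQUE, which is based only on the Anselm and Titan samples above. As noted earlier in the thread, the version does not reveal the site configuration.

```python
import re


def classify_qstat_version(output):
    """Guess the batch system flavour from 'qstat --version' output.

    Returns a (flavour, version) tuple. The patterns cover only the two
    output shapes quoted in this thread; anything else is 'unknown'.
    """
    match = re.search(r"pbs_version\s*=\s*PBSPro_(\S+)", output)
    if match:
        return ("pbspro", match.group(1))
    match = re.search(r"Version:\s*(\S+)", output)
    if match:
        return ("torque", match.group(1))
    return ("unknown", None)


if __name__ == "__main__":
    print(classify_qstat_version("pbs_version = PBSPro_13.1.1.162303"))
    print(classify_qstat_version("Version: 6.0.2 Commit: d9a34839"))
```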
Meanwhile, here is the version reported on Anselm:
[doleynik@login2.anselm anselm]$ qstat --version
pbs_version = PBSPro_13.1.1.162303
What is the job service URL that you use? (To be sure that we match it.)
On 11 Apr 2017, at 8:48, Mark Santcroos notifications@github.com wrote:
Meanwhile, here is the version reported on Anselm:
[doleynik@login2.anselm anselm]$ qstat --version
pbs_version = PBSPro_13.1.1.162303
What is the job service URL that you use? (To be sure that we match it.)
Here is what it looks like on Anselm:
saga.job.Service("pbspro://localhost")
Cheers, Danila
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/radical-cybertools/saga-python/issues/629#issuecomment-293167248, or mute the thread https://github.com/notifications/unsubscribe-auth/AE_O1KnnkKpXWhby7W3-OFIT36sxu4aBks5ruyJbgaJpZM4M45K7.
So we need something along the lines of:
if 'anselm' in url.host or 'anselm' in os.uname()[1]:
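A self-contained version of that check might look like the sketch below. The helper name and its arguments are hypothetical, and it uses `urllib.parse` in place of the adaptor's own URL object, so it is an illustration of the idea rather than the actual fix. The remote host name in the example call is made up.

```python
import os
from urllib.parse import urlparse


def looks_like_site(site, service_url, node_name=None):
    """Return True if 'site' appears in the URL host or the local node name.

    node_name defaults to os.uname()[1], as in the one-liner above. It can
    be passed explicitly, which makes the check easy to test.
    """
    host = urlparse(service_url).hostname or ""
    if node_name is None:
        node_name = os.uname()[1]
    site = site.lower()
    return site in host.lower() or site in node_name.lower()


if __name__ == "__main__":
    # With pbspro://localhost the URL host says nothing about the site,
    # so the node name has to carry the match.
    print(looks_like_site("anselm", "pbspro://localhost", "login2.anselm"))
    # Hypothetical remote URL where the host name carries the match instead.
    print(looks_like_site("anselm", "pbspro://login.anselm.example", "laptop"))
```

Checking both the URL host and the node name matters here because the job service URL reported for Anselm is `pbspro://localhost`, which by itself does not identify the site.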
According to the PBSPro 13 documentation ( http://www.pbsworks.com/pdfs/PBSUserGuide13.0.pdf ), these changes will work for a regular PBSPro installation, so I don't think a restriction by domain will be needed.
Cheers, Danila
On 11 Apr 2017, at 11:15, Mark Santcroos notifications@github.com wrote:
So we need something along the lines of:
if 'anselm' in url.host or 'anselm' in os.uname()[1]:
Mark, do you have the bandwidth to implement the new configuration? Otherwise I think we have to leave it as a lower priority behind the ongoing stability work in 0.46.
Mark and I discussed this briefly and agreed that the ticket should really be on my plate. I just have not gotten around to it yet :/
OK, so fix when you have time, it may or may not make it into 0.46.
Hello,
Is it possible to get an estimate for these fixes? This is really blocking our deployment.
Cheers, Danila
Dear Danila,
I'll get back to you with an estimate later today.
Best, Andre.
Danila, could you please give the branch fix/issue_629 a try? Let's hope it is as simple as that... :)
Hi Andre, sorry for the late response. The fix works well on Anselm.
Great - this goes into the next release then!
Hi Andre,
It looks like the PBS Pro notation changed a little again in version 13: "#PBS -l nodes=4:ppn=16" is gone, and the request now looks like
#PBS -l select=1:ncpus=3:mem=1gb
(http://www.pbsworks.com/pdfs/PBSUserGuide13.0.pdf). Of course, we found this out last Friday, when production on the Anselm supercomputer broke (https://docs.it4i.cz/anselm/job-submission-and-execution/). May I ask you for a patch, so that we can get back to production?
Also, on Titan we got on the login node:
qstat --version
Version: 5.1.0.h1 Commit: c276f6e02a272546ebb4a0c7e744db9c7226697c
and on the DTN nodes (recently upgraded):
qstat --version
Version: 6.0.2 Commit: d9a34839a0f975d5c487bbfcf5dcb10b6a8f1e79
So the fork for Titan should be tuned a bit as well.
Cheers, Danila