Closed: lfzhu-phys closed this issue 3 months ago.
If you use the kernel from last week, does it work? Just to see which change was in between. I assume it is due to our move of deprecate to the pyiron_snippets repo and missing tests on the training container...
It was working yesterday. Today I restarted the server and then it does not work any more.
Yes, but the kernels from the previous weeks are also still available on the JupyterHub. So I want to see whether it works with the 24.06 kernel or not. Otherwise it might be related to the switch from the hand-maintained environment (that kernel is also there, named old_latest) to the new automatically built one.
Ahh, I see your point. After restarting the server, the kernel automatically uses Python 3.11. I just checked the 3.10 kernel, and the error above disappears. Thanks!
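For reference, a quick way to confirm which interpreter and pyiron installation a kernel is actually using (a minimal diagnostic sketch, not specific to this issue's fix):

import sys
import pyiron_base

# Print the interpreter version and where pyiron_base is installed,
# which makes it easy to tell the 3.10 and 3.11 kernels apart.
print(sys.version)
print(pyiron_base.__version__, pyiron_base.__file__)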
By the way, a short question about submitting jobs to cmmg: ham.server.queue = "cmmg" is not working. Should I use a different command? Thanks a lot in advance.
Oh, there was a problem with the first global config for cmmg. I might need to update the resources. I will try to check this evening.
@niklassiemer Thanks a lot.
Yes, a git pull on the resources changed the cmmg slurm script template. I hope queue=cmmg is now working!
@niklassiemer I just tested it; unfortunately it still doesn't work. I've added the error message below:
KeyError Traceback (most recent call last)
Cell In[14], line 49
47 ham.set_kpoints(mesh=kmesh)
48 ham.input.potcar["xc"] = xc
---> 49 ham.server.queue = "cmmg"
50 ham.server.cores = 256
51 #ham.server.run_mode.manual = True
52 #ham.run()
53 #shutil.copy(os.path.join(potcar_path, "POTCAR"), ham.working_directory)
54 #ham.server.run_mode.queue = True
55 #ham.status.created = True
File /cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_mpie_cmti_2024-07-01/lib/python3.11/site-packages/pyiron_base/interfaces/lockable.py:63, in sentinel.<locals>.dispatch_or_error(self, *args, **kwargs)
58 elif self.read_only and method == "warning":
59 warnings.warn(
60 f"{meth.__name__} called on {type(self)}, but object is locked!",
61 category=LockedWarning,
62 )
---> 63 return meth(self, *args, **kwargs)
File /cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_mpie_cmti_2024-07-01/lib/python3.11/site-packages/pyiron_base/jobs/job/extension/server/generic.py:215, in Server.queue(self, new_scheduler)
209 else:
210 if state.queue_adapter is not None:
211 (
212 cores,
213 run_time_max,
214 memory_max,
--> 215 ) = state.queue_adapter.check_queue_parameters(
216 queue=new_scheduler,
217 cores=self.cores,
218 run_time_max=self.run_time,
219 memory_max=self.memory_limit,
220 )
221 if self.cores is not None and cores != self.cores:
222 self._cores = cores
File /cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_mpie_cmti_2024-07-01/lib/python3.11/site-packages/pysqa/queueadapter.py:311, in QueueAdapter.check_queue_parameters(self, queue, cores, run_time_max, memory_max, active_queue)
291 def check_queue_parameters(
292 self,
293 queue: str,
(...)
297 active_queue: Optional[dict] = None,
298 ):
299 """
300
301 Args:
(...)
309 list: [cores, run_time_max, memory_max]
310 """
--> 311 return self._adapter.check_queue_parameters(
312 queue=queue,
313 cores=cores,
314 run_time_max=run_time_max,
315 memory_max=memory_max,
316 active_queue=active_queue,
317 )
File /cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_mpie_cmti_2024-07-01/lib/python3.11/site-packages/pysqa/utils/basic.py:335, in BasisQueueAdapter.check_queue_parameters(self, queue, cores, run_time_max, memory_max, active_queue)
322 """
323
324 Args:
(...)
332 list: [cores, run_time_max, memory_max]
333 """
334 if active_queue is None:
--> 335 active_queue = self._config["queues"][queue]
336 cores = self._value_in_range(
337 value=cores,
338 value_min=active_queue["cores_min"],
339 value_max=active_queue["cores_max"],
340 )
341 run_time_max = self._value_in_range(
342 value=run_time_max, value_max=active_queue["run_time_max"]
343 )
KeyError: 'cmmg'
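The KeyError means that "cmmg" is not one of the queue names defined in the pysqa configuration shipped with the resources. A minimal sketch for inspecting which queue names are actually defined (the resources path below is an assumption; adjust it to the cluster's pyiron resources directory):

from pysqa import QueueAdapter

# Hypothetical path to the queue configuration inside the pyiron resources;
# adjust to the actual location on the cluster.
qa = QueueAdapter(directory="/path/to/pyiron-resources/queues")
print(qa.queue_list)  # only these names are valid for job.server.queue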
Ah, we had actually called it s_cmmg (for jobs with <= 256 cores) and p_cmmg (for jobs with > 256 cores, but best to request cores in multiples of 256), sorry for the confusion.
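For completeness, a sketch of how this looks on the pyiron side, using the job and queue names from this thread (list_queues is assumed to be available on the server object in this pyiron_base version):

from pyiron_atomistics import Project

pr = Project("cmmg_test")
job = pr.create.job.Vasp("vasp_cmmg")
print(job.server.list_queues())  # should include s_cmmg and p_cmmg

job.server.queue = "s_cmmg"  # queue for jobs with <= 256 cores
job.server.cores = 256       # use p_cmmg and multiples of 256 above that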
Great, thanks a lot. I just tried it out. It works now:))
Hmm, but it immediately crashed with the following error:
Abort(1091471) on node 63 (rank 63 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(703).......:
MPID_Init(958)..............:
MPIDI_OFI_mpi_init_hook(883): OFI addrinfo() failed (ofi_init.c:883:MPIDI_OFI_mpi_init_hook:No data available)
In: PMI_Abort(1091471, Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(703).......:
@lfzhu-phys Which executable are you using? What kind of calculation is it?
It is a VASP calculation. The executable is automatically set to
"/cmmc/u/system/SLES12/soft/pyiron/dev/pyiron-resources-cmmc/vasp/bin/run_vasp_5.4.4_mpi.sh".
Should I use a VASP executable newly compiled for the new cluster?
Actually, I thought the old executables would still work, just possibly slower than they could be... so I do not yet have a clue about the VASP error. Indeed, I also think we need to fix the root cause of this issue after all. Using an old kernel fortunately works, but it should work with the most recent one as well.
Hi Lifang, you can use the 6.4.0 version on cmmg. job.executable = '6.4.0_mpi'
This is what I use and I am sure that it works on cmmg*. I am working on understanding why 5.4.4 does not work, but the cluster is currently too busy to test job submission.
*I notice that it sometimes gives a segmentation fault for large jobs (let me know if you face the same issue); I will also work on that.
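A short sketch of how to check and switch the executable version in pyiron; available_versions is assumed to list the versions defined in the resources for this job type:

# assuming `job` is the VASP job created above
print(job.executable.available_versions)  # e.g. something like ['5.4.4_mpi', '6.4.0_mpi']
job.executable = "6.4.0_mpi"              # version suggested for cmmg
# equivalently: job.executable.version = "6.4.0_mpi"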
@ahmedabdelkawy Thank you very much. I am trying to submit a job on cmmg using job.executable = '6.4.0_mpi'.
Maybe it helps to debug the problem by looking at the full error message when using 5.4.4. I've added it below:
Abort(1091471) on node 114 (rank 114 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(703).......:
MPID_Init(958)..............:
MPIDI_OFI_mpi_init_hook(883): OFI addrinfo() failed (ofi_init.c:883:MPIDI_OFI_mpi_init_hook:No data available)
In: PMI_Abort(1091471, Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(703).......:
MPID_Init(958)..............:
MPIDI_OFI_mpi_init_hook(883): OFI addrinfo() failed (ofi_init.c:883:MPIDI_OFI_mpi_init_hook:No data available))
slurmstepd: error: *** STEP 8338191.0 ON cmmg010 CANCELLED AT 2024-07-08T10:25:09 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: cmmg010: tasks 0-113,115-255: Killed
srun: Terminating StepId=8338191.0
srun: error: cmmg010: task 114: Exited with exit code 143
@lfzhu-phys Do you have any specific modules loaded in your ~/.bashrc / ~/.bash_profile?
Ahh, I am using .tcshrc. At the top of the file, it looks like the following:
module purge
module load intel/19.1.0
module load impi/2019.6
module load mkl/2020
@jan-janssen Would this be a problem?
I do not know. Can you try commenting those out and see if it has an effect?
OK, I will test it.
Now I have two updates:
1) Using job.executable = '6.4.0_mpi' on cmmg works well.
2) After commenting out the modules in my .tcshrc, the job started to run, but crashed with the segmentation fault mentioned by @ahmedabdelkawy.
The error message is as follows:
Stack trace terminated abnormally.
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
vasp_std 0000000005DAA15A Unknown Unknown Unknown
libpthread-2.31.s 000015222F888910 Unknown Unknown Unknown
libfabric.so.1 000015222DC0D552 fi_param_get Unknown Unknown
I can reproduce the issue. I believe there is an inconsistency between the modules loaded by the resources and module files and the ones used to compile the 5.4.4 version. I am tracing it, but I am also puzzled why this only showed up on cmmg.
Update: The above issue is not limited to the new cluster cmmg. Yesterday (8 July) afternoon (3-4pm) all my VASP jobs on cmfe using vasp 5.4.4 crashed with the same segmentation fault message. However, the jobs submitted yesterday morning finished successfully. Was there any change on the cluster yesterday?
I don't think it is actually cmfe (it has been out of service for a couple of months now). You probably mean the partition s.cmfe or p.cmfe, which in my understanding also runs on cmti nodes (but that would mean the problem is also on cmti, which I don't think is the case!). You can find out exactly which nodes ran the previous jobs from Slurm, e.g. sacct -u aabdelkawy -S2024-07-09 --format=User,Jobname,state,elapsed,ncpus,NodeList | grep pi (adjust accordingly).
We believe the problem was due to an incompatibility between the Intel MPI library used to compile and run the executable and the AMD nodes. I pushed a fix that uses one of the working Intel MPI libraries. Once the changes are merged on the cluster, you can use it on cmmg as well. Please let me know if you face any other problems. I would also suggest moving entirely to the 6.4.0 VASP version, as it is more recent and much easier to maintain.
Thanks a lot @ahmedabdelkawy. I switched to VASP 6.4.0 on cmti. It works well now.
The resources problem for running VASP 5.4.4 on cmmg is fixed, and the branch is merged into master! I guess we can close this now!
After restarting the server today, the mtp project cannot be imported.
The input is as follows:
The error is as follows: