pyiron / pyiron_atomistics

pyiron_atomistics - an integrated development environment (IDE) for atomistic simulation in computational materials science.
https://pyiron-atomistics.readthedocs.io
BSD 3-Clause "New" or "Revised" License

ImportError: cannot import name 'deprecate' from 'pyiron_base' (/cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_mpie_cmti_2024-07-01/lib/python3.11/site-packages/pyiron_base/__init__.py) #1477

Closed lfzhu-phys closed 3 months ago

lfzhu-phys commented 4 months ago

After restarting the server today, the MTP job can no longer be loaded from the project.

The input is as follows:

j = pr_mtp['fit_24g']
my_potential = j.get_lammps_potential()

The error is as follows:

-----------------------------------
File /cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_mpie_cmti_2024-07-01/lib/python3.11/site-packages/pyiron_potentialfit/mlip/mlip.py:21
     13 from pyiron_base import (
     14     state,
     15     GenericParameters,
   (...)
     18     FlattenedStorage,
     19 )
     20 from pyiron_atomistics import Atoms
---> 21 from pyiron_potentialfit.ml.potentialfit import PotentialFit
     22 from pyiron_potentialfit.mlip.cfgs import (
     23     savecfgs,
     24     loadcfgs,
     25     Cfg,
     26     load_grades_ids_and_timesteps,
     27 )
     28 from pyiron_potentialfit.mlip.potential import MtpPotential

File /cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_mpie_cmti_2024-07-01/lib/python3.11/site-packages/pyiron_potentialfit/ml/potentialfit.py:15
     12 import numpy as np
     14 from pyiron_base import FlattenedStorage
---> 15 from pyiron_potentialfit.atomistics.job.trainingcontainer import (
     16     TrainingContainer,
     17     TrainingStorage,
     18 )
     21 class PotentialFit(abc.ABC):
     22     """
     23     Abstract mixin that defines a general interface to potential fitting codes.
     24 
   (...)
     30     predicted data on them after the fit.
     31     """

File /cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_mpie_cmti_2024-07-01/lib/python3.11/site-packages/pyiron_potentialfit/atomistics/job/__init__.py:1
----> 1 from .trainingcontainer import TrainingContainer, TrainingStorage

File /cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_mpie_cmti_2024-07-01/lib/python3.11/site-packages/pyiron_potentialfit/atomistics/job/trainingcontainer.py:44
     39 from pyiron_atomistics.atomistics.structure.structurestorage import (
     40     StructureStorage,
     41     StructurePlots,
     42 )
     43 from pyiron_atomistics.atomistics.structure.neighbors import NeighborsTrajectory
---> 44 from pyiron_base import GenericJob, DataContainer, deprecate
     47 class TrainingContainer(GenericJob, HasStructure):
     48     """
     49     Stores ASE structures with energies and forces.
     50     """

ImportError: cannot import name 'deprecate' from 'pyiron_base' (/cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_mpie_cmti_2024-07-01/lib/python3.11/site-packages/pyiron_base/__init__.py)
-----------------------------------------------------------------------
niklassiemer commented 4 months ago

If you use the kernel from last week, does it work? Just to see which change happened in between. I assume it is due to our move of deprecate to the pyiron_snippets repo and missing tests on the training container...
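
A minimal compatibility-import sketch, assuming the decorator now lives in pyiron_snippets.deprecate (the repo mentioned above) while older environments still re-export it from pyiron_base:

try:
    from pyiron_snippets.deprecate import deprecate
except ImportError:
    # environments that predate the move still ship the decorator in pyiron_base
    from pyiron_base import deprecate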

lfzhu-phys commented 4 months ago

It was working yesterday. Today I restarted the server, and now it no longer works.

niklassiemer commented 4 months ago

Yes, but the kernels from the previous weeks are also still available on the JupyterHub. So I want to see whether it works with the 24.06 kernel or not. Otherwise it might be related to the switch from the hand-maintained environment (that kernel is also there, named old_latest) to the new, automatically built one.

lfzhu-phys commented 4 months ago

Ah, I see your point. After restarting the server, the kernel automatically uses Python 3.11. I just checked the 3.10 kernel, and the error above disappears. Thanks!

lfzhu-phys commented 4 months ago

By the way, a short question about submitting jobs to cmmg: ham.server.queue = "cmmg" is not working. Should I change the command? Thanks a lot in advance.

niklassiemer commented 4 months ago

Oh, there was a problem with the first global config for cmmg. I might need to update the resources. I will try to check this evening.

lfzhu-phys commented 4 months ago

@niklassiemer Thanks a lot.

niklassiemer commented 4 months ago

Yes, a git pull on the resources changed the cmmg slurm script template. I hope queue=cmmg is now working!

lfzhu-phys commented 4 months ago

@niklassiemer I just tested it; unfortunately, it still doesn't work. I've added the error message below:

KeyError                                  Traceback (most recent call last)
Cell In[14], line 49
     47 ham.set_kpoints(mesh=kmesh)
     48 ham.input.potcar["xc"] = xc
---> 49 ham.server.queue = "cmmg"
     50 ham.server.cores = 256
     51 #ham.server.run_mode.manual = True
     52 #ham.run()
     53 #shutil.copy(os.path.join(potcar_path, "POTCAR"), ham.working_directory)
     54 #ham.server.run_mode.queue = True
     55 #ham.status.created = True

File /cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_mpie_cmti_2024-07-01/lib/python3.11/site-packages/pyiron_base/interfaces/lockable.py:63, in sentinel.<locals>.dispatch_or_error(self, *args, **kwargs)
     58 elif self.read_only and method == "warning":
     59     warnings.warn(
     60         f"{meth.__name__} called on {type(self)}, but object is locked!",
     61         category=LockedWarning,
     62     )
---> 63 return meth(self, *args, **kwargs)

File /cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_mpie_cmti_2024-07-01/lib/python3.11/site-packages/pyiron_base/jobs/job/extension/server/generic.py:215, in Server.queue(self, new_scheduler)
    209 else:
    210     if state.queue_adapter is not None:
    211         (
    212             cores,
    213             run_time_max,
    214             memory_max,
--> 215         ) = state.queue_adapter.check_queue_parameters(
    216             queue=new_scheduler,
    217             cores=self.cores,
    218             run_time_max=self.run_time,
    219             memory_max=self.memory_limit,
    220         )
    221         if self.cores is not None and cores != self.cores:
    222             self._cores = cores

File /cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_mpie_cmti_2024-07-01/lib/python3.11/site-packages/pysqa/queueadapter.py:311, in QueueAdapter.check_queue_parameters(self, queue, cores, run_time_max, memory_max, active_queue)
    291 def check_queue_parameters(
    292     self,
    293     queue: str,
   (...)
    297     active_queue: Optional[dict] = None,
    298 ):
    299     """
    300 
    301     Args:
   (...)
    309         list: [cores, run_time_max, memory_max]
    310     """
--> 311     return self._adapter.check_queue_parameters(
    312         queue=queue,
    313         cores=cores,
    314         run_time_max=run_time_max,
    315         memory_max=memory_max,
    316         active_queue=active_queue,
    317     )

File /cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_mpie_cmti_2024-07-01/lib/python3.11/site-packages/pysqa/utils/basic.py:335, in BasisQueueAdapter.check_queue_parameters(self, queue, cores, run_time_max, memory_max, active_queue)
    322 """
    323 
    324 Args:
   (...)
    332     list: [cores, run_time_max, memory_max]
    333 """
    334 if active_queue is None:
--> 335     active_queue = self._config["queues"][queue]
    336 cores = self._value_in_range(
    337     value=cores,
    338     value_min=active_queue["cores_min"],
    339     value_max=active_queue["cores_max"],
    340 )
    341 run_time_max = self._value_in_range(
    342     value=run_time_max, value_max=active_queue["run_time_max"]
    343 )

KeyError: 'cmmg'
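
The KeyError above just means that the requested queue name is missing from the "queues" section of the configuration the pysqa queue adapter reads. A hedged sketch of the structure the lookup expects, written as a Python dict (the nested keys are taken from the traceback; the concrete values are illustrative):

expected_config = {
    "queues": {
        "cmmg": {                    # the name set via ham.server.queue must appear as a key here
            "cores_min": 1,          # lower core bound checked by _value_in_range
            "cores_max": 256,        # upper core bound
            "run_time_max": 259200,  # maximum walltime (illustrative value)
        },
    },
}
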
pmrv commented 4 months ago

Ah, we had actually called it s_cmmg (for jobs with <= 256 cores) and p_cmmg (for jobs > 256 cores, but best to request multiples of 256), sorry for the confusion.
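
A minimal usage sketch with the renamed queue (queue names from the comment above; queue_list is an assumption about the pysqa adapter exposed through pyiron_base's state object):

from pyiron_base import state

print(state.queue_adapter.queue_list)  # assumed helper listing the configured queue names
ham.server.queue = "s_cmmg"            # jobs with <= 256 cores
ham.server.cores = 256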

lfzhu-phys commented 4 months ago

Great, thanks a lot. I just tried it out. It works now:))

lfzhu-phys commented 4 months ago

Hmm, but it immediately crashed with the following error:

Abort(1091471) on node 63 (rank 63 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(703).......: 
MPID_Init(958)..............: 
MPIDI_OFI_mpi_init_hook(883): OFI addrinfo() failed (ofi_init.c:883:MPIDI_OFI_mpi_init_hook:No data available)
In: PMI_Abort(1091471, Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(703).......:
jan-janssen commented 4 months ago

@lfzhu-phys Which executable are you using? What kind of calculation is it?

lfzhu-phys commented 4 months ago

It is a VASP calculation. The executable is automatically set to

"/cmmc/u/system/SLES12/soft/pyiron/dev/pyiron-resources-cmmc/vasp/bin/run_vasp_5.4.4_mpi.sh".

Should I use a VASP executable newly compiled for the new cluster?

niklassiemer commented 4 months ago

Actually, I thought the old executables would still work, just possibly slower than they could be... so I do not yet have a clue about the VASP error. Indeed, I also think we still need to fix the root cause of this issue: using an old kernel works, fortunately, but it should also work with the most recent one.

ahmedabdelkawy commented 3 months ago

Hi Lifang, you can use the 6.4.0 version on cmmg: job.executable = '6.4.0_mpi'

This is what I use, and I am sure it works on cmmg*. I am working on understanding why 5.4.4 does not work, but the cluster is currently too busy to test job submissions.

*I have noticed that it sometimes gives a segmentation error for large jobs (let me know if you face the same issue); I will also work on that.
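
A short sketch of switching a job to the suggested build (available_versions is my assumption about pyiron_base's Executable helper; the version string comes from the comment above):

print(job.executable.available_versions)  # assumed attribute listing the VASP builds found in the resources
job.executable = "6.4.0_mpi"              # select the 6.4.0 MPI build for cmmg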

lfzhu-phys commented 3 months ago

@ahmedabdelkawy Thank you very much. I am trying to submit a job on cmmg using job.executable = '6.4.0_mpi'.

Maybe it helps to debug the problem to look at the full error message from 5.4.4. I've added it below:

Abort(1091471) on node 114 (rank 114 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(703).......: 
MPID_Init(958)..............: 
MPIDI_OFI_mpi_init_hook(883): OFI addrinfo() failed (ofi_init.c:883:MPIDI_OFI_mpi_init_hook:No data available)
In: PMI_Abort(1091471, Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(703).......:
MPID_Init(958)..............:
MPIDI_OFI_mpi_init_hook(883): OFI addrinfo() failed (ofi_init.c:883:MPIDI_OFI_mpi_init_hook:No data available))
slurmstepd: error: *** STEP 8338191.0 ON cmmg010 CANCELLED AT 2024-07-08T10:25:09 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: cmmg010: tasks 0-113,115-255: Killed
srun: Terminating StepId=8338191.0
srun: error: cmmg010: task 114: Exited with exit code 143
jan-janssen commented 3 months ago

@lfzhu-phys Do you have any specific modules loaded in your ~/.bashrc / ~/.bash_profile?

lfzhu-phys commented 3 months ago

Ah, I am using .tcshrc. At the top of the file, it looks like this:

module purge
module load intel/19.1.0
module load impi/2019.6
module load mkl/2020

@jan-janssen Would this be a problem?

jan-janssen commented 3 months ago

I do not know. Can you try commenting those out and see if it has an effect?

lfzhu-phys commented 3 months ago

OK, I will test it.

lfzhu-phys commented 3 months ago

Now I have two updates:

1) Using job.executable = '6.4.0_mpi' on cmmg works well.

2) After commenting out the modules in my .tcshrc, the job started to run, but it crashed with the segmentation error mentioned by @ahmedabdelkawy.

The error message is as follows:

Stack trace terminated abnormally.
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
vasp_std           0000000005DAA15A  Unknown               Unknown  Unknown
libpthread-2.31.s  000015222F888910  Unknown               Unknown  Unknown
libfabric.so.1     000015222DC0D552  fi_param_get          Unknown  Unknown
ahmedabdelkawy commented 3 months ago

I can reproduce the issue. I believe there is an inconsistency between the modules loaded in the resources and module files and the ones used to compile the 5.4.4 version. I am tracing it, but I am also puzzled why it only showed up on cmmg.

lfzhu-phys commented 3 months ago

Update: the above issue is not limited to the new cluster cmmg. Yesterday (8 July) afternoon (3-4 pm), all my VASP jobs on cmfe using vasp 5.4.4 crashed with the same segmentation error message. However, the jobs submitted yesterday morning finished successfully. Was there any change on the cluster yesterday?

ahmedabdelkawy commented 3 months ago

I don't think it is actually cmfe (it has been out of service for a couple of months now). You probably mean the partition, s.cmfe or p.cmfe, which to my understanding also runs on cmti nodes (but that would mean the problem exists on cmti as well, which I don't think is the case). You can find out from SLURM exactly which nodes ran the previous jobs, e.g. sacct -u aabdelkawy -S2024-07-09 --format=User,Jobname,state,elapsed,ncpus,NodeList | grep pi (adjust the user name and date accordingly).

We believe the problem was due to an incompatibility between the Intel MPI library used to compile (and needed to run) the executable and the AMD nodes. I pushed a fix that uses one of the working Intel MPI libraries. Once the changes are merged on the cluster, you can use it on cmmg as well. Please let me know if you face any other problems. I would also suggest moving entirely to the 6.4.0 VASP version, as it is more recent and much easier to maintain.

lfzhu-phys commented 3 months ago

Thanks a lot, @ahmedabdelkawy. I switched to VASP 6.4.0 on cmti. It works well now.

ahmedabdelkawy commented 3 months ago

The resources problem for running VASP 5.4.4 on cmmg is fixed, and the branch is merged to master! I guess we can close this now!