Closed: skatnagallu closed this issue 1 month ago.
Can you give an estimate of how old these calculations are? Can you load one of these jobs in inspect mode?
job = pr.inspect(<job_id>)  # pr is the pyiron Project; replace <job_id> with the job's database ID
print(job["server"])
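For the age question, the project's job table also carries timestamps. A minimal sketch, assuming a recent pyiron where pr.job_table() returns a pandas DataFrame with a timestart column:

from pyiron import Project
import pandas as pd

pr = Project(".")  # placeholder path; use the project that holds the jobs
df = pr.job_table()  # pandas DataFrame with one row per job
print(df[["id", "job", "timestart"]])  # timestart is the creation time
print(pd.Timestamp.now() - df["timestart"])  # rough age of each job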
Here is the output. The parent job is 12 days old, but I don't see that in the output.
{'user': 'skatnagallu', 'host': 'cmti001', 'run_mode': 'queue', 'cores': 40, 'threads': 1, 'new_hdf': True, 'accept_crash': False, 'additional_arguments': {}, 'gpus': None, 'run_time': None, 'memory_limit': None, 'queue': 'cmti', 'qid': 9018591, 'conda_environment_name': None, 'conda_environment_path': None, 'NAME': 'Server', 'TYPE': "<class 'pyiron_base.jobs.job.extension.server.generic.Server'>", 'OBJECT': 'Server', 'DICT_VERSION': '0.1.0'}
Ok, so the key was renamed from new_h5 to new_hdf. This was changed in https://github.com/pyiron/pyiron_base/pull/1578 . The strange part is that new_hdf is the new name, so you created a new job with a newer version of pyiron and then tried to start it with an older version. Can you try restarting your Jupyter server? For some reason your notebook uses a more recent kernel than the environment used to execute the job.
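For context, the failing read in the old version only knows the old key name. A minimal sketch of a version-tolerant reader (read_new_hdf_flag is a hypothetical helper for illustration, not pyiron API) would accept both names:

# Hypothetical helper (not pyiron API): read the flag under either name,
# so dictionaries written before and after pyiron/pyiron_base#1578 both load.
def read_new_hdf_flag(server_dict: dict) -> bool:
    if "new_hdf" in server_dict:  # key name used since PR #1578
        return bool(server_dict["new_hdf"])
    return bool(server_dict.get("new_h5", True))  # pre-1578 key name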
@niklassiemer Is this a known issue? For some reason the job is created with a more recent kernel than the one it is executed with after submission to the queuing system.
I was not aware of such an issue so far. I could only look into it next week, though...
One thing: the kernel which tried to load the calculation is from 02.09, according to the stack trace, i.e. about two weeks older than the kernel used '12 days ago'. Which kernel did you use to run the notebook?
I used the latest kernel to run the notebook. When the restart job failed, I tried an older kernel, but I had the same issue.
Do you have any additional environment specifications in your ~/.bashrc? Just to make sure the issue is reproducible, can you share a minimal example to reproduce it?
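For reference, a minimal reproduction along the lines discussed in this thread might look like the following (a hedged sketch: the project and job names are placeholders, and the two halves must run in kernels with different pyiron_base versions):

from pyiron import Project

# --- Part 1: run in a kernel whose pyiron_base includes PR #1578 ---
pr = Project("repro")
job = pr.create.job.Sphinx("demo")  # any job type; the server dict is written with "new_hdf"
job.structure = pr.create.structure.bulk("Al", cubic=True)
job.save()  # writes the job's HDF5 file and database entry

# --- Part 2: run in a kernel with an older pyiron_base ---
pr = Project("repro")
pr.load("demo")  # expected to raise KeyError: 'new_h5'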
So restarting the JupyterHub has solved the issue :). Yes, I do have additional environment specifications. But now, since I restarted the JupyterHub, I don't think I can reproduce the issue.
Thanks for the information! I will investigate the configuration again to make sure that latest really is latest at all times.
PS: to reproduce the issue, use the kernel from 02.09.24; that should produce the error. The main issue seems to be that you started the calculation with a newer kernel than the one used for loading. And I really wonder how, if you used latest all the time... Do you use cmti001 and cmti002? If there is a problem with latest being async (which I currently assume), it could differ between the JupyterHubs?!
I am using cmti001. I used the kernel from 02.09.24. Now I can't load the old parent job; it gives the same error:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Cell In[14], line 1
----> 1 job_1m12 = pr.load('wf_Al5Mn8_1m12_5layer')
2 job_m112 = pr.load('wf_Al5Mn8_m112_5layer')
File /cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_mpie_cmti_2024-09-02/lib/python3.11/site-packages/pyiron_base/project/jobloader.py:105, in JobLoader.__call__(self, job_specifier, convert_to_object)
104 def __call__(self, job_specifier, convert_to_object=None) -> GenericJob:
--> 105 return super().__call__(job_specifier, convert_to_object=convert_to_object)
File /cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_mpie_cmti_2024-09-02/lib/python3.11/site-packages/pyiron_base/project/jobloader.py:76, in _JobByAttribute.__call__(self, job_specifier, convert_to_object)
72 state.logger.warning(
73 f"Job '{job_specifier}' does not exist and cannot be loaded"
74 )
75 return None
---> 76 return self._project.load_from_jobpath(
77 job_id=job_id,
78 convert_to_object=(
79 convert_to_object
80 if convert_to_object is not None
81 else self.convert_to_object
82 ),
83 )
File /cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_mpie_cmti_2024-09-02/lib/python3.11/site-packages/pyiron_atomistics/project.py:333, in Project.load_from_jobpath(self, job_id, db_entry, convert_to_object)
319 def load_from_jobpath(self, job_id=None, db_entry=None, convert_to_object=True):
320 """
321 Internal function to load an existing job either based on the job ID or based on the database entry dictionary.
322
(...)
331 GenericJob, JobCore: Either the full GenericJob object or just a reduced JobCore object
332 """
--> 333 job = super(Project, self).load_from_jobpath(
334 job_id=job_id, db_entry=db_entry, convert_to_object=convert_to_object
335 )
336 job.project_hdf5._project = self.__class__(path=job.project_hdf5.file_path)
337 return job
File /cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_mpie_cmti_2024-09-02/lib/python3.11/site-packages/pyiron_base/project/generic.py:1141, in Project.load_from_jobpath(self, job_id, db_entry, convert_to_object)
1139 job = JobPath.from_job_id(db=self.db, job_id=job_id)
1140 if convert_to_object:
-> 1141 job = job.to_object()
1142 job.reset_job_id(job_id=job_id)
1143 job.set_input_to_read_only()
File /cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_mpie_cmti_2024-09-02/lib/python3.11/site-packages/pyiron_base/jobs/job/core.py:597, in JobCore.to_object(self, object_type, **qwargs)
591 if self.project_hdf5.is_empty:
592 raise ValueError(
593 'The HDF5 file of this job with the job_name: "'
594 + self.job_name
595 + '" is empty, so it can not be loaded.'
596 )
--> 597 return self.project_hdf5.to_object(object_type, **qwargs)
File /cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_mpie_cmti_2024-09-02/lib/python3.11/site-packages/pyiron_base/storage/hdfio.py:1162, in ProjectHDFio.to_object(self, class_name, **kwargs)
1148 def to_object(self, class_name=None, **kwargs):
1149 """
1150 Load the full pyiron object from an HDF5 file
1151
(...)
1160 pyiron object of the given class_name
1161 """
-> 1162 return _to_object(self, class_name, **kwargs)
File /cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_mpie_cmti_2024-09-02/lib/python3.11/site-packages/pyiron_base/storage/hdfio.py:158, in _to_object(hdf, class_name, **kwargs)
155 init_args.update(kwargs)
157 obj = class_object(**init_args)
--> 158 obj.from_hdf(hdf=hdf.open(".."), group_name=hdf.h5_path.split("/")[-1])
159 if static_isinstance(obj=obj, obj_type="pyiron_base.jobs.job.generic.GenericJob"):
160 module_name = module_path.split(".")[0]
File /cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_mpie_cmti_2024-09-02/lib/python3.11/site-packages/pyiron_atomistics/sphinx/base.py:877, in SphinxBase.from_hdf(self, hdf, group_name)
875 self.input[k] = gp[k]
876 elif self._hdf5["HDF_VERSION"] == "0.1.0":
--> 877 super(SphinxBase, self).from_hdf(hdf=hdf, group_name=group_name)
878 self._structure_from_hdf()
879 with self._hdf5.open("input") as hdf:
File /cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_mpie_cmti_2024-09-02/lib/python3.11/site-packages/pyiron_base/jobs/job/interactive.py:364, in InteractiveBase.from_hdf(self, hdf, group_name)
356 def from_hdf(self, hdf=None, group_name=None):
357 """
358 Restore the InteractiveBase object in the HDF5 File
359
(...)
362 group_name (str): HDF5 subgroup name - optional
363 """
--> 364 super(InteractiveBase, self).from_hdf(hdf=hdf, group_name=group_name)
365 with self.project_hdf5.open("input") as hdf5_input:
366 if "interactive" in hdf5_input.list_nodes():
File /cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_mpie_cmti_2024-09-02/lib/python3.11/site-packages/pyiron_base/jobs/job/generic.py:1245, in GenericJob.from_hdf(self, hdf, group_name)
1243 exe_dict["READ_ONLY"] = self._hdf5["executable/executable/READ_ONLY"]
1244 job_dict["executable"] = {"executable": exe_dict}
-> 1245 self.from_dict(job_dict=job_dict)
File /cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_mpie_cmti_2024-09-02/lib/python3.11/site-packages/pyiron_atomistics/atomistics/job/atomistic.py:301, in AtomisticGenericJob.from_dict(self, job_dict)
300 def from_dict(self, job_dict):
--> 301 super().from_dict(job_dict=job_dict)
302 self._generic_input.from_dict(obj_dict=job_dict["input"]["generic"])
File /cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_mpie_cmti_2024-09-02/lib/python3.11/site-packages/pyiron_base/jobs/job/generic.py:1181, in GenericJob.from_dict(self, job_dict)
1179 if "import_directory" in job_dict.keys():
1180 self._import_directory = job_dict["import_directory"]
-> 1181 self._server.from_dict(server_dict=job_dict["server"])
1182 if "executable" in job_dict.keys() and job_dict["executable"] is not None:
1183 self.executable.from_dict(job_dict["executable"])
File /cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_mpie_cmti_2024-09-02/lib/python3.11/site-packages/pyiron_base/jobs/job/extension/server/generic.py:606, in Server.from_dict(self, server_dict)
604 if "accept_crash" in server_dict.keys():
605 self._accept_crash = server_dict["accept_crash"] == 1
--> 606 self._new_hdf = server_dict["new_h5"] == 1
KeyError: 'new_h5'
The job was created with a more recent pyiron version and it cannot be loaded with an older version. Can you try to use the latest pyiron version?
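To check which version a given kernel actually runs, a small sanity check is to print the installed pyiron_base version and the environment the kernel resolves to:

import sys
import pyiron_base

print(pyiron_base.__version__)  # version of pyiron_base in this kernel
print(sys.executable)           # conda environment the kernel points to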
Yes, with the latest version there is no issue after I restarted my server. I was able to load the old job and create a new restarted job from it. I was only trying to reproduce the issue by using an older kernel.
Thanks a lot! This issue may be closed since it is not related to the pyiron code (just the old version not being able to load files written by newer versions, which is fair). I do not yet understand why the latest kernel was not the expected one and, more severely, seemed to change to an older version?! However, that seems to be a cmti/Jupyter configuration problem.
> So restarting the JupyterHub has solved the issue :). Yes, I do have additional environment specifications. But now, since I restarted the JupyterHub, I don't think I can reproduce the issue.
Somehow I missed this part - that is great news.
I am trying to restart some jobs from another SPHInX job. The jobs initialise successfully and are submitted to the cluster; however, soon after, the jobs are aborted. The status of the jobs is also not changed and remains "submitted". This is what the error.out file has.
I am not sure what the issue is.
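For completeness, a hedged sketch of the restart workflow described above (restart() as on pyiron's GenericJob; the parent job name is taken from the traceback in this thread, the new job name is a placeholder):

from pyiron import Project

pr = Project(".")  # project containing the finished parent job
old_job = pr.load("wf_Al5Mn8_1m12_5layer")  # parent job name from this thread
new_job = old_job.restart(job_name="wf_Al5Mn8_1m12_5layer_restart")  # placeholder name
new_job.run()  # submits to the queue when run_mode is "queue"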