pyiron / pyiron_base

Core components of the pyiron integrated development environment (IDE) for computational materials science
https://pyiron-base.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Cannot (re)start SPHInX jobs on cluster #1654

Closed skatnagallu closed 6 days ago

skatnagallu commented 1 week ago

I am trying to restart some jobs from another SPHInX job. The jobs initialise successfully and are submitted to the cluster, but they are aborted soon afterwards. The status of the jobs is also not updated and remains "submitted". This is what the error.out file contains:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_mpie_cmti_2024-09-02/lib/python3.11/site-packages/pyiron_base/cli/__main__.py", line 3, in <module>
    main()
  File "/cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_mpie_cmti_2024-09-02/lib/python3.11/site-packages/pyiron_base/cli/control.py", line 61, in main
    args.cli(args)
  File "/cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_mpie_cmti_2024-09-02/lib/python3.11/site-packages/pyiron_base/cli/wrapper.py", line 37, in main
    job_wrapper_function(
  File "/cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_mpie_cmti_2024-09-02/lib/python3.11/site-packages/pyiron_base/jobs/job/wrapper.py", line 162, in job_wrapper_function
    job = JobWrapper(
          ^^^^^^^^^^^
  File "/cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_mpie_cmti_2024-09-02/lib/python3.11/site-packages/pyiron_base/jobs/job/wrapper.py", line 65, in __init__
    self.job = pr.load(int(job_id))
               ^^^^^^^^^^^^^^^^^^^^
  File "/cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_mpie_cmti_2024-09-02/lib/python3.11/site-packages/pyiron_base/project/jobloader.py", line 105, in __call__
    return super().__call__(job_specifier, convert_to_object=convert_to_object)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_mpie_cmti_2024-09-02/lib/python3.11/site-packages/pyiron_base/project/jobloader.py", line 76, in __call__
    return self._project.load_from_jobpath(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_mpie_cmti_2024-09-02/lib/python3.11/site-packages/pyiron_base/project/generic.py", line 1141, in load_from_jobpath
    job = job.to_object()
          ^^^^^^^^^^^^^^^
  File "/cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_mpie_cmti_2024-09-02/lib/python3.11/site-packages/pyiron_base/jobs/job/core.py", line 597, in to_object
    return self.project_hdf5.to_object(object_type, **qwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_mpie_cmti_2024-09-02/lib/python3.11/site-packages/pyiron_base/storage/hdfio.py", line 1162, in to_object
    return _to_object(self, class_name, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_mpie_cmti_2024-09-02/lib/python3.11/site-packages/pyiron_base/storage/hdfio.py", line 158, in _to_object
    obj.from_hdf(hdf=hdf.open(".."), group_name=hdf.h5_path.split("/")[-1])
  File "/cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_mpie_cmti_2024-09-02/lib/python3.11/site-packages/pyiron_atomistics/sphinx/base.py", line 877, in from_hdf
    super(SphinxBase, self).from_hdf(hdf=hdf, group_name=group_name)
  File "/cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_mpie_cmti_2024-09-02/lib/python3.11/site-packages/pyiron_base/jobs/job/interactive.py", line 364, in from_hdf
    super(InteractiveBase, self).from_hdf(hdf=hdf, group_name=group_name)
  File "/cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_mpie_cmti_2024-09-02/lib/python3.11/site-packages/pyiron_base/jobs/job/generic.py", line 1245, in from_hdf
    self.from_dict(job_dict=job_dict)
  File "/cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_mpie_cmti_2024-09-02/lib/python3.11/site-packages/pyiron_atomistics/atomistics/job/atomistic.py", line 301, in from_dict
    super().from_dict(job_dict=job_dict)
  File "/cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_mpie_cmti_2024-09-02/lib/python3.11/site-packages/pyiron_base/jobs/job/generic.py", line 1181, in from_dict
    self._server.from_dict(server_dict=job_dict["server"])
  File "/cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_mpie_cmti_2024-09-02/lib/python3.11/site-packages/pyiron_base/jobs/job/extension/server/generic.py", line 606, in from_dict
    self._new_hdf = server_dict["new_h5"] == 1
                    ~~~~~~~~~~~^^^^^^^^^^
KeyError: 'new_h5'

I am not sure what the issue is.

jan-janssen commented 1 week ago

Can you give an estimate of how old these calculations are? Can you load one of these jobs in inspect mode?

job = pr.inspect(<job_id>)
print(job["server"])
skatnagallu commented 1 week ago

Here is the output. The parent job is 12 days old, but I don't see that in the output.

{'user': 'skatnagallu', 'host': 'cmti001', 'run_mode': 'queue', 'cores': 40, 'threads': 1, 'new_hdf': True, 'accept_crash': False, 'additional_arguments': {}, 'gpus': None, 'run_time': None, 'memory_limit': None, 'queue': 'cmti', 'qid': 9018591, 'conda_environment_name': None, 'conda_environment_path': None, 'NAME': 'Server', 'TYPE': "<class 'pyiron_base.jobs.job.extension.server.generic.Server'>", 'OBJECT': 'Server', 'DICT_VERSION': '0.1.0'}

jan-janssen commented 1 week ago

Ok, so the key was renamed from new_h5 to new_hdf. This was changed in https://github.com/pyiron/pyiron_base/pull/1578. The strange part is that new_hdf is the new name, so you created the job with a newer version of pyiron and are now trying to start it with an older version. Can you try restarting your Jupyter server? For some reason your notebook uses a more recent kernel than the environment used to execute the job.
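For illustration only, a minimal sketch of how such a rename could be read in a backwards-compatible way when restoring the server dictionary (a hypothetical helper, not the actual code in pyiron_base):

# Hypothetical helper: accept both the old key ("new_h5") and the renamed key ("new_hdf").
def read_new_hdf_flag(server_dict, default=True):
    for key in ("new_hdf", "new_h5"):
        if key in server_dict:
            return server_dict[key] == 1
    return default  # assumed fallback when neither key is stored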

jan-janssen commented 1 week ago

@niklassiemer Is this a known issue? For some reason the job is created with a more recent kernel than the one it is executed with after submission to the queuing system.

niklassiemer commented 1 week ago

I was not aware of such an issue so far. I will only be able to look into it next week, though...

niklassiemer commented 1 week ago

One thing: according to the stack trace, the kernel that tried to load the calculation is from 02.09., i.e. about two weeks old and thus older than the kernel that created the parent job '12 days ago'. Which kernel did you use to run the notebook?
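As a quick check (just a sketch, nothing specific to this cluster), the notebook kernel can report which interpreter and pyiron_base version it is actually running:

import sys
import pyiron_base

print(sys.executable)           # path of the Python interpreter behind the active kernel
print(pyiron_base.__version__)  # pyiron_base version seen by the notebook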

skatnagallu commented 6 days ago

I used the latest kernel to run the notebook. When the restarted job failed, I tried an older kernel, but I had the same issue.

jan-janssen commented 6 days ago

Do you have any additional environment specifications in your ~/.bashrc? And just to make sure the issue is reproducible, can you share a minimal example that reproduces it?

skatnagallu commented 6 days ago

So restarting the JupyterHub has solved the issue :). Yes, I do have additional environment specifications, but now that I have restarted the JupyterHub I don't think I can reproduce the issue.

niklassiemer commented 6 days ago

Thanks for the information! I will investigate the configuration again to make sure that latest really is latest at all times.

niklassiemer commented 6 days ago

PS: to reproduce the issue, use the kernel from 02.09.24; that should produce the error. The main issue seems to be that you started the calculation with a newer kernel than the one used for loading, and I really wonder how that happened if you used latest all the time... Do you use cmti001 and cmti002? If there is a problem with latest being out of sync (which I currently assume), it could differ between the JupyterHubs?!
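As a rough sketch of such a check (using inspect mode as above, so the failing from_dict path is not triggered), one can look at which server key the stored job actually contains:

job = pr.inspect(job_id)  # job_id: ID of the affected job, as in the snippet above
server_dict = job["server"]
print("new_h5" in server_dict, "new_hdf" in server_dict)
# The 02.09.24 kernel still expects "new_h5"; jobs written by newer pyiron versions only store "new_hdf".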

skatnagallu commented 6 days ago

I am using cmti001. I used the kernel from 02.09.24 and now I can't load the old parent job; it gives the same error.

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[14], line 1
----> 1 job_1m12 = pr.load('wf_Al5Mn8_1m12_5layer')
      2 job_m112 = pr.load('wf_Al5Mn8_m112_5layer')

File /cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_mpie_cmti_2024-09-02/lib/python3.11/site-packages/pyiron_base/project/jobloader.py:105, in JobLoader.__call__(self, job_specifier, convert_to_object)
    104 def __call__(self, job_specifier, convert_to_object=None) -> GenericJob:
--> 105     return super().__call__(job_specifier, convert_to_object=convert_to_object)

File /cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_mpie_cmti_2024-09-02/lib/python3.11/site-packages/pyiron_base/project/jobloader.py:76, in _JobByAttribute.__call__(self, job_specifier, convert_to_object)
     72     state.logger.warning(
     73         f"Job '{job_specifier}' does not exist and cannot be loaded"
     74     )
     75     return None
---> 76 return self._project.load_from_jobpath(
     77     job_id=job_id,
     78     convert_to_object=(
     79         convert_to_object
     80         if convert_to_object is not None
     81         else self.convert_to_object
     82     ),
     83 )

File /cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_mpie_cmti_2024-09-02/lib/python3.11/site-packages/pyiron_atomistics/project.py:333, in Project.load_from_jobpath(self, job_id, db_entry, convert_to_object)
    319 def load_from_jobpath(self, job_id=None, db_entry=None, convert_to_object=True):
    320     """
    321     Internal function to load an existing job either based on the job ID or based on the database entry dictionary.
    322 
   (...)
    331         GenericJob, JobCore: Either the full GenericJob object or just a reduced JobCore object
    332     """
--> 333     job = super(Project, self).load_from_jobpath(
    334         job_id=job_id, db_entry=db_entry, convert_to_object=convert_to_object
    335     )
    336     job.project_hdf5._project = self.__class__(path=job.project_hdf5.file_path)
    337     return job

File /cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_mpie_cmti_2024-09-02/lib/python3.11/site-packages/pyiron_base/project/generic.py:1141, in Project.load_from_jobpath(self, job_id, db_entry, convert_to_object)
   1139 job = JobPath.from_job_id(db=self.db, job_id=job_id)
   1140 if convert_to_object:
-> 1141     job = job.to_object()
   1142     job.reset_job_id(job_id=job_id)
   1143     job.set_input_to_read_only()

File /cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_mpie_cmti_2024-09-02/lib/python3.11/site-packages/pyiron_base/jobs/job/core.py:597, in JobCore.to_object(self, object_type, **qwargs)
    591 if self.project_hdf5.is_empty:
    592     raise ValueError(
    593         'The HDF5 file of this job with the job_name: "'
    594         + self.job_name
    595         + '" is empty, so it can not be loaded.'
    596     )
--> 597 return self.project_hdf5.to_object(object_type, **qwargs)

File /cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_mpie_cmti_2024-09-02/lib/python3.11/site-packages/pyiron_base/storage/hdfio.py:1162, in ProjectHDFio.to_object(self, class_name, **kwargs)
   1148 def to_object(self, class_name=None, **kwargs):
   1149     """
   1150     Load the full pyiron object from an HDF5 file
   1151 
   (...)
   1160         pyiron object of the given class_name
   1161     """
-> 1162     return _to_object(self, class_name, **kwargs)

File /cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_mpie_cmti_2024-09-02/lib/python3.11/site-packages/pyiron_base/storage/hdfio.py:158, in _to_object(hdf, class_name, **kwargs)
    155 init_args.update(kwargs)
    157 obj = class_object(**init_args)
--> 158 obj.from_hdf(hdf=hdf.open(".."), group_name=hdf.h5_path.split("/")[-1])
    159 if static_isinstance(obj=obj, obj_type="pyiron_base.jobs.job.generic.GenericJob"):
    160     module_name = module_path.split(".")[0]

File /cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_mpie_cmti_2024-09-02/lib/python3.11/site-packages/pyiron_atomistics/sphinx/base.py:877, in SphinxBase.from_hdf(self, hdf, group_name)
    875         self.input[k] = gp[k]
    876 elif self._hdf5["HDF_VERSION"] == "0.1.0":
--> 877     super(SphinxBase, self).from_hdf(hdf=hdf, group_name=group_name)
    878     self._structure_from_hdf()
    879     with self._hdf5.open("input") as hdf:

File /cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_mpie_cmti_2024-09-02/lib/python3.11/site-packages/pyiron_base/jobs/job/interactive.py:364, in InteractiveBase.from_hdf(self, hdf, group_name)
    356 def from_hdf(self, hdf=None, group_name=None):
    357     """
    358     Restore the InteractiveBase object in the HDF5 File
    359 
   (...)
    362         group_name (str): HDF5 subgroup name - optional
    363     """
--> 364     super(InteractiveBase, self).from_hdf(hdf=hdf, group_name=group_name)
    365     with self.project_hdf5.open("input") as hdf5_input:
    366         if "interactive" in hdf5_input.list_nodes():

File /cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_mpie_cmti_2024-09-02/lib/python3.11/site-packages/pyiron_base/jobs/job/generic.py:1245, in GenericJob.from_hdf(self, hdf, group_name)
   1243     exe_dict["READ_ONLY"] = self._hdf5["executable/executable/READ_ONLY"]
   1244     job_dict["executable"] = {"executable": exe_dict}
-> 1245 self.from_dict(job_dict=job_dict)

File /cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_mpie_cmti_2024-09-02/lib/python3.11/site-packages/pyiron_atomistics/atomistics/job/atomistic.py:301, in AtomisticGenericJob.from_dict(self, job_dict)
    300 def from_dict(self, job_dict):
--> 301     super().from_dict(job_dict=job_dict)
    302     self._generic_input.from_dict(obj_dict=job_dict["input"]["generic"])

File /cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_mpie_cmti_2024-09-02/lib/python3.11/site-packages/pyiron_base/jobs/job/generic.py:1181, in GenericJob.from_dict(self, job_dict)
   1179 if "import_directory" in job_dict.keys():
   1180     self._import_directory = job_dict["import_directory"]
-> 1181 self._server.from_dict(server_dict=job_dict["server"])
   1182 if "executable" in job_dict.keys() and job_dict["executable"] is not None:
   1183     self.executable.from_dict(job_dict["executable"])

File /cmmc/ptmp/pyironhb/mambaforge/envs/pyiron_mpie_cmti_2024-09-02/lib/python3.11/site-packages/pyiron_base/jobs/job/extension/server/generic.py:606, in Server.from_dict(self, server_dict)
    604 if "accept_crash" in server_dict.keys():
    605     self._accept_crash = server_dict["accept_crash"] == 1
--> 606 self._new_hdf = server_dict["new_h5"] == 1

KeyError: 'new_h5'
jan-janssen commented 6 days ago

The job was created with a more recent pyiron version and it cannot be loaded with an older version. Can you try to use the latest pyiron version?

skatnagallu commented 6 days ago

Yes, with the latest version there is no issue after I restarted my server; I was able to load the old job and create a new restarted job from it. I was only trying to reproduce the issue by using an older kernel.

niklassiemer commented 6 days ago

Thanks a lot! This issue can be closed since it is not related to the pyiron code (just an old version not being able to load jobs written by a newer version, which is fair). I do not yet understand why the latest kernel was not the expected one and, more concerning, why it seemed to switch to an older version. However, that seems to be a cmti/Jupyter configuration problem.

jan-janssen commented 6 days ago

> So restarting the JupyterHub has solved the issue :). Yes, I do have additional environment specifications, but now that I have restarted the JupyterHub I don't think I can reproduce the issue.

Somehow I missed this part - that is great news.