pyiron / pyiron_atomistics

pyiron_atomistics - an integrated development environment (IDE) for atomistic simulation in computational materials science.
https://pyiron-atomistics.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Files are not written for remote jobs #1471

Closed Leimeroth closed 3 months ago

Leimeroth commented 4 months ago

When trying to submit Lammps jobs to a remote cluster, only a .h5 file is created, but no input files or working directory. I guess the necessary call to write_input went missing somewhere during the restructuring of the run functions.

EDIT: For VASP it works, so the issue seems to be in the Lammps class.

jan-janssen commented 4 months ago

Can you try to call job.validate_ready_to_run() before submitting the job and check if that solves the issue?

Leimeroth commented 4 months ago

job.validate_ready_to_run() does not seem to change the behavior. Manually doing

os.makedirs(job.working_directory)
job.write_input()

seems to do the job.
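
For completeness, the surrounding setup looks roughly like this (a sketch; project, job, and queue names are placeholders):

import os
from pyiron_atomistics import Project

pr = Project("remote_test")                         # hypothetical project
job = pr.create.job.Lammps("lmp_remote")            # hypothetical job name
job.structure = pr.create.structure.bulk("Al", cubic=True)
job.potential = job.list_potentials()[0]            # any available potential
job.server.queue = "normal"                         # assumed queue name
os.makedirs(job.working_directory, exist_ok=True)   # manual workaround
job.write_input()                                   # manual workaround
job.run()                                           # submits to the remote queue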

Leimeroth commented 4 months ago

For potentials that are manually defined via a dataframe, the write_input_files_from_input_dict functionality breaks the remote setup, because the file path of the potential does not exist on the remote cluster.

Traceback (most recent call last):
  File "/home/hk-project-silicaat/id_uym1602/miniforge3/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/hk-project-silicaat/id_uym1602/miniforge3/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/hk-project-silicaat/id_uym1602/miniforge3/lib/python3.10/site-packages/pyiron_base/cli/__main__.py", line 3, in <module>
    main()
  File "/home/hk-project-silicaat/id_uym1602/miniforge3/lib/python3.10/site-packages/pyiron_base/cli/control.py", line 61, in main
    args.cli(args)
  File "/home/hk-project-silicaat/id_uym1602/miniforge3/lib/python3.10/site-packages/pyiron_base/cli/wrapper.py", line 37, in main
    job_wrapper_function(
  File "/home/hk-project-silicaat/id_uym1602/miniforge3/lib/python3.10/site-packages/pyiron_base/jobs/job/wrapper.py", line 186, in job_wrapper_function
    job.run()
  File "/home/hk-project-silicaat/id_uym1602/miniforge3/lib/python3.10/site-packages/pyiron_base/jobs/job/wrapper.py", line 131, in run
    self.job.run_static()
  File "/home/hk-project-silicaat/id_uym1602/miniforge3/lib/python3.10/site-packages/pyiron_base/jobs/job/generic.py", line 917, in run_static
    execute_job_with_calculate_function(job=self)
  File "/home/hk-project-silicaat/id_uym1602/miniforge3/lib/python3.10/site-packages/pyiron_base/jobs/job/runfunction.py", line 720, in wrapper
    output = func(job)
  File "/home/hk-project-silicaat/id_uym1602/miniforge3/lib/python3.10/site-packages/pyiron_base/jobs/job/runfunction.py", line 978, in execute_job_with_calculate_function
    ) = job.get_calculate_function()(**job.calculate_kwargs)
  File "/home/hk-project-silicaat/id_uym1602/miniforge3/lib/python3.10/site-packages/pyiron_base/jobs/job/runfunction.py", line 135, in __call__
    self.write_input_funct(
  File "/home/hk-project-silicaat/id_uym1602/miniforge3/lib/python3.10/site-packages/pyiron_base/jobs/job/runfunction.py", line 80, in write_input_files_from_input_dict
    shutil.copy(source, os.path.join(working_directory, file_name))
  File "/home/hk-project-silicaat/id_uym1602/miniforge3/lib/python3.10/shutil.py", line 417, in copy
    copyfile(src, dst, follow_symlinks=follow_symlinks)
  File "/home/hk-project-silicaat/id_uym1602/miniforge3/lib/python3.10/shutil.py", line 254, in copyfile
    with open(src, 'rb') as fsrc:
FileNotFoundError: [Errno 2] No such file or directory: '/nfshome/leimeroth/MTP/AlCuZr/Fractions2//10/14//output.14.mtp'
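
For reference, the potential is defined roughly along these lines (a sketch; the Config lines are illustrative and mlip.ini is an assumed MLIP configuration file):

import pandas as pd

# custom MTP potential attached via a dataframe; "Filename" holds the
# absolute path on the local workstation, which is exactly what
# write_input_files_from_input_dict later fails to copy on the remote side
custom_potential = pd.DataFrame(
    {
        "Name": ["AlCuZr_MTP"],
        "Filename": [["/nfshome/leimeroth/MTP/AlCuZr/Fractions2/10/14/output.14.mtp"]],
        "Model": ["Custom"],
        "Species": [["Al", "Cu", "Zr"]],
        "Config": [["pair_style mlip mlip.ini\n", "pair_coeff * *\n"]],
    }
)
job.potential = custom_potential   # job is the Lammps job from above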

Edit: I use https://github.com/pyiron/pyiron_atomistics/tree/workaround-file-copying as a workaround on the remote HPC right now. As far as I understand, the idea of the new workflow is to copy only the HDF5 file and write all necessary files on the remote machine. Is this correct? If yes, I guess it is necessary to somehow make an exception for potentials that are not part of the default data repository. I am also somewhat afraid of issues arising from different pyiron versions/branches when the files are only written on the remote machine.

Leimeroth commented 4 months ago

bump

jan-janssen commented 4 months ago

Can you be a bit more specific about where the potential file /nfshome/leimeroth/MTP/AlCuZr/Fractions2//10/14//output.14.mtp is located? Is this on the cluster or on the local workstation?

Leimeroth commented 4 months ago

This is the full local path.

Leimeroth commented 4 months ago

Regarding the file writing, I guess the problem is

    def _check_if_input_should_be_written(self):
        if self._job_with_calculate_function:
            return False
        else:
            return not (
                self.server.run_mode.interactive
                or self.server.run_mode.interactive_non_modal
            )
always returning False for Lammps, so that

    def save(self):
        """
        Save the object, by writing the content to the HDF5 file and storing an entry in the database.

        Returns:
            (int): Job ID stored in the database
        """
        self.to_hdf()
        if not state.database.database_is_disabled:
            job_id = self.project.db.add_item_dict(self.db_entry())
            self._job_id = job_id
            _write_hdf(
                hdf_filehandle=self.project_hdf5.file_name,
                data=job_id,
                h5_path=self.job_name + "/job_id",
                overwrite="update",
            )
            self.refresh_job_status()
        else:
            job_id = self.job_name
        if self._check_if_input_should_be_written():
            self.project_hdf5.create_working_directory()
            self.write_input()
        self.status.created = True
        print(
            "The job "
            + self.job_name
            + " was saved and received the ID: "
            + str(job_id)
        )
        return job_id

never calls write_input.

jan-janssen commented 4 months ago

Just as a workaround, can you check if it works by setting:

job._job_with_calculate_function = False

Leimeroth commented 4 months ago

With job._job_with_calculate_function = False, the input files and an additional WARNING_pyiron_modified_content file are written.

jan-janssen commented 4 months ago

With job._job_with_calculate_function = False, the input files and an additional WARNING_pyiron_modified_content file are written.

Does the remote submission work when job._job_with_calculate_function = False is set?

Leimeroth commented 4 months ago

Yes, the job is submitted and runs.

EDIT: The job runs and finishes on the cluster. However, retrieving it with pr.update_from_remote() changes its status to initialized instead of finished locally.
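
The retrieval step, for reference (the job name is a placeholder):

pr.update_from_remote()       # sync queue status and files from the cluster
job = pr.load("lmp_remote")   # hypothetical job name
print(job.status)             # expected "finished", but shows "initialized"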

Leimeroth commented 4 months ago

As the issue is not part of the Lammps class itself, I am confused as to why it works with VASP.

jan-janssen commented 4 months ago

Yes, the job is submitted and runs.

EDIT: The job runs and finishes on the cluster. However, retrieving it with pr.update_from_remote() changes its status to initialized instead of finished locally.

Ok, an alternative suggestion would be to add the write_input() call before the remote submission. I tried it in https://github.com/pyiron/pyiron_base/pull/1511 but have not tested it so far.
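
The rough shape of the change, untested (the function name is taken from pyiron_base's runfunction.py; treat the exact placement as an assumption):

def run_job_with_runmode_queue(job):
    job.write_input()   # added: write the input files before the remote transfer
    ...                 # existing queue-submission logic stays unchanged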

jan-janssen commented 4 months ago

As the issue is not part of the Lammps class itself, I am confused as to why it works with VASP.

I do not know yet. We had another bug with how restart files are read (https://github.com/pyiron/pyiron_base/pull/1509), but that is still work in progress.

Leimeroth commented 4 months ago

Yes, the job is submitted and runs. EDIT: The job runs and finishes on the cluster. However, retrieving it with pr.update_from_remote() changes its status to initialized instead of finished locally.

Ok, an alternative suggestion would be to add the write_input() call before the remote submission. I tried it in pyiron/pyiron_base#1511 but have not tested it so far.

It works with the addition of job.project_hdf5.create_working_directory(). In this case the warning file is not created.
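
So on top of the pyiron/pyiron_base#1511 branch, the user-side part reduces to (sketch):

job.project_hdf5.create_working_directory()   # the one manual addition still needed
job.run()                                     # write_input() is now called before submission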

jan-janssen commented 4 months ago

It works with the addition of job.project_hdf5.create_working_directory(). In this case the warning file is not created.

Great, I think that is the best solution until we have https://github.com/pyiron/pympipool ready to handle the remote submission.

Leimeroth commented 4 months ago

Do you have an idea how to fix the issue of potentials that are not part of the resources dataframe?

jan-janssen commented 4 months ago

Do you have an idea how to fix the issue of potentials that are not part of the resources dataframe?

I would modify the potential dataframe, and maybe just attach the potential as a restart file.
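
Untested sketch of what I mean (column names follow the custom Lammps potential dataframe convention; whether restart files are transferred correctly is the subject of pyiron/pyiron_base#1509):

import pandas as pd

# reference only the bare file name in the dataframe, so the generated
# input looks for the potential inside the working directory, and ship
# the actual file through the restart-file mechanism
custom_potential = pd.DataFrame(
    {
        "Name": ["AlCuZr_MTP"],
        "Filename": [["output.14.mtp"]],   # bare name, resolved in the working directory
        "Model": ["Custom"],
        "Species": [["Al", "Cu", "Zr"]],
        "Config": [["pair_style mlip mlip.ini\n", "pair_coeff * *\n"]],
    }
)
job.potential = custom_potential
job.restart_file_list.append(
    "/nfshome/leimeroth/MTP/AlCuZr/Fractions2/10/14/output.14.mtp"  # local file to copy along
)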

jan-janssen commented 3 months ago

@niklassiemer I am closing this issue; feel free to reopen it if it comes up again.

niklassiemer commented 3 months ago

Probably the wrong ping, @Leimeroth