libAtoms / workflow

python workflow toolkit
GNU General Public License v2.0

Successful MD job not transferred to local machine #283

Closed · jungsdao closed this issue 8 months ago

jungsdao commented 8 months ago

Hello, I'm running MD by submitting a job to a GPU node on a remote cluster. The job finished successfully on the cluster (which I can check from the scratch folder), but somehow it has not been transferred to the local machine, for an unknown reason. So I don't know where I can find my MD trajectory. Is there any way to retrieve this successful MD trajectory and copy it back to my local machine? Many thanks in advance!

bernstei commented 8 months ago

That depends. The workflow should be staging things back if you run it (again) after the job is done. If it's not, it should print an error message, and you should copy and paste that here. However, the MD function doesn't necessarily save the whole trajectory - it depends on what parameters you passed it. If you paste that code snippet here (the one that calls wfl.generate.md.md), I can look.

jungsdao commented 8 months ago
file has vanished: "/raven/ptmp/hjung/GAP/scratch/unkownhost-_home_hjung/run_production_md_chunk_0_jGVBwWxZNVtzi7m2eDga4dPe3hzpzSMC8QgS5F_7x8Y=_os4zkic2/_tmp_expyre_job_succeeded"
rsync warning: some files vanished before they could be transferred (code 24) at main.c(1684) [generator=3.1.3]

Thank you for the reply, I think this is the related error message. The following are my MD parameters.

from ase.units import fs
from wfl.generate.md import md as sample_md
from wfl.autoparallelize import AutoparaInfo

md_params = {'steps': 200000,  # 500
             'dt': 0.5, 'integrator': 'Langevin',
             'temperature': 573., 'temperature_tau': 100 / fs, 'verbose': True}

# in_config, out_config, calculator and remote_gpu_info are defined elsewhere in the script
sample_md(in_config, out_config, calculator=calculator,
          autopara_info=AutoparaInfo(num_python_subprocesses=1,
                                     num_inputs_per_python_subprocess=1,
                                     remote_info=remote_gpu_info),
          **md_params)

bernstei commented 8 months ago

That's a strange message, and I don't understand how it could have happened. Have you tried to run the workflow again (so that it tries to stage back the files again)? That error feels like the sort of thing that might happen if it tried to stage back the files at exactly the wrong moment, and trying it again might help.

Anyway, the data is probably there, just not in a straightforward format; I can show you how to extract it if we can't get the staging back to work.

bernstei commented 8 months ago

Can you also post the output of ls -l '/raven/ptmp/hjung/GAP/scratch/unkownhost-_home_hjung/run_production_md_chunk_0_jGVBwWxZNVtzi7m2eDga4dPe3hzpzSMC8QgS5F_7x8Y=_os4zkic2/' (all one line, no spaces in the file path)

jungsdao commented 8 months ago

This is the output of the command. Perhaps the produced MD trajectory is quite large, and that delayed copying it back to the local machine.

total 2442502
-rw-r--r-- 1 hjung mfh          0 Jan 13 14:47 _expyre_job_started
-rw-r--r-- 1 hjung mfh 1225042903 Jan 13 16:49 _expyre_job_succeeded
-rw-r--r-- 1 hjung mfh        776 Jan 13 14:46 _expyre_post_run_commands
-rw-r--r-- 1 hjung mfh         28 Jan 13 14:46 _expyre_pre_run_commands
-rw-r--r-- 1 hjung mfh        731 Jan 13 14:46 _expyre_script_core.py
-rw-r--r-- 1 hjung mfh        367 Jan 13 14:47 _expyre_stderr
-rw-r--r-- 1 hjung mfh   11600235 Jan 13 16:47 _expyre_stdout
-rw-r--r-- 1 hjung mfh      10792 Jan 13 14:46 _expyre_task_in.pckl
-rw-r--r-- 1 hjung mfh        754 Jan 13 14:47 job.production_md_chunk_0_jGVBwWxZNVtzi7m2eDga4dPe3hzpzSMC8QgS5F_7x8Y=_os4zkic2.stderr
-rw-r--r-- 1 hjung mfh          0 Jan 13 14:47 job.production_md_chunk_0_jGVBwWxZNVtzi7m2eDga4dPe3hzpzSMC8QgS5F_7x8Y=_os4zkic2.stdout
-rw-r--r-- 1 hjung mfh       2089 Jan 13 14:46 job.script.slurm

bernstei commented 8 months ago

The data is there, encoded in _expyre_job_succeeded, but the temporary file the rsync complained about is gone, so running the workflow again should find the completed job and create the final trajectory file in whatever format you requested in the OutputSpec, presumably extxyz.

Note that it shouldn't even try to copy files back until after the temporary file has been renamed to its permanent name, so I don't see how you could have gotten that error, but I'll think about that issue. Nevertheless, running the script again should work better.
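
For context, a minimal sketch of the kind of setup this refers to (not the poster's exact script; the file names and the wfl.configset import are assumptions based on typical wfl usage):

from wfl.configset import ConfigSet, OutputSpec

in_config = ConfigSet("initial_structures.xyz")       # hypothetical input file
out_config = OutputSpec("production_md_traj.extxyz")  # the suffix here sets the trajectory format

# re-running the unchanged sample_md(...) call should then stage the finished remote
# job back and write production_md_traj.extxyz instead of submitting a new job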

jungsdao commented 8 months ago

When I try again with the same command that ran the MD, I think it just starts a new MD run rather than copying back what succeeded in the previous run. Am I missing something in this step?

bernstei commented 8 months ago

That's not supposed to happen, although it might be just misleading stdout messages. Is it actually creating a new directory under /raven/ptmp/hjung/GAP/scratch/unkownhost-_home_hjung/ and submitting a new job, or are you just saying that based on the fact that it's printed out something?

jungsdao commented 8 months ago

I should try again to reproduce that. I'll let you know whether it creates a new directory or not. I think it does, but I'm not sure yet.

bernstei commented 8 months ago

I think it's possible to break that functionality if you're not careful with things like random seeds, because it can recognize that the seed has changed, determine that the run won't be identical, and therefore not reuse the previous result.

If that's the issue here, we should think about changing that, or at least documenting it.
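
As an illustration only (a sketch under the assumption that the job hash depends on numpy's global random state at call time, which is what the seed question above is getting at), pinning that state before the call keeps the hashed inputs identical between runs, so a re-run can recognize and reuse the finished remote job:

import numpy as np

np.random.seed(12345)  # arbitrary fixed value chosen by the user
# same call as in the snippet earlier in the thread; in_config, out_config,
# calculator, md_params and the AutoparaInfo object are unchanged between runs
sample_md(in_config, out_config, calculator=calculator,
          autopara_info=autopara_info, **md_params)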

jungsdao commented 8 months ago

I was trying to reproduce the error, and the following is the error message I got. Is there any way to avoid this error?

/home/hjung/miniforge3/envs/mace_env/lib/python3.11/site-packages/expyre/func.py:735: UserWarning: Job production_md_chunk_0_PaFv1GC9HL245v83HAz_CB43Yttt2kA-ZmEVRV8Dht4=_cz_6il18 has no _succeeded or _error file, but remote status done is not "queued", "held", or "running". Giving it one more chance.
  warnings.warn(f'Job {self.id} has no _succeeded or _error file, but remote status {remote_status} is '
/home/hjung/miniforge3/envs/mace_env/lib/python3.11/site-packages/wfl/autoparallelize/remote.py:143: UserWarning: Failed in remote job production_md_chunk_0_PaFv1GC9HL245v83HAz_CB43Yttt2kA-ZmEVRV8Dht4=_cz_6il18 on raven_gpu
  warnings.warn(f'Failed in remote job {xpr.id} on {xpr.system_name}')
Traceback (most recent call last):
  File "/work/home/hjung/Calculation/4_Free_energy_calculation/3_Ru/0_CHO/1_production/fit_5/production_md.py", line 180, in <module>
    run_md(initial_structures[i], outfile, mace_file, **md_params)
  File "/work/home/hjung/Calculation/4_Free_energy_calculation/3_Ru/0_CHO/1_production/fit_5/production_md.py", line 124, in run_md
    sample_md(in_config, out_config, calculator=calculator,
  File "/home/hjung/miniforge3/envs/mace_env/lib/python3.11/site-packages/wfl/generate/md/__init__.py", line 262, in md
    return autoparallelize(_sample_autopara_wrappable, *args,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hjung/miniforge3/envs/mace_env/lib/python3.11/site-packages/wfl/autoparallelize/base.py", line 177, in autoparallelize
    return _autoparallelize_ll(autopara_info, inputs, outputs, func, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hjung/miniforge3/envs/mace_env/lib/python3.11/site-packages/wfl/autoparallelize/base.py", line 223, in _autoparallelize_ll
    out = do_remotely(autopara_info, iterable, outputspec, op, args=args, kwargs=kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hjung/miniforge3/envs/mace_env/lib/python3.11/site-packages/wfl/autoparallelize/remote.py", line 137, in do_remotely
    ats_out, stdout, stderr = xpr.get_results(timeout=remote_info.timeout, check_interval=remote_info.check_interval)
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hjung/miniforge3/envs/mace_env/lib/python3.11/site-packages/expyre/func.py", line 731, in get_results
    raise ExPyReJobDiedError(f'Job {self.id} has remote status {remote_status} but no _succeeded or _error\n'
expyre.func.ExPyReJobDiedError: Job production_md_chunk_0_PaFv1GC9HL245v83HAz_CB43Yttt2kA-ZmEVRV8Dht4=_cz_6il18 has remote status done but no _succeeded or _error
stdout: No dtype selected, switching to float64 to match model dtype.

jungsdao commented 8 months ago

Also, I have resubmitted those failed jobs, and the jobs are spawned in a new directory rather than retrieving the previous successful job.

bernstei commented 8 months ago

That error message is normally what happens when a job is killed by the queuing system before it's done and doesn't have time to write its normal ending files. If you go into the hidden temporary directory where the job is submitted, you can look inside job.script.* and see what the stdout and stderr files are named, and look at those - queuing system errors are likely to be in there. It might be nice to add those to the wfl output when a job fails.
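
A rough sketch of that inspection (the stage directory path is a placeholder; the real one is the hidden job directory under your scratch root, like the one in the ls output above). It prints the #SBATCH directives of the generated SLURM job script, which include the names of the queuing-system stdout/stderr files that may hold the kill or timeout message:

from pathlib import Path

stage_dir = Path("/path/to/scratch/run_production_md_chunk_0_...")  # placeholder, not a real path
for line in (stage_dir / "job.script.slurm").read_text().splitlines():
    if line.startswith("#SBATCH"):
        print(line)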

As for the resubmitted jobs, are you sure that the numpy random seed has the same state when the MD function is called?

bernstei commented 8 months ago

That error message is normally what happens when a job is killed by the queuing system before it's done and doesn't have time to write its normal ending files.

I checked, and on my system when that happens the messages don't end where your text does. There's no way for the function that prints the stdout: line to end right there. After the stdout: message, there should be stderr:, and also lines with jobs_stdout: and job_stderr:. The last of these labels the message that the queuing system gave when it ran out of time.

jungsdao commented 8 months ago

That error message is normally what happens when a job is killed by the queuing system before it's done and doesn't have time to write its normal ending files. If you go into the hidden temporary directory where the job is submitted, you can look inside job.script.* and see what the stdout and stderr files are named, and look at those - queuing system errors are likely to be in there. It might be nice to add those to the wfl output when a job fails.

As for the resubmitted jobs, are you sure that the numpy random seed has the same state when the MD function is called?

When I look into the queuing system's error file, there's no particular error message printed there.

So I would like to bring back the successful job that is already on the remote machine. But how can I find the numpy random seed of the problematic job and use that seed to copy the results back to the local machine?

bernstei commented 8 months ago

There's no way to find out that seed, as far as I know, which I agree is bad - that needs to be fixed, and in fact in general the handling of the random seeds needs to be fixed. I opened #284 for that issue.

When this happens, however, you can get the output by copying _expyre_job_succeeded (the file that's very large, as shown in your ls output above) from the stage directory to your local machine. Then you should be able to convert it into an xyz file with

import pickle
import ase.io

# _expyre_job_succeeded holds the pickled return value of the remote job,
# i.e. the trajectory configurations in this case
d = pickle.load(open("_expyre_job_succeeded", "rb"))
ase.io.write("traj.extxyz", d)

Let me know how it works.

jungsdao commented 8 months ago

Thank you, it works and I can save the xyz file. But one problem is that I have no idea which job this came from on my local machine. Is there any identifier available in the output files? For example, the original directory on the local machine, or the command used for this job? Even though I retrieved this trajectory file, I've lost the context (or connection)...

bernstei commented 8 months ago

There's no identifier in the output file, because there's no guaranteed way for it to generate a unique identifier except hashing the input arguments, and if those include a non-deterministic seed you can't reproduce them. If you had set a sufficiently unique job name in the autopara_info, that's used as part of the name of the stage directory. If you look at the stage directories' dates, that might give you a clue as to which are the newest ones. And the _chunk_N_ in the stage dir name tells you which group of (num_configs_per_queued_job) input configs that particular job is for.

Also, I highly recommend creating a <subdir_where_you_are_running>/_expyre directory where you run the script, so the jobs for that script are separated from other scripts, and they don't just all end up in ~/.expyre/ (but it's too late for that now).
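
A minimal sketch of that suggestion (nothing wfl-specific, just creating the directory before the first submission so the jobs for this script are kept separate and don't all end up in ~/.expyre/):

from pathlib import Path

Path("_expyre").mkdir(exist_ok=True)  # run from the directory where the driver script runs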

bernstei commented 8 months ago

The fundamental issue is fixed in #285. Feel free to close this issue unless you have further questions.

jungsdao commented 8 months ago

Thank you for taking the trouble to solve this issue. I'll need to try again to see whether it finds the jobs correctly. I'll let you know if it works.

jungsdao commented 8 months ago

Now, even if a job failed before, it correctly finds the corresponding previous job and fetches the data.