ligerzero-ai opened 2 years ago
Also, while we're on this topic, is it possible to kill calculations without deleting the already-written output? I notice that @samwaseda has requested this feature before, so maybe it is something that could be included in this kind of QOL enhancement suite.
This is useful for preserving the already-written output of runs that exhibit poor electronic convergence, while still killing the job so that compute time is not wasted. That way you keep a record of which parameters have already been tried.
Deleting the job from the queuing system is possible using the commands in the backend but not recommended:
from pyiron_base.jobs.job.extension.server.queuestatus import queue_delete_job
queue_delete_job(item=job.server.queue_id)
Can you explain why it is not recommended?
I am not sure whether pyiron correctly recognizes the job as aborted or whether it keeps the status as running. We would also have to check whether the output is parsed for aborted jobs, as this is commonly only done for finished jobs.
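A minimal sketch of how this could be tried, assuming the status can be set by hand and that collect_output() tolerates a partially written run (both of which are exactly the open questions above):

from pyiron_base.jobs.job.extension.server.queuestatus import queue_delete_job

# remove the job from the queuing system, but leave its files on disk
queue_delete_job(item=job.server.queue_id)

# assumption: the status does not update on its own, so set it manually
job.status.aborted = True

# assumption: the parser can cope with incomplete output files
job.collect_output()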
I do find myself checking lammps output while it's running from time to time, so I think this is a good idea. The way to do it would probably be with an interactive widget from ipywidgets. Ping @niklassiemer because he has the most experience with those, but I would think it's pretty straightforward to make a text output that updates itself from time to time.
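A rough sketch of what such a widget could look like (assuming a Jupyter notebook and a log.lammps file in the job's working directory; the helper name and the thread-based refresh are illustrative, not an existing pyiron feature):

import threading
import time
from pathlib import Path

import ipywidgets as widgets
from IPython.display import display

def tail_widget(path, n_lines=30, interval=5.0):
    out = widgets.Output()

    def _update():
        # periodically re-read the end of the file and refresh the widget
        while True:
            lines = Path(path).read_text(errors="ignore").splitlines()[-n_lines:]
            out.clear_output(wait=True)
            with out:
                print("\n".join(lines))
            time.sleep(interval)

    threading.Thread(target=_update, daemon=True).start()
    return out

display(tail_widget(Path(job.working_directory) / "log.lammps"))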
for example, if I call mpi_run vasp_std_5.4.4 > vasp.log in the job script (currently time.out?), I want to be able to see what is being written to this vasp.log file in real time.
If I remember correctly, the monty package (used by pymatgen) can read output files backwards, which can be helpful to get only the delta that changed rather than loading the whole file.
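Something along these lines, assuming monty is installed (reverse_readline in monty.io is the relevant helper; the file name is just an example):

from itertools import islice

from monty.io import reverse_readline

# read only the newest lines of a potentially huge log, starting from the end
with open("vasp.log") as f:
    newest = list(islice(reverse_readline(f), 20))

print("\n".join(reversed(newest)))  # restore chronological order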
In general I have the feeling that the parsing of the output files should be more clearly separated from the storing of the output in the HDF5 format. Basically there should be an output parser class which only returns the output, and then a second class which stores the output in the HDF5 format; this would also allow us to store the output of multiple calculations in the HDF5 format.
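A hypothetical sketch of that split, with purely illustrative class names and interfaces (none of these exist in pyiron today):

class OutputParser:
    """Only reads the raw files and returns plain Python data."""

    def __init__(self, working_directory):
        self.working_directory = working_directory

    def parse(self):
        # a real implementation would read OUTCAR, log.lammps, ...
        return {"energy_tot": [], "forces": []}

class HDF5Writer:
    """Only writes already-parsed data; knows nothing about file formats."""

    def __init__(self, hdf):
        self.hdf = hdf  # e.g. a pyiron ProjectHDFio group

    def write(self, output, group="output"):
        with self.hdf.open(group) as grp:
            for key, value in output.items():
                grp[key] = value

# storing several calculations would then become a loop over parsers:
# writer = HDF5Writer(job.project_hdf5)
# for parser in parsers:
#     writer.write(parser.parse())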
Agree that this would be the best version to have for output parsing and the HDF stuff, but I think this issue is slightly parallel to that: sometimes I just want to have a super quick peek at, say, the lammps log of a running job. So I do job['log.lammps'][-100:] in a cell that I execute manually a bunch of times. Having a log.tail('log.lammps') would be cool, I think.
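For reference, a minimal sketch of such a convenience (the name tail and its placement as a free function are assumptions; it only wraps the job['log.lammps'] access shown above):

def tail(job, file_name="log.lammps", n_lines=100):
    # pyiron returns the file content as a list of lines via job[file_name]
    lines = job[file_name]
    print("\n".join(str(line).rstrip("\n") for line in lines[-n_lines:]))

tail(job, "log.lammps", n_lines=100)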
Summary
Monitor the output of a job in real time.
for example, if I call mpi_run vasp_std_5.4.4 > vasp.log in the job script (currently time.out?), I want to be able to see what is being written to this vasp.log file in real time.
Detailed Description
This allows real-time monitoring of job output, for example immediately after submission. This is very nice for troubleshooting particularly troublesome calculations, or for calculations which are technically difficult (e.g. checking convergence in real time, killing jobs that fail to converge electronically).
I would be able to do this by calling
tail -f vasp.log (or OUTCAR, etc.)
on a running job via the terminal. If this is not possible, I would like to be able to change the pyiron job name that appears in the queue, so I know exactly where to look in the folder structure for still-running calculations that could be failing to converge properly, without the additional work of querying the queueing system for it.
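A workaround sketch for locating the folders of still-running jobs from pyiron itself (assuming pr is the pyiron Project containing the jobs; job_table(), load() and working_directory are existing features, the status filter is just an illustration):

# print the working directories of all currently running jobs in a project,
# so they can be followed with `tail -f vasp.log` or `tail -f OUTCAR`
df = pr.job_table()
for job_id in df[df.status == "running"].id:
    print(pr.load(job_id).working_directory)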