pyiron / pyiron_base

Core components of the pyiron integrated development environment (IDE) for computational materials science
https://pyiron-base.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Can tasks be scheduled before submission to the queue? #923

Closed: appassionate closed this issue 1 year ago

appassionate commented 2 years ago

Hi all, pyiron is really nice work! :)

A question: I don't want to submit too many LAMMPS jobs to an HPC at one time.

For example, if I have 100 jobs and submit them all at once, there will be 100 SLURM tasks on the HPC. Is there any mechanism to control the number of running/pending jobs on the HPC, so that some jobs are held back (scheduled by pyiron) while others are in SLURM? Or have I missed some interesting feature of pyiron? Many thanks!

pmrv commented 2 years ago

Hi @appassionate, thanks for your interest!

We don't have any facilities in pyiron to "pre-queue" jobs before they hit the queuing system, and I'm not sure what the advantage would be. Is there anything in your HPC setup that is preventing you from submitting too many jobs?

In any case it's certainly possible to write a script that idles while the other computations are running. That would imply that you have a (small) script constantly running on the head node of your cluster.
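
As a rough sketch of that idea (not an existing pyiron feature): such a script could poll the project's job table and only submit more jobs while fewer than a chosen number are queued or running. The project name, the limit, the sleep interval, the status values used for filtering and the assumption that the waiting jobs were prepared beforehand are all illustrative:

import time
from pyiron_base import Project

pr = Project("lammps_batch")          # placeholder project name
MAX_ACTIVE = 10                       # placeholder limit on jobs in SLURM at once

while True:
    df = pr.job_table()
    active = df[df["status"].isin(["submitted", "running"])]
    waiting = df[df["status"] == "created"]      # jobs prepared but not yet sent off
    if len(waiting) == 0 and len(active) == 0:
        break
    free_slots = max(MAX_ACTIVE - len(active), 0)
    for job_name in waiting["job"].head(free_slots):
        job = pr.load(job_name)
        job.run(run_mode="queue", delete_existing_job=True)   # assumes job.server.queue is set
    time.sleep(60)                               # poll once a minute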

If you do not want to submit the jobs because they are small and quick, have a look at WorkerJob.

appassionate commented 2 years ago

Thanks for your generous reply, @pmrv! The HPC gets busy when too many people submit their tasks. Some of my tasks are light and finish quickly, and the command line gets messy when too many tasks are pending at once... I think batching several jobs into one SLURM/LSF script would also be suitable for me. Can WorkerJob handle such a "batch-submit" case? Maybe I need to try it first. :)

jan-janssen commented 2 years ago

@appassionate To explain the worker job a bit more: the idea is that you ask for a couple of nodes, and within this allocation the worker job distributes the tasks, typically running one calculation per node. For this distribution we use SLURM's internal srun logic. In particular, on large computing clusters there is commonly a reduction in the computing cost charged when asking for larger allocations (500+ nodes), and that is what the worker job is designed for.

Can you explain a bit more what kind of calculations you plan to submit in the individual jobs? DFT? MD?

appassionate commented 2 years ago

Hi @jan-janssen, thanks for the hint about WorkerJob! I can imagine that if I had many clusters, WorkerJob would help me a lot to distribute jobs in pyiron. But I only have a few nodes. My DFT and MD tasks are time consuming, so it is fine to submit each of them as one SLURM/LSF task (I mainly use LSF); the other tasks are analysis scripts (in Python, some parallel) which finish quickly, and if I turn each of those into a "pyironic" job, too many of them end up pending. A "batch-submit" would be helpful for me, I guess. It might look like this:

# ... (SLURM/LSF settings) ...
{{command_0}}
{{command_1}}
{{command_...}}  # each command corresponds to a specific task, perhaps generated by some abstraction?

and then WorkerJob would help me monitor the status of those jobs. OK, I am just imagining. :) Anyway, many thanks.

niklassiemer commented 2 years ago

There is also the ScriptJob, which allows you to send a full notebook/script to SLURM to be executed. Inside, you could define different jobs and run and analyse them.
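
For illustration, a minimal ScriptJob setup might look like the following; the project name, script path and queue name are placeholders:

from pyiron_base import Project

pr = Project("analysis")
job = pr.create.job.ScriptJob("batch_analysis")
job.script_path = "analysis.py"   # a Python script or Jupyter notebook to execute
job.server.queue = "cpu"          # queue name as defined in your pysqa configuration
job.server.cores = 24
job.run()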

appassionate commented 2 years ago

@niklassiemer Thanks! I have tried ScriptJob so far. It is very convenient to set the parameters in the script!

appassionate commented 2 years ago

@pmrv @jan-janssen Hi, thanks for your detailed help with WorkerJob. I believe my "batch submit" idea was naive now that I understand WorkerJob better. In my case I run the WorkerJob in an HPC queue. Am I doing it right? The WorkerJob example uses the non_modal run mode. I believe the WorkerJob on the queue will act as a "pyironic" worker to execute LAMMPS or other calculation tasks.

job_worker = pr_worker.create.job.WorkerJob("gpu")
job_worker.server.queue = "gpu"
job_worker.server.cores = 24
job_worker.run()

and then the worker job in the "worker" project is running. Another problem for me is that the "calculation" jobs always seem to be submitted directly instead of actually being executed by the WorkerJob. Or have I missed something? Many thanks.
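
A minimal sketch of how calculation jobs are typically handed to a running worker, as far as the WorkerJob example goes; the job type and names are placeholders and the exact attributes may vary between versions:

from pyiron_base import Project

pr_calc = Project("calculation")
job = pr_calc.create.job.ScriptJob("analysis_0")   # placeholder job type and name
job.script_path = "analysis.py"
job.server.run_mode.worker           # do not submit directly; let the worker execute it
job.master_id = job_worker.job_id    # job_worker is the WorkerJob from the snippet above
job.run()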

appassionate commented 2 years ago

If I guess right, WorkerJob needs to run in the non_modal way so that it runs in parallel with the Jupyter kernel backend? Would it be possible for WorkerJob to run in an HPC queue via pysqa to do the heavy work, which might be similar to how Dask handles HPC...? :)

appassionate commented 1 year ago

Hi, long time no see in this issue. It is interesting to use a Dask cluster client in this notebook example of pylammpsmpi: https://github.com/pyiron/pylammpsmpi/blob/master/notebooks/lammps_local_cluster.ipynb

lmp = LammpsLibrary(cores=10, mode='dask', client=client)

On the other hand, would it be reasonable to support Dask client resources in pyiron_base? Something like:

job.server.dask_client = a_existing_client
job.server.run_mode = "dask"

dask.distributed would be a good interface for HPC resource scheduling. We just run a SLURM/LSF script to get a Dask cluster reachable at a TCP address, and then we can get a client to it like this:

from dask.distributed import get_client
client = get_client()

I believe that using client.submit(obj) in this way means obj should be picklable.
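
As a small self-contained illustration of that point, submitting a picklable function to an existing Dask client looks like this (the scheduler address is a placeholder):

from dask.distributed import Client

client = Client("tcp://127.0.0.1:8786")   # address of an already running scheduler

def heavy_analysis(x):
    # stand-in for an analysis task; it must be picklable to be shipped to a worker
    return x ** 2

future = client.submit(heavy_analysis, 10)   # returns a Future immediately
print(future.result())                       # blocks until the result is available -> 100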

ligerzero-ai commented 1 year ago

Hi there!

Thanks for your interest in pyiron :)

We have no plans to implement any scheduling functionality in pyiron, because we believe this to be the responsibility of the job scheduler in HPC systems. If there are queueing limitations, I guess there could be a need for this feature. However, as I understand this thread, your issue is that you want some method to record the submitted jobs so you can manage them in both the pre-submitted and the submitted state.

In this case, can I suggest a small class that you can write on your own? For example, you could write a log of "pre-submitted" jobs to a file on disk:

class JobSubmissionManager:

    def __init__(self, project, job_limit):
        self.project = project
        self.job_limit = job_limit

    @property
    def running_df(self):
        """Jobs of the project that are currently running."""
        df = self.project.job_table()
        return df[df["status"] == "running"]

    def write_log(self):
        """Writes a log of the pre-submitted jobs to a file on disk."""
        ...

    def submit(self):
        """Reads the JobSubmissionManager log from disk and submits the jobs
        listed in the file, as long as fewer than job_limit jobs are running."""
        ...

The job submission manager can work around the self-imposed or user-imposed limit by acting as an interface between the user and the job submission of a project.

I think you can get around the job creation problem by simply calling job.run(run_mode = "manual") instead of job.run(run_mode = "queue"), as you usually would for a for-loop job submission. Then you can feed the project into the JobSubmissionManager and use it as the real submission manager. I would envision that the JobSubmissionManager initially writes a log of all the jobs that have yet to be run, the "pre-queue", and removes jobs from it as they get submitted to the cluster each time you run JobSubmissionManager.submit(). Job creation would then work like this: initially run the job with job.run(run_mode = "manual"); when JobSubmissionManager uses its submit method, it does job = pr.load("log_imported_job_name") followed by job.run(run_mode = "queue", delete_existing_job = True). This way you have a log of the jobs that have been submitted and are running/finished in pyiron, and of those that are still in the pre-queue. You can set this up to be cron-like, either on the Python or the terminal side, with a script that runs on the head node.
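
A hypothetical usage sketch of this workflow, with placeholder job names, job type and limit, might look like:

from pyiron_base import Project

pr = Project("lammps_batch")

# create the jobs up front without sending them to the cluster
for i in range(100):
    job = pr.create.job.Lammps("lmp_" + str(i))   # placeholder job type
    job.run(run_mode="manual")

manager = JobSubmissionManager(pr, job_limit=10)
manager.write_log()    # record the full "pre-queue" on disk

# run periodically, e.g. from cron on the head node
manager.submit()       # pushes waiting jobs to the queue while respecting job_limit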

I hope this helps!

If eventually we get a more elegant solution, or get around to making it because of more demand, I will re-open this issue.