lanl / BEE


Checkpoint restart fix and test script #680

Closed jtronge closed 1 year ago

jtronge commented 1 year ago

This should fix issues #631 and #678.

The recent changes in Charliecloud don't seem to be backward compatible, so I've raised the minimum version to 0.32. Is that the best way to deal with this?

jtronge commented 1 year ago

A couple things that I forgot to mention:

This adds a new requirement for setting a time limit:

beeflow:SchedulerRequirement:
  timeLimit: 00:00:10
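
As a rough illustration of how a value like this could be interpreted, here is a minimal Python sketch that converts the string to seconds. It assumes the value is in HH:MM:SS form, as in the example above; the helper name time_limit_to_seconds is hypothetical and is not taken from the BEE code base.

```python
def time_limit_to_seconds(time_limit: str) -> int:
    """Convert an HH:MM:SS string into a number of seconds.

    Assumption: the time limit always has exactly three colon-separated
    fields. Other scheduler formats (e.g. with a day field) are rejected.
    """
    parts = time_limit.split(":")
    if len(parts) != 3:
        raise ValueError(f"expected HH:MM:SS, got {time_limit!r}")
    # int() raises ValueError on non-numeric fields, which is what we want.
    hours, minutes, seconds = (int(part) for part in parts)
    if hours < 0 or not 0 <= minutes < 60 or not 0 <= seconds < 60:
        raise ValueError(f"field out of range in {time_limit!r}")
    return hours * 3600 + minutes * 60 + seconds


if __name__ == "__main__":
    # The value from the requirement above.
    print(time_limit_to_seconds("00:00:10"))
```

A short limit such as 00:00:10 is presumably useful for the checkpoint restart test, since the job has to run out of time before a restart is triggered.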

It also adds a new beeflow DockerRequirement extension, beeflow:forceType, which corresponds to the newly required argument for ch-image's --force option.

Also, I've fully "internalized" the jinja file, so it no longer gets copied to the user's config dir.

pagrubel commented 1 year ago

Just starting to look at this. How would one enter other directives, such as the account that is required on Summit, or really on any system where one has multiple accounts?

jtronge commented 1 year ago

We would probably want to add that as an option under the beeflow:SchedulerRequirement. Maybe we could also set a default in the bee.conf in case somebody will always be using the same one.

pagrubel commented 1 year ago

> We would probably want to add that as an option under the beeflow:SchedulerRequirement. Maybe we could also set a default in the bee.conf in case somebody will always be using the same one.

That would need to be done in this PR; otherwise BEE becomes unusable on the ORNL platforms.

jtronge commented 1 year ago

I added an account option for the beeflow:SchedulerRequirement as well as a new bee.conf section [job] for setting default account and timeout options. I also tested this on Crusher.
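
To make the intended precedence concrete, here is a minimal Python sketch of how a per-step requirement could override defaults from the [job] section. The key names default_account and default_time_limit, the account value my_project, and the function resolve_job_options are assumptions for illustration only; the actual option names in bee.conf may differ.

```python
import configparser

# Hypothetical bee.conf contents; the key names are assumed, not confirmed.
BEE_CONF_TEXT = """
[job]
default_account = my_project
default_time_limit = 00:10:00
"""


def resolve_job_options(conf: configparser.ConfigParser, requirement: dict) -> dict:
    """Merge a beeflow:SchedulerRequirement dict with [job] defaults.

    Values set in the requirement take precedence; anything missing falls
    back to the bee.conf default, or None if no default is configured.
    """
    job = conf["job"] if conf.has_section("job") else {}
    return {
        "account": requirement.get("account", job.get("default_account")),
        "timeLimit": requirement.get("timeLimit", job.get("default_time_limit")),
    }


if __name__ == "__main__":
    conf = configparser.ConfigParser()
    conf.read_string(BEE_CONF_TEXT)
    # A step that only sets a time limit picks up the default account.
    print(resolve_job_options(conf, {"timeLimit": "00:00:10"}))
```

With this approach, someone who always uses the same account only sets it once in bee.conf, while workflows on multi-account systems can still override it per step.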

Also, to run the checkpoint restart test, first start beeflow and then run python3 ci/integration_test.py --tests checkpoint_restart --timeout 800.