payu-org / payu

A workflow management tool for numerical models on the NCI computing systems
Apache License 2.0

Git issue with different initial conditions. #218

Closed josuemtzmo closed 4 years ago

josuemtzmo commented 4 years ago

I'm submitting multiple runs (n=100) of the same overall configuration in MITgcm; however, the initial conditions change for each run. Of the 100 simulations I tried to execute, only 5 ran as expected, while the other 95 crashed with git errors.

Is there a flag to stop payu from using git and adding files when it is executed for each experiment?

Currently Loaded Modulefiles:
 1) pbs   2) openmpi/4.0.2(default)
fatal: Unable to create '/home/156/jm5970/expts/mitgcm_offline_flt/.git/index.lock': File exists.

Another git process seems to be running in this repository, e.g.
an editor opened by 'git commit'. Please make sure all processes
are terminated then try again. If it still fails, a git process
may have crashed in this repository earlier:
remove the file manually to continue.
Traceback (most recent call last):
  File "/g/data3/hh5/public/apps/miniconda3/envs/analysis3-20.01/bin/payu-run", line 10, in <module>
    sys.exit(runscript())
  File "/g/data3/hh5/public/apps/miniconda3/envs/analysis3-20.01/lib/python3.7/site-packages/payu/subcommands/run_cmd.py", line 128, in runscript
    expt.run()
  File "/g/data3/hh5/public/apps/miniconda3/envs/analysis3-20.01/lib/python3.7/site-packages/payu/experiment.py", line 587, in run
    self.runlog.commit()
  File "/g/data3/hh5/public/apps/miniconda3/envs/analysis3-20.01/lib/python3.7/site-packages/payu/runlog.py", line 89, in commit
    cwd=self.expt.control_path)
  File "/g/data3/hh5/public/apps/miniconda3/envs/analysis3-20.01/lib/python3.7/subprocess.py", line 363, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['git', 'add', '/home/156/jm5970/expts/mitgcm_offline_flt/30d_LADV_part_release_00073/config.yaml']' returned non-zero exit status 128.
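
For what it's worth, the error message itself points at the workaround for a stale lock. Assuming no other git process is actually still running against that repo, something like the following clears it, although it doesn't fix the underlying concurrency problem:

# remove a stale lock left behind by a crashed git/payu process
# (only safe if no git process is still running in this repository)
ls -l /home/156/jm5970/expts/mitgcm_offline_flt/.git/index.lock
rm /home/156/jm5970/expts/mitgcm_offline_flt/.git/index.lock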
marshallward commented 4 years ago

Are those runs sharing the same config directory (i.e. the one containing config.yaml and your text inputs)? If so, then I would not expect this to work, and git failures would be the least of your problems.

If you need to do concurrent runs, then each run needs its own config directory with a unique name.

aidanheerdegen commented 4 years ago

What @marshallward said ... but you can use git clone and git branch to keep all the configuration in a single repo while cloning to different directory names, if that helps.
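
As a rough sketch of that approach (the branch and directory names below are just illustrative):

# keep every configuration as a branch of one control repo,
# but give each concurrent run its own clone, i.e. its own working directory and .git
git -C mitgcm_offline_flt branch run_00000
git clone -b run_00000 mitgcm_offline_flt 30d_LADV_part_release_00000
git -C mitgcm_offline_flt branch run_00001
git clone -b run_00001 mitgcm_offline_flt 30d_LADV_part_release_00001
# each clone has its own index, so concurrent payu runs no longer fight over one lock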

josuemtzmo commented 4 years ago

The runs share a similar config file:

ncpus: 1
mem: 50GB
walltime: 07:00:00
jobname: PR_LAV_{0}
project: x77
queue: express
qsub_flags: -lother=hyperthread -W umask=027 -l storage=gdata/v45+scratch/v45+gdata/hh5+gdata/x77

model: mitgcm
shortpath: /scratch/x77
exe: mitgcm_HR_satellite_P_release
input: global_particle_release/30d/30d_slice_chunk_{0}

collate: True
userscripts:
  archive: clear_archive.sh

However, each submission is executed in its own folder, with a unique config.yaml file (I'm replacing {0} with the corresponding experiment run number) generated by my submission script:

#!/bin/bash

#Load modules & global variables
module use /g/data3/hh5/public/modules
module load conda/analysis3-unstable

globalpath=`pwd`
count=0
cc=0
n=25
particle_grid='flt_global_hex_032deg.bin'

# input path
input_path='/scratch/x77/jm5970/mitgcm/input/global_particle_release'

# Loop for every initialization of the particle release:
for tt in `seq 0 100`
do
  # Create folder for running experiment.
  folder="30d_LADV_part_release_$(printf %05d ${tt%})"
  mkdir $folder
  # Modify the corresponding files to set up the experiment.
  cp ./input/* $folder/.
  sed s-input_off-'.'-g input/data.off > "$folder/data.off"
  sed s-flt_global_hex_10deg.bin-${particle_grid}-g input/data.flt > "$folder/data.flt"
  sed s-{0}-$(printf %05d ${tt%})-g config_sed.yaml > "$folder/config.yaml"
  sed s-{0}-30d_slice_chunk_$(printf %05d ${tt%})-g input/clear_archive.sh > "$folder/clear_archive.sh"
  cd $folder

  ln -s $input_path/${particle_grid} $input_path/30d/30d_slice_chunk_$(printf %05d ${tt%})/

  # Run the experiment.
  payu run -i 0

  cd $globalpath

  count=$((count+1))

  # Sleep for 1 hour so the submissions can proceed without over-queuing PBS.
  if [ $cc -eq $n ]
    then
      cc=0
      echo "Sleep submission"
      sleep 1h
  else
     cc=$((cc+1))
  fi

done

So I think I'm following the expected workflow of payu.

aidanheerdegen commented 4 years ago

I can't access your directory, but I'm guessing you're making control subdirectories within a directory that is itself a git repo.
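
A quick way to check that, using one of the run directories from your traceback:

# if this prints a repository root above the run directory,
# every run is sharing that repo's .git (and its index lock)
cd /home/156/jm5970/expts/mitgcm_offline_flt/30d_LADV_part_release_00073
git rev-parse --show-toplevel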

So either make the control directories somewhere else, or add runlog: False to your config.yaml so that it doesn't do git stuff.
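
For the second option, a one-line tweak to your submission loop (right after the config.yaml is generated) would be enough, e.g.:

# disable payu's runlog so it never calls git for these runs
echo "runlog: False" >> "$folder/config.yaml"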

josuemtzmo commented 4 years ago

Yes, I'm creating control subdirectories within a git repo directory. I've also changed the group of the folder to 'v45', so perhaps you can access it now. I'm resubmitting the jobs with the flag, hoping it will solve the issue.

aidanheerdegen commented 4 years ago

Your home directory isn't group readable/executable.
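
A minimal sketch of the fix, using the paths from your traceback, would be something like:

# let group members (v45) traverse the home directory and read the experiment tree
chmod g+rx /home/156/jm5970
chmod -R g+rX /home/156/jm5970/expts/mitgcm_offline_flt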

josuemtzmo commented 4 years ago

I've changed the permissions. Thanks for pointing this out!

josuemtzmo commented 4 years ago

I'm closing this issue, as using the flag or moving the files solved it.