ocgabs opened this issue 3 years ago (status: Open)
The run with r4.0-HEAD was failing with a shared-XIOS error: Error [CObjectFactory::GetObject(const StdString & id)] : In file '/work/n01/n01/acc/WORK_XIOS/ucx/xios-2.5/src/object_factory_impl.hpp'. I recompiled with r4.0.4 and used a locally compiled XIOS as well. The remaining problem was related to the .xml files, and was solved by using Valerie's files.
The run then produced results for 3 months.
export CONFIG=Caribbean
export WORK=/work/n01/n01
export WDIR=/work/n01/n01/$USER/$CONFIG
export CDIR=$WDIR/NEMO4/r4.0-HEAD/cfgs
export TDIR=$WDIR/NEMO4/r4.0-HEAD/tools
export Trunk=$WDIR/NEMO4/r4.0-HEAD
We usually run NEMO in copies of the EXP00 or EXPREF folders that are created or populated during compilation under ..../cfgs/$CONFIG (Caribbean in this case).
cd $CDIR/$CONFIG
export EXP=EXP01
(choose here the name you want; this will be the name of your experiment run directory)
cp -r EXP00 $EXP
The required xml files should already be there, as well as example namelists and the linked executable nemo.exe.
If it is not there, the original executable produced by compilation goes to ..../cfgs/$CONFIG/BLD/bin; you can copy it or link to it.
cd $EXP
ln -s $CDIR/BLD/bin/nemo.exe ./nemo.exe
You also need an executable of XIOS in your EXP folder. You should have one at the generic location /work/n01/n01/$USER/XIOS/bin, or wherever you compiled it; copy it or link it, i.e.
ln -s $WORK/XIOS/bin/xios_server.exe ./xios_server.exe
Then you need a configuration-specific namelist_cfg. This file contains all the parameters specific to the run: a myriad of switches you can turn on or off, plus the paths to the input data files, output directories, etc. (Remember: whatever is not specified in the namelist_cfg file is taken from namelist_ref, which has default values for everything needed. In these files, everything after the ! is a comment.) Get the namelist_cfg file from Valerie:
cp /work/n01/n01/valegu/CARIBBEAN_NEMO_4_0_4/nemo/cfgs/Caribbean/EXP_FullOcean/namelist_cfg namelist_cfg
Feel free to modify the namrun parameters at the top to define the length of your run, etc.; they should be self-explanatory. Numbers there are in time steps (rn_rdt = 240 seconds).
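The time-step arithmetic can be sketched like this (rn_rdt = 240 s is from the text above; the 30-day run length is just an example):

```shell
# Convert a wall-clock run length into the number of model time steps.
# 86400 is divisible by 240, so the division is exact.
rn_rdt=240          # model time step in seconds (from the namelist)
days=30             # example run length
echo $(( days * 86400 / rn_rdt ))   # 10800 time steps
```

The resulting number is what goes into the end-of-run time-step counter in namrun.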
Going through this namelist_cfg you will find all the input files being used, with their names and paths. In this namelist_cfg file, ./ is Valerie's EXP folder = /work/n01/n01/valegu/CARIBBEAN_NEMO_4_0_4/nemo/cfgs/Caribbean/EXP_FullOcean
You should link all of the files and folders it uses from Valerie's directory, so that you "have them" in your $EXP.
In the end you should have the following files as links in your $EXP:
domain_cfg.nc = This has both the vertical and horizontal grid information
coordinates.bdy.nc = This has the open boundary information
And these folders:
IC = Initial Conditions (not really used if restarting)
restarts = Restart files are grabbed from here and usually also stored here
ERA5_NEW = This is the atmospheric forcing
BDY_COPERNICUS = Open boundary forcing
TIDE = Tidal forcing
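The linking step can be sketched as below (the path is Valerie's EXP folder from the text; the file and folder names are the ones listed above — adjust if your namelist differs). Run it inside your $EXP directory:

```shell
# Link Valerie's input files and forcing folders into the current directory.
# ln -s creates the links even before you can read the targets; -f overwrites
# any stale links from a previous attempt. Verify afterwards with ls -l.
VAL=/work/n01/n01/valegu/CARIBBEAN_NEMO_4_0_4/nemo/cfgs/Caribbean/EXP_FullOcean
ln -sf $VAL/domain_cfg.nc $VAL/coordinates.bdy.nc .
for d in IC restarts ERA5_NEW BDY_COPERNICUS TIDE; do
    ln -sf $VAL/$d .
done
```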
Then you need a runscript. Runscript generation is rather complicated on ARCHER2; you can read about it here: https://docs.archer2.ac.uk/research-software/nemo/nemo/#building-a-run-script. Basically you use another script called mkslurm to generate it. At this point you define how many nodes/cores you will use for running NEMO and how many for XIOS, the spacing and idle cores, etc.
For now let's work with Valerie's runscript:
cp /work/n01/n01/valegu/CARIBBEAN_NEMO_4_0_4/nemo/cfgs/Caribbean/EXP_FullOcean/runscript.slurm $EXP
It is a text file; you can edit the top part to define the computer time you are requesting, your budget code, etc. For now just change Val's email to yours for notifications.
To submit your run you do: sbatch runscript.slurm
To monitor your jobs on the supercomputer: squeue -u $USER
Since there is not a huge queue at the moment you should be able to see your job changing from PD = pending to R = running, and after some time netCDF output files should appear in your $EXP folder, as well as other files generated at runtime. Do:
ls -ltrh $EXP
To see your new files (-l = long, -tr = reverse time order, -h = human-readable units, MB/GB).
The most interesting file is ocean.output, the run log; this is where you look for E R R O R S (but none are expected here!). solver.stat is a one-line summary per time step that you can use for sanity checking the run. There will also be a slurm-89364.out; that number matches your job number on the supercomputer. This is the job log, where nasty things like segmentation faults show up at the end of the file. But the most common reason runs end before finishing is that the requested CPU time is exceeded (the full run you defined in namelist_cfg didn't finish in the time you requested in the runscript).
Now you have to rebuild your netCDF output files, transfer them to livljobs and analyse away!
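The log checks above can be sketched as follows (a sketch only, assuming you run it inside $EXP with the file names described above):

```shell
# Flag any 'E R R O R' lines in the NEMO run log; the 2>/dev/null keeps the
# check quiet if ocean.output has not been produced at all.
if grep -q 'E R R O R' ocean.output 2>/dev/null; then
    echo "errors found in ocean.output"
else
    echo "no errors flagged"
fi
tail -n 3 solver.stat 2>/dev/null   # last few time-step summaries, if present
```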