lizzieinvancouver / temporalvar


New batch of "regular" runs #27

Closed · donahuem closed this issue 5 years ago

donahuem commented 6 years ago

Sent a new set of regular runs. Each job is a 50-task array; each task is 200 runs. Job IDs: 54202087, 54202148, 54202210, 54202241

Megan should check on these in the morning to make sure they are running as expected.
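For reference, a minimal sketch of what one of these 50-task array jobs could look like as a SLURM script. The script name, the `Phenology.R` arguments, and the resource values are assumptions for illustration, not the repo's actual submission script.

```bash
#!/bin/bash
# Hypothetical submission script (e.g. runPhenology.sbatch); names and
# arguments are illustrative, not the repo's actual script.
#SBATCH --job-name=regruns
#SBATCH --array=1-50        # one job = a 50-task array
#SBATCH --time=04:00:00
#SBATCH --mem=1000          # MB

# Each task performs a fixed number of model runs (200 in this batch);
# the --task/--nruns flags are assumed, not Phenology.R's real interface.
Rscript Phenology.R --task "${SLURM_ARRAY_TASK_ID}" --nruns 200
```

Submitting a script like this four times with `sbatch` would give the four job IDs listed above.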

lizzieinvancouver commented 6 years ago

Don't pull .... we'll set up some new ones.

donahuem commented 5 years ago

Sent a new set of regular runs. Each job is a 50-task array; each task is 100 runs. Job IDs: 58137334, 58137380, 58137427, 58137498

donahuem commented 5 years ago

These ran just fine as indicated by std.out, which records all 100 runs in each file. However, only a subset of runs was written to summaryOut, BfinN, etc. I assume that scratch ran out of space? Because warnings were not being written out, I'm not sure.
Solutions: write out warnings (options(warn=1) added to Phenology.R) and turn writeBout to 0 for now. We can go back and rerun for within-year dynamics if it's critical.

donahuem commented 5 years ago

New set of runs: 4 jobs, each a 50-task array with 100 runs per task.

Job IDs: 58378409, 58378458, 58378477, 58378492

donahuem commented 5 years ago

@lizzieinvancouver You can pull these runs. Note that the number of runs per task ranges from 75 to 100 because of memory issues.

donahuem commented 5 years ago

About 85% of the runs in each task were saved. I think I have exceeded the memory allocation in the jobs. The runs vary in elapsed time from ~1:30 to 3:30, but all have MaxRSS of roughly 100 MB. While I thought I was asking for 1000 MB per run, 100 MB is the default. I think the white space in my SLURM job script might mean that the additional specs (i.e., mem=1000) were not included in the call, and we got the default memory. Just a guess; testing this by sending another batch.
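As a sketch of the whitespace hypothesis only (the actual job script is not shown in this thread): sbatch recognizes a directive only when the line starts with `#SBATCH` exactly, so stray whitespace turns it into an ordinary comment and the cluster's default allocation applies.

```bash
#!/bin/bash
# Illustration of the whitespace hypothesis; not the repo's actual job script.
# The two variants below read as plain comments, so sbatch skips them and the
# job falls back to the default memory:
  #SBATCH --mem=1000
# SBATCH --mem=1000
# The recognized form starts at column 1 with no space after the "#":
#SBATCH --mem=1000

Rscript Phenology.R
```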

donahuem commented 5 years ago

Next batch of runs! Job IDs: 58550919, 58550988, 58551048, 58551123

donahuem commented 5 years ago

These runs are complete. Not all the runs were saved (~85%). I checked ReqMem (requested memory) and it is 1000Mn (1000 MB per node) for these and all preceding jobs. So much for that idea.
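For the record, a check along these lines (standard `sacct` fields; the job ID is one from this batch) is how requested vs. peak memory can be compared per task:

```bash
# Compare requested memory (ReqMem) with peak usage (MaxRSS) and elapsed time
# for each task of one job; "1000Mn" in ReqMem means 1000 MB per node.
sacct -j 58550919 --format=JobID,Elapsed,ReqMem,MaxRSS,State
```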

donahuem commented 5 years ago

I had assumed that scratch was separate for each node and named the scratch folder by the jobID alone. If scratch is shared, then the first task to finish might be moving all the files in that job's folder, including output from tasks that are still running. Instead, name the folder by jobID-taskID (sketch below). Rerunning:

Submitted batch job 58581670
Submitted batch job 58581688
Submitted batch job 58581745
Submitted batch job 58581780
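A minimal sketch of the per-task naming, assuming the job script builds its scratch path from SLURM's array environment variables (the actual paths, output files, and copy-back step in the repo's script may differ):

```bash
#!/bin/bash
#SBATCH --array=1-50
#SBATCH --mem=1000

# Name scratch by jobID-taskID so a task that finishes first cannot move or
# delete files that other tasks of the same job are still writing.
SCRATCHDIR="/scratch/${SLURM_ARRAY_JOB_ID}-${SLURM_ARRAY_TASK_ID}"
mkdir -p "$SCRATCHDIR"
cd "$SCRATCHDIR"

Rscript /path/to/Phenology.R   # illustrative call; real arguments may differ

# Copy only this task's results back, then remove only this task's folder.
cp ./* /path/to/results/ 2>/dev/null
cd /
rm -rf "$SCRATCHDIR"
```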

donahuem commented 5 years ago

These ran all the way through.