Open branfosj opened 2 years ago
It also fails with `max_jobs` set to 2000, which is more than the number of jobs being run. The first failure was also seen at the same test number (1015):
[2021-10-22T15:25:26] info: reframe: [ OK ] (1014/1326) 2020a-gompi-osu-mpi-bear-pg0104u32a-bear-pg0104u34b on bluebear:icelake using none [compile: 0.007s run: 209.231s total: 209.285s]
[2021-10-22T15:25:26] verbose: reframe: ==> timings: setup: 0.032s compile: 0.007s run: 209.231s sanity: 0.020s performance: 0.022s total: 209.285s
[2021-10-22T15:25:26] info: reframe: [ FAIL ] (1015/1326) 2020a-gompi-osu-mpi-bear-pg0104u26b-bear-pg0104u29a on bluebear:icelake using none [compile: 0.007s run: 229.943s total: 229.988s]
[2021-10-22T15:25:26] info: reframe: ==> test failed during 'sanity': test staged in '/rds/projects/2017/branfosj-rse/BEAR-git/reframe/stage/bluebear/icelake/none/2020a-gompi-osu-mpi-bear-pg0104u26b-bear-pg0104u29a'
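Failing at the same test number regardless of `max_jobs` would be consistent with open file descriptors accumulating as tests complete, although nothing in this thread confirms the cause. A minimal sketch for watching the descriptor count of a Python process is given here. It assumes the system exposes `/proc/self/fd` or `/dev/fd`, and `count_open_fds` is a hypothetical helper, not part of ReFrame.

```python
import os


def count_open_fds():
    """Return the number of file descriptors open in this process.

    Hypothetical diagnostic helper. Listing the directory briefly opens one
    descriptor of its own, so compare successive counts rather than relying
    on the absolute value.
    """
    for fd_dir in ("/proc/self/fd", "/dev/fd"):
        try:
            return len(os.listdir(fd_dir))
        except OSError:
            # Directory missing or unreadable on this platform: try the next one.
            continue
    raise RuntimeError("no per-process fd directory available")


if __name__ == "__main__":
    print("open descriptors:", count_open_fds())
```

Logging this value periodically and seeing it climb towards the `ulimit -n` value would point at descriptors not being released.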
Could you check what the open-file limits are on your test system?
@branfosj you can get it with `ulimit -n`. I suspect it's 1024.
@teojgo Yes, it is:
$ ulimit -n
1024
@branfosj could you try increasing it? Check your hard limit with `ulimit -H -n` and raise the soft limit up to it with `ulimit -n <new_limit>`.
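The same check-then-raise step can be sketched from inside a Python process with the standard `resource` module. This is only an illustration: `raise_nofile_soft_limit` is a hypothetical helper, not something ReFrame provides, and like `ulimit -n` it affects only the current process and its children.

```python
import resource


def raise_nofile_soft_limit(target=None):
    """Raise the soft open-file limit towards `target`, capped at the hard limit.

    With target=None the soft limit is raised to the hard limit. The limit is
    never lowered. Returns the (soft, hard) pair in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    inf = resource.RLIM_INFINITY
    if soft == inf:
        return soft, hard  # already unlimited, nothing to do
    if target is None:
        wanted = hard
    elif hard == inf:
        wanted = target
    else:
        wanted = min(target, hard)
    if wanted == inf or wanted <= soft:
        return soft, hard  # would not raise the limit, leave it unchanged
    resource.setrlimit(resource.RLIMIT_NOFILE, (wanted, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    print("before:", resource.getrlimit(resource.RLIMIT_NOFILE))
    print("after: ", raise_nofile_soft_limit(2000))
```

An unprivileged process can raise its soft limit only as far as the hard limit, which is why the hard limit is checked first.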
@branfosj Do your tests have dependencies?
I've set off a run with an increased `ulimit -n`.
None of my tests have dependencies.
With `ulimit -n 2000` my tests complete successfully.
I hit:
OSError: [Errno 24] Too many open files
when running lots of quite short tests. Total tests: 1326, each of about 30 seconds. I'm doing pairwise network tests on a whole rack of new nodes, with `max_jobs` set to 20. In the log I see:
So we fail at that point and the last 300 or so tests are not run.
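For reference, `Errno 24` is `EMFILE`: the process has reached its per-process open-file limit. The sketch below reproduces the error in isolation by temporarily lowering the soft limit and opening files without closing them. It illustrates the error only and makes no claim about where ReFrame holds its descriptors; `opens_until_exhausted` is a hypothetical name.

```python
import errno
import os
import resource


def opens_until_exhausted(soft_limit=64):
    """Open files without closing them until the OS refuses.

    Temporarily lowers the soft RLIMIT_NOFILE to `soft_limit`, then restores it.
    Returns (number_of_successful_opens, errno_of_failure_or_None).
    """
    old_soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        soft_limit = min(soft_limit, hard)
    resource.setrlimit(resource.RLIMIT_NOFILE, (soft_limit, hard))
    held = []
    seen = None
    try:
        # Descriptors already open (stdin, stdout, ...) count against the
        # limit, so the failure arrives before soft_limit new opens.
        for _ in range(soft_limit + 1):
            held.append(os.open(os.devnull, os.O_RDONLY))
    except OSError as exc:
        seen = exc.errno
    finally:
        for fd in held:
            os.close(fd)
        resource.setrlimit(resource.RLIMIT_NOFILE, (old_soft, hard))
    return len(held), seen


if __name__ == "__main__":
    count, err = opens_until_exhausted()
    print(count, "opens succeeded; failure:", errno.errorcode.get(err))
```

Because descriptors that are already open count against the limit, the failure arrives slightly before the nominal limit is reached.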