reframe-hpc / reframe

A powerful Python framework for writing and running portable regression tests and benchmarks for HPC systems.
https://reframe-hpc.readthedocs.org
BSD 3-Clause "New" or "Revised" License

Too many open files #2233

Open branfosj opened 2 years ago

branfosj commented 2 years ago

I hit OSError: [Errno 24] Too many open files when running a large number of fairly short tests.

Total tests: 1326, each taking about 30 seconds. I'm doing pairwise network tests on a whole rack of new nodes, with max_jobs set to 20.

In the log I see:

[2021-10-22T13:19:37] verbose: reframe: ==> timings: setup: 0.013s compile: 0.017s run: 18.778s sanity: 0.015s performance: 0.024s total: 1625.969s
[2021-10-22T13:19:37] info: reframe: [     FAIL ] (1015/1326) 2020a-gompi-osu-mpi-bear-pg0104u07b-bear-pg0104u09a on bluebear:icelake using none [compile: 0.016s run: 46.224s total: 1623.775s]
[2021-10-22T13:19:37] info: reframe: ==> test failed during 'sanity': test staged in '/rds/projects/2017/branfosj-rse/BEAR-git/reframe/stage/bluebear/icelake/none/2020a-gompi-osu-mpi-bear-pg0104u07b-bear-pg0104u09a'
[2021-10-22T13:19:37] verbose: reframe: ==> timings: setup: 0.013s compile: 0.016s run: 46.224s sanity: n/a performance: n/a total: 1623.775s
[2021-10-22T13:19:37] info: reframe: [     FAIL ] (1016/1326) 2020a-gompi-osu-mpi-bear-pg0104u07b-bear-pg0104u10a on bluebear:icelake using none [compile: 0.018s run: 47.207s total: 1623.224s]
[2021-10-22T13:19:37] info: reframe: ==> test failed during 'sanity': test staged in '/rds/projects/2017/branfosj-rse/BEAR-git/reframe/stage/bluebear/icelake/none/2020a-gompi-osu-mpi-bear-pg0104u07b-bear-pg0104u10a'
[2021-10-22T13:19:37] verbose: reframe: ==> timings: setup: 0.012s compile: 0.018s run: 47.207s sanity: n/a performance: n/a total: 1623.224s
[2021-10-22T13:19:37] info: reframe: [     FAIL ] (1017/1326) 2020a-gompi-osu-mpi-bear-pg0104u07b-bear-pg0104u11b on bluebear:icelake using none [compile: 0.019s run: 48.527s total: 1622.391s]
[2021-10-22T13:19:37] info: reframe: ==> test failed during 'sanity': test staged in '/rds/projects/2017/branfosj-rse/BEAR-git/reframe/stage/bluebear/icelake/none/2020a-gompi-osu-mpi-bear-pg0104u07b-bear-pg0104u11b'
[2021-10-22T13:19:37] verbose: reframe: ==> timings: setup: 0.012s compile: 0.019s run: 48.527s sanity: n/a performance: n/a total: 1622.391s
[2021-10-22T13:19:37] info: reframe: [     FAIL ] (1018/1326) 2020a-gompi-osu-mpi-bear-pg0104u07a-bear-pg0104u23a on bluebear:icelake using none [compile: n/a run: n/a total: 0.013s]
[2021-10-22T13:19:37] info: reframe: ==> test failed during 'compile': test staged in '/rds/projects/2017/branfosj-rse/BEAR-git/reframe/stage/bluebear/icelake/none/2020a-gompi-osu-mpi-bear-pg0104u07a-bear-pg0104u23a'
[2021-10-22T13:19:37] verbose: reframe: ==> timings: setup: 0.013s compile: n/a run: n/a sanity: n/a performance: n/a total: 0.013s
[2021-10-22T13:19:37] info: reframe: [     FAIL ] (1019/1326) 2020a-gompi-osu-mpi-bear-pg0104u07a-bear-pg0104u20b on bluebear:icelake using none [compile: n/a run: n/a total: 0.013s]
[2021-10-22T13:19:37] info: reframe: ==> test failed during 'compile': test staged in '/rds/projects/2017/branfosj-rse/BEAR-git/reframe/stage/bluebear/icelake/none/2020a-gompi-osu-mpi-bear-pg0104u07a-bear-pg0104u20b'
[2021-10-22T13:19:37] verbose: reframe: ==> timings: setup: 0.013s compile: n/a run: n/a sanity: n/a performance: n/a total: 0.013s
[2021-10-22T13:19:37] info: reframe: [  FAILED  ] Ran 1019/1326 test case(s) from 1326 check(s) (5 failure(s), 0 skipped)
[2021-10-22T13:19:37] info: reframe: [==========] Finished on Fri Oct 22 13:19:37 2021
[2021-10-22T13:19:37] info: reframe: ==============================================================================
[2021-10-22T13:19:37] info: reframe: SUMMARY OF FAILURES
[2021-10-22T13:19:38] info: reframe: ------------------------------------------------------------------------------

So the run fails at that point and the last 300 or so tests are not run.

branfosj commented 2 years ago

It also fails with max_jobs set to 2000, which is more than the number of jobs being run. The first failure again occurred at the same test number (1015):

[2021-10-22T15:25:26] info: reframe: [       OK ] (1014/1326) 2020a-gompi-osu-mpi-bear-pg0104u32a-bear-pg0104u34b on bluebear:icelake using none [compile: 0.007s run: 209.231s total: 209.285s]
[2021-10-22T15:25:26] verbose: reframe: ==> timings: setup: 0.032s compile: 0.007s run: 209.231s sanity: 0.020s performance: 0.022s total: 209.285s
[2021-10-22T15:25:26] info: reframe: [     FAIL ] (1015/1326) 2020a-gompi-osu-mpi-bear-pg0104u26b-bear-pg0104u29a on bluebear:icelake using none [compile: 0.007s run: 229.943s total: 229.988s]
[2021-10-22T15:25:26] info: reframe: ==> test failed during 'sanity': test staged in '/rds/projects/2017/branfosj-rse/BEAR-git/reframe/stage/bluebear/icelake/none/2020a-gompi-osu-mpi-bear-pg0104u26b-bear-pg0104u29a'

vkarak commented 2 years ago

Could you check what the open-file limits are on your test system?

teojgo commented 2 years ago

@branfosj you can get it with ulimit -n. I suspect it's 1024.

branfosj commented 2 years ago

@teojgo Yes, it is:

$ ulimit -n
1024

teojgo commented 2 years ago

@branfosj could you try increasing it? Check your hard limit with ulimit -H -n, then raise the soft limit up to it with ulimit -n <new_limit>.
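
For anyone who prefers to do the same thing from inside a Python process rather than in the shell, here is a minimal sketch using the standard-library resource module. The helper name raise_nofile_limit is made up for illustration and is not part of ReFrame; the sketch only assumes a Unix-like system where RLIMIT_NOFILE is available.

```python
import resource


def raise_nofile_limit(target):
    """Raise the soft open-file limit towards `target`, capped at the hard limit.

    Hypothetical helper for illustration; returns the (soft, hard) pair
    in effect after the call.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

    # An unprivileged process cannot raise its soft limit above the hard limit.
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)

    # Leave the limit alone if it is already unlimited or high enough.
    if soft == resource.RLIM_INFINITY or soft >= target:
        return soft, hard

    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)
```

Calling raise_nofile_limit(2000) early in a script is roughly equivalent to running ulimit -n 2000 in the launching shell, except that it only affects that process and the children it starts afterwards.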

vkarak commented 2 years ago

@branfosj Do your tests have dependencies?

branfosj commented 2 years ago

I've set off a run with an increased ulimit -n.

None of my tests have dependencies.

branfosj commented 2 years ago

With ulimit -n 2000 my tests complete successfully.
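
The first failure at test 1015 against a limit of 1024 would be consistent with open descriptors accumulating over the run, although nothing in this thread confirms the cause. One way to check is to watch how many descriptors a process holds. The sketch below is Linux-only (it relies on /proc), and the helper name count_open_fds is hypothetical.

```python
import os


def count_open_fds(pid="self"):
    """Count the open file descriptors of a process via /proc (Linux only).

    `pid` may be a process ID or "self" for the calling process. When
    inspecting the calling process, the count includes the descriptor
    that is temporarily opened to list the directory.
    """
    return len(os.listdir(f"/proc/{pid}/fd"))
```

Sampling count_open_fds(<pid>) for the running reframe process at intervals and comparing it with ulimit -n should show whether the count climbs towards the limit as tests complete.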