mabarnes / moment_kinetics

Fix anyv parallelisation #222

Closed johnomotani closed 3 months ago

johnomotani commented 3 months ago

Fixes #221.

mrhardman commented 3 months ago

Thanks for looking at this!

Just checking this PR by running

mpirun -n 8 julia -O3 --project -Jmoment_kinetics.so test_scripts/2D_FEM_assembly_test.jl

with the main block of the script modified to run something very quick

    #run_assembly_test() # to ensure routines are compiled before plots are made
    run_assembly_test(ngrid=3,nelement_list=[8,16,32],plot_scan=true)
    #run_assembly_test(ngrid=5,nelement_list=[4,8,16,32,64],plot_scan=true)
    #run_assembly_test(ngrid=7,nelement_list=[2,4,8,16,32],plot_scan=true)
    #run_assembly_test(ngrid=9,nelement_list=[2,4,8,16],plot_scan=true)

I get print output suggesting that the bug found in the process of looking at #221 is still present.

fkpl_C_G_H_max_test_ngrid_3_GLL.pdf
[[NaN, NaN, 1.0991923347711973e278], [NaN, NaN, 4.939750792114682e278], [NaN, NaN, 3.113470445132266e280], [0.015625, 0.00390625, 0.0009765625], [0.000244140625, 1.52587890625e-5, 9.5367431640625e-7]]
fkpl_coeffs_max_test_ngrid_3_GLL.pdf
[[NaN, NaN, 0.053658820415754704], [NaN, NaN, 2.2676678031144188e252], [NaN, NaN, 1.7546023811400415e278], [NaN, NaN, 1.737347725479605e278], [NaN, NaN, 1.5318408070550807e278], [NaN, NaN, 1.0534565067488919e278], [0.015625, 0.00390625, 0.0009765625], [0.000244140625, 1.52587890625e-5, 9.5367431640625e-7]]
fkpl_C_G_H_L2_test_ngrid_3_GLL.pdf
[[NaN, NaN, Inf], [NaN, NaN, Inf], [NaN, NaN, Inf], [0.015625, 0.00390625, 0.0009765625], [0.000244140625, 1.52587890625e-5, 9.5367431640625e-7]]
fkpl_coeffs_L2_test_ngrid_3_GLL.pdf
[[NaN, NaN, 0.010303082994241385], [NaN, NaN, Inf], [NaN, NaN, Inf], [NaN, NaN, Inf], [NaN, NaN, Inf], [NaN, NaN, Inf], [0.015625, 0.00390625, 0.0009765625], [0.000244140625, 1.52587890625e-5, 9.5367431640625e-7]]
fkpl_conservation_test_ngrid_3_GLL.pdf
[[NaN, NaN, 1.0991923347711973e278], [NaN, NaN, Inf], [NaN, NaN, 3.394497640254212e277], [NaN, NaN, 1.9757152855550637e277], [NaN, NaN, 5.652811699224826e276], [0.015625, 0.00390625, 0.0009765625], [0.000244140625, 1.52587890625e-5, 9.5367431640625e-7]]
fkpl_timing_test_ngrid_3_GLL.pdf
[[56929.0, 65.0, 144.0], [116.0, 944.0, 4156.0], [64.0, 256.0, 1024.0], [512.0, 4096.0, 32768.0]]

mrhardman commented 3 months ago

After fixing moment_kinetics/test/fokker_planck_tests.jl to run standalone, I get the following error output with the command mpirun -n 8 julia -O3 --project -Jmoment_kinetics.so -e 'include("moment_kinetics/test/fokker_planck_tests.jl")'

Error During Test at Error During Test at Error During Test at /excalibur/moment_kinetics_collisions/moment_kinetics/test/fokker_planck_tests.jl:214
  Got exception outside of a @test
  BoundsError: attempt to access 0×0 Matrix{Float64} at index [33, 17]
  Stacktrace:
    [1] getindex
      @ ./essentials.jl:14 [inlined]
    [2] macro expansion
      @ ~/excalibur/moment_kinetics_collisions/moment_kinetics/test/fokker_planck_tests.jl:284 [inlined]
    [3] macro expansion
      @ ~/excalibur/moment_kinetics_collisions/moment_kinetics/src/looping.jl:677 [inlined]
    [4] macro expansion
      @ ~/excalibur/moment_kinetics_collisions/moment_kinetics/test/fokker_planck_tests.jl:283 [inlined]
    [5] macro expansion
      @ /network/software/linux-x86_64/julia/1.10.2/share/julia/stdlib/v1.10/Test/src/Test.jl:1577 [inlined]
    [6] macro expansion
      @ ~/excalibur/moment_kinetics_collisions/moment_kinetics/test/fokker_planck_tests.jl:215 [inlined]
    [7] macro expansion
      @ /network/software/linux-x86_64/julia/1.10.2/share/julia/stdlib/v1.10/Test/src/Test.jl:1577 [inlined]
    [8] runtests()
      @ Main.FokkerPlanckTests ~/excalibur/moment_kinetics_collisions/moment_kinetics/test/fokker_planck_tests.jl:73
    [9] top-level scope
      @ ~/excalibur/moment_kinetics_collisions/moment_kinetics/test/fokker_planck_tests.jl:725
   [10] include(fname::String)
      @ Base.MainInclude ./client.jl:489
   [11] top-level scope
      @ none:1
   [12] eval
      @ ./boot.jl:385 [inlined]
   [13] exec_options(opts::Base.JLOptions)
      @ Base ./client.jl:291
   [14] _start()
      @ Base ./client.jl:552
 - test weak-form collision operator calculation
/excalibur/moment_kinetics_collisions/moment_kinetics/test/fokker_planck_tests.jl:214
  Got exception outside of a @test
  BoundsError: attempt to access 0×0 Matrix{Float64} at index [49, 17]
  Stacktrace:
    [1] getindex
      @ ./essentials.jl:14 [inlined]
    [2] macro expansion
      @ ~/excalibur/moment_kinetics_collisions/moment_kinetics/test/fokker_planck_tests.jl:284 [inlined]
    [3] macro expansion
      @ ~/excalibur/moment_kinetics_collisions/moment_kinetics/src/looping.jl:677 [inlined]
    [4] macro expansion
      @ ~/excalibur/moment_kinetics_collisions/moment_kinetics/test/fokker_planck_tests.jl:283 [inlined]
    [5] macro expansion
      @ /network/software/linux-x86_64/julia/1.10.2/share/julia/stdlib/v1.10/Test/src/Test.jl:1577 [inlined]
    [6] macro expansion
      @ ~/excalibur/moment_kinetics_collisions/moment_kinetics/test/fokker_planck_tests.jl:215 [inlined]
    [7] macro expansion
      @ /network/software/linux-x86_64/julia/1.10.2/share/julia/stdlib/v1.10/Test/src/Test.jl:1577 [inlined]
    [8] runtests()
      @ Main.FokkerPlanckTests ~/excalibur/moment_kinetics_collisions/moment_kinetics/test/fokker_planck_tests.jl:73
    [9] top-level scope
      @ ~/excalibur/moment_kinetics_collisions/moment_kinetics/test/fokker_planck_tests.jl:725
   [10] include(fname::String)
      @ Base.MainInclude ./client.jl:489
   [11] top-level scope
      @ none:1
   [12] eval
      @ ./boot.jl:385 [inlined]
   [13] exec_options(opts::Base.JLOptions)
      @ Base ./client.jl:291
   [14] _start()
      @ Base ./client.jl:552
 - test weak-form collision operator calculation
/excalibur/moment_kinetics_collisions/moment_kinetics/test/fokker_planck_tests.jl:214
  Got exception outside of a @test
  BoundsError: attempt to access 0×0 Matrix{Float64} at index [17, 17]
  Stacktrace:
    [1] getindex
      @ ./essentials.jl:14 [inlined]
    [2] macro expansion
      @ ~/excalibur/moment_kinetics_collisions/moment_kinetics/test/fokker_planck_tests.jl:284 [inlined]
    [3] macro expansion
      @ ~/excalibur/moment_kinetics_collisions/moment_kinetics/src/looping.jl:677 [inlined]
    [4] macro expansion
      @ ~/excalibur/moment_kinetics_collisions/moment_kinetics/test/fokker_planck_tests.jl:283 [inlined]
    [5] macro expansion
      @ /network/software/linux-x86_64/julia/1.10.2/share/julia/stdlib/v1.10/Test/src/Test.jl:1577 [inlined]
    [6] macro expansion
      @ ~/excalibur/moment_kinetics_collisions/moment_kinetics/test/fokker_planck_tests.jl:215 [inlined]
    [7] macro expansion
      @ /network/software/linux-x86_64/julia/1.10.2/share/julia/stdlib/v1.10/Test/src/Test.jl:1577 [inlined]
    [8] runtests()
      @ Main.FokkerPlanckTests ~/excalibur/moment_kinetics_collisions/moment_kinetics/test/fokker_planck_tests.jl:73
    [9] top-level scope
      @ ~/excalibur/moment_kinetics_collisions/moment_kinetics/test/fokker_planck_tests.jl:725
   [10] include(fname::String)
      @ Base.MainInclude ./client.jl:489
   [11] top-level scope
      @ none:1
   [12] eval
      @ ./boot.jl:385 [inlined]
   [13] exec_options(opts::Base.JLOptions)
      @ Base ./client.jl:291
   [14] _start()
      @ Base ./client.jl:552

johnomotani commented 3 months ago

Sorry, I introduced a bug in test_scripts/2D_FEM_assembly_test.jl because I assumed some arrays were shared-memory when they weren't. Should be fixed by 62e5c92.

The fokker_planck_tests.jl failures were caused by a couple of other small bugs in the anyv parallelisation, fixed in 23372f5 and f57f3a7.

mrhardman commented 3 months ago

I confirm that your latest fixes remove the bugs (tested the same commands as above, on up to 8 cores). Thank you!

What is the maximum core count that we can use in the check-in tests? I think there is a Fokker-Planck test that we are missing: run a very low resolution case with "too many" cores, and check that the parallelised version gives the same result as on 1 core. If we use ngrid = 3 and nelement_vpa = 4, nelement_vperp = 2, we should need only a few cores to reach this limit.
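
To make the idea concrete, here is a minimal sketch of that kind of check, emulated serially so it runs without MPI. The helpers `local_range` and `chunked_sum` are hypothetical names for illustration only and are not part of moment_kinetics; the real test would compare the collision operator output from an `mpirun` job against a 1-core run.

```julia
# Hypothetical helper (not from moment_kinetics): the contiguous block of 1:n
# owned by `rank` (0-based) out of `nranks`. When nranks > n, the trailing
# ranks receive an empty range.
function local_range(n::Int, nranks::Int, rank::Int)
    base, extra = divrem(n, nranks)
    start = rank * base + min(rank, extra) + 1
    stop = start + base - 1 + (rank < extra ? 1 : 0)
    return start:stop
end

# Emulate a parallel reduction serially: each "rank" sums only its own block,
# then the partial sums are combined.
function chunked_sum(f, n::Int, nranks::Int)
    partial = zeros(nranks)
    for rank in 0:nranks-1
        for i in local_range(n, nranks, rank)
            partial[rank + 1] += f(i)
        end
    end
    return sum(partial)
end

n = 5                 # deliberately tiny problem
f(i) = 1.0 / i
serial = chunked_sum(f, n, 1)

# 8 "ranks" for 5 points means some ranks own nothing; the result must still
# agree with the single-rank answer to within rounding error.
for nranks in (2, 4, 8)
    @assert isapprox(chunked_sum(f, n, nranks), serial; rtol=1e-13)
end
```

The point of choosing a rank count larger than the number of work items is that it exercises the empty-range code path, which is where this kind of bug tends to hide.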

In addition, I wonder if you would be open to another series of tests to check that the test scripts remain functional? I could try to add this myself as an exercise.

johnomotani commented 3 months ago

> What is the maximum core count that we can use in the check-in tests? I think there is a Fokker-Planck test that we are missing: run a very low resolution case with "too many" cores, and check that the parallelised version gives the same result as on 1 core. If we use ngrid = 3 and nelement_vpa = 4, nelement_vperp = 2, we should need only a few cores to reach this limit.

Last time I checked, the CI servers had 2 physical cores. We can probably use as many processes as we like by oversubscribing those cores, but the tests will likely get slower with more processes, and I wouldn't be surprised if we ran out of memory. The CI currently runs some jobs on 1, 2, 3, or 4 cores. We could run some selected tests on more, but I wouldn't want to run the whole test suite.

> In addition, I wonder if you would be open to another series of tests to check that the test scripts remain functional? I could try to add this myself as an exercise.

That would be very welcome!