nv-legate / cunumeric

An Aspiring Drop-In Replacement for NumPy at Scale
https://docs.nvidia.com/cunumeric/24.06/
Apache License 2.0

[BUG] gemm example fails with problem size that does not fit in single memory of a single gpu #1125

Open dmargala opened 7 months ago

dmargala commented 7 months ago

Software versions

Python      :  3.11.7 | packaged by conda-forge | (main, Dec 23 2023, 14:43:09) [GCC 12.3.0]
Platform    :  Linux-5.14.21-150400.24.81_12.0.87-cray_shasta_c-x86_64-with-glibc2.31
Legion      :  v24.01.00.dev-33-g1d0265c
Legate      :  24.01.00.dev+33.g1d0265c
WARNING: Disabling control replication for interactive run
Disable Control Replication
Cunumeric   :  24.01.00.dev+16.gb0738142
Numpy       :  1.26.4
Scipy       :  1.12.0
Numba       :  0.59.0
CTK package :  cuda-version-12.2-he2b69de_2 (conda-forge)
GPU driver  :  525.105.17
GPU devices :
  GPU 0: NVIDIA A100-SXM4-80GB
  GPU 1: NVIDIA A100-SXM4-80GB
  GPU 2: NVIDIA A100-SXM4-80GB
  GPU 3: NVIDIA A100-SXM4-80GB

Jupyter notebook / Jupyter Lab version

No response

Expected behavior

I'm trying to run a cunumeric example that uses arrays too large to fit in the memory of a single GPU. I'm starting with the gemm example, which runs without error for problem sizes below the single-GPU memory limit but fails once I try to scale beyond one GPU. For example, this works fine (with a framebuffer memory size of 36250 per GPU, set via --fbmem):

INTERACTIVE=1 ../quickstart/run.sh 2 examples/gemm.py -n 50000
...
Problem Size:     M=50000 N=50000 K=50000
Total Iterations: 100
Total Flops:      249997.5 GFLOPS/iter
Total Size:       30000.0 MB
Elapsed Time:     203771.428 ms
Average GEMM:     2037.7142800000001 ms
FLOPS/s:          122685.25693405847 GFLOPS/s
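
For scale, here is a back-of-envelope footprint estimate (my own sketch, not part of the example; it assumes three square float32 matrices, which matches the 30000.0 MB Total Size reported above for n=50000):

# Rough footprint estimate for the gemm example (hypothetical helper;
# assumes three n x n float32 matrices A, B, C).
def footprint_mb(n, ngpus, bytes_per_elem=4, nmatrices=3):
    total_mb = nmatrices * n * n * bytes_per_elem / 1e6
    return total_mb, total_mb / ngpus

for n in (50000, 60000):
    total_mb, per_gpu_mb = footprint_mb(n, ngpus=8)  # 2 nodes x 4 GPUs
    print(f"n={n}: total={total_mb:.0f} MB, per GPU={per_gpu_mb:.0f} MB")

This gives 30000 MB total (3750 MB per GPU) for n=50000 and 43200 MB total (5400 MB per GPU) for n=60000, both far below the configured 36250 per-GPU --fbmem, so the raw data should fit; the failure below looks like a mapping/pool issue rather than the arrays simply not fitting.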

Observed behavior

Increasing the problem size results in an error such as:

(legate) dmargala@perlmutter:login40:/pscratch/sd/d/dmargala/work/cunumeric> INTERACTIVE=1 ../quickstart/run.sh 2 examples/gemm.py -n 60000
...

[0 - 7f47a9e9b000]    2.473642 {5}{cunumeric.mapper}: Mapper cunumeric on Node 0 failed to allocate 3600000000 bytes on memory 1e00000000000003 (of kind GPU_FB_MEM: Framebuffer memory for one GPU and all its SMs) for region requirement 1 of Task cunumeric::MatMulTask[examples/gemm.py:52] (UID 188).
This means Legate was unable to reserve ouf of its memory pool the full amount required for the above operation. Here are some things to try:
* Make sure your code is not impeding the garbage collection of Legate-backed objects, e.g. by storing references in caches, or creating reference cycles.
* Ask Legate to reserve more space on the above memory, using the appropriate --*mem legate flag.
* Assign less memory to the eager pool, by reducing --eager-alloc-percentage.
* If running on multiple nodes, increase how often distributed garbage collection runs, by reducing LEGATE_FIELD_REUSE_FREQ (default: 32, warning: may incur overhead).
* Adapt your code to reduce temporary storage requirements, e.g. by breaking up larger operations into batches.
* If the previous steps don't help, and you are confident Legate should be able to handle your code's working set, please open an issue on Legate's bug tracker.
[0 - 7f47a9e9b000]    2.473667 {5}{legate}: Legate called abort in /pscratch/sd/d/dmargala/work/legate.core/src/core/mapping/base_mapper.cc at line 804 in function report_failed_mapping
Signal 6 received by node 0, process 626675 (thread 7f47a9e9b000) - obtaining backtrace
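
Of the suggestions in the error message, the knob that maps most directly onto this run is the eager pool size (the full command in the traceback section below shows -lg:eager_alloc_percentage 50 being passed through). A retry with a smaller eager pool would look something like this (sketch only, using the flag named in the error text and the flags from the quickstart-generated launch):

legate --launcher srun --nodes 2 --ranks-per-node 1 --cpus 1 --gpus 4 \
       --sysmem 4000 --fbmem 36250 --eager-alloc-percentage 10 \
       examples/gemm.py -n 60000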

Example code or instructions

I've set up my environment using the nv-legate/quickstart recipe for Perlmutter, and I'm using the quickstart run script to launch the example. For example:

INTERACTIVE=1 ../quickstart/run.sh 2 examples/gemm.py -n 60000
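
For reference, the heart of the example is a distributed matrix multiply. A minimal sketch of the same pattern (not the exact gemm.py, which also times repeated iterations; float32 and square shapes are assumptions based on the sizes reported above):

import cunumeric as np

n = 60000
# Three n x n float32 matrices, ~43 GB in total across the allocation.
A = np.ones((n, n), dtype=np.float32)
B = np.ones((n, n), dtype=np.float32)
C = A @ B  # launches cunumeric::MatMulTask, which is what fails to map here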

Stack traceback or browser console output

(legate) dmargala@perlmutter:login40:/pscratch/sd/d/dmargala/work/cunumeric> INTERACTIVE=1 ../quickstart/run.sh 2 examples/gemm.py -n 60000
Redirecting stdout, stderr and logs to /pscratch/sd/d/dmargala/2024/02/15/103417
Submitted: salloc -q interactive_ss11 -C gpu --gpus-per-node 4 --ntasks-per-node 1 -c 128 -J legate -A nstaff -t 60 -N 2 /pscratch/sd/d/dmargala/work/quickstart/legate.slurm legate --launcher srun --cpus 1 --sysmem 4000 --gpus 4 --fbmem 36250 --verbose --log-to-file --nodes 2 --ranks-per-node 1 examples/gemm.py -n 60000
salloc: Granted job allocation 21765749
salloc: Waiting for resource configuration
salloc: Nodes nid[200413,200416] are ready for job
Job ID: 21765749
Submitted from: /pscratch/sd/d/dmargala/work/cunumeric
Started on: Thu 15 Feb 2024 10:34:24 AM PST
Running on: nid[200413,200416]
Command: legate --logdir /pscratch/sd/d/dmargala/2024/02/15/103417 --launcher srun --cpus 1 --sysmem 4000 --gpus 4 --fbmem 36250 --verbose --log-to-file --nodes 2 --ranks-per-node 1 examples/gemm.py -n 60000

--- Legion Python Configuration ------------------------------------------------

Legate paths:
  legate_dir       : /pscratch/sd/d/dmargala/legate/lib/python3.11/site-packages
  legate_build_dir : None
  bind_sh_path     : /pscratch/sd/d/dmargala/legate/bin/bind.sh
  legate_lib_path  : /pscratch/sd/d/dmargala/legate/lib

Legion paths:
  legion_bin_path       : /pscratch/sd/d/dmargala/legate/bin
  legion_lib_path       : /pscratch/sd/d/dmargala/legate/lib
  realm_defines_h       : /pscratch/sd/d/dmargala/legate/include/realm_defines.h
  legion_defines_h      : /pscratch/sd/d/dmargala/legate/include/legion_defines.h
  legion_spy_py         : /pscratch/sd/d/dmargala/legate/bin/legion_spy.py
  legion_python         : /pscratch/sd/d/dmargala/legate/bin/legion_python
  legion_prof           : /pscratch/sd/d/dmargala/legate/bin/legion_prof
  legion_module         : /pscratch/sd/d/dmargala/legate/lib/python3.11/site-packages
  legion_jupyter_module : /pscratch/sd/d/dmargala/legate/lib/python3.11/site-packages

Versions:
  legate_version : 24.01.00.dev+33.g1d0265c

Command:
  srun -n 2 --ntasks-per-node 1 /pscratch/sd/d/dmargala/legate/bin/bind.sh --launcher srun -- /pscratch/sd/d/dmargala/legate/bin/legion_python -ll:py 1 -ll:gpu 4 -cuda:skipbusy -ll:util 2 -ll:bgwork 2 -ll:csize 4000 -ll:fsize 36250 -ll:zsize 32 -level openmp=5,gpu=5 -logfile /pscratch/sd/d/dmargala/2024/02/15/103417/legate_%.log -errlevel 4 -lg:eager_alloc_percentage 50 examples/gemm.py -n 60000

Customized Environment:
  CUTENSOR_LOG_LEVEL=1
  GASNET_MPI_THREAD=MPI_THREAD_MULTIPLE
  LEGATE_MAX_DIM=4
  LEGATE_MAX_FIELDS=256
  LEGATE_NEED_CUDA=1
  LEGATE_NEED_NETWORK=1
  NCCL_LAUNCH_MODE=PARALLEL
  PYTHONDONTWRITEBYTECODE=1
  PYTHONPATH=/pscratch/sd/d/dmargala/legate/lib/python3.11/site-packages:/pscratch/sd/d/dmargala/legate/lib/python3.11/site-packages
  REALM_BACKTRACE=1

--------------------------------------------------------------------------------

[0 - 7f47a9e9b000]    2.473642 {5}{cunumeric.mapper}: Mapper cunumeric on Node 0 failed to allocate 3600000000 bytes on memory 1e00000000000003 (of kind GPU_FB_MEM: Framebuffer memory for one GPU and all its SMs) for region requirement 1 of Task cunumeric::MatMulTask[examples/gemm.py:52] (UID 188).
This means Legate was unable to reserve ouf of its memory pool the full amount required for the above operation. Here are some things to try:
* Make sure your code is not impeding the garbage collection of Legate-backed objects, e.g. by storing references in caches, or creating reference cycles.
* Ask Legate to reserve more space on the above memory, using the appropriate --*mem legate flag.
* Assign less memory to the eager pool, by reducing --eager-alloc-percentage.
* If running on multiple nodes, increase how often distributed garbage collection runs, by reducing LEGATE_FIELD_REUSE_FREQ (default: 32, warning: may incur overhead).
* Adapt your code to reduce temporary storage requirements, e.g. by breaking up larger operations into batches.
* If the previous steps don't help, and you are confident Legate should be able to handle your code's working set, please open an issue on Legate's bug tracker.
[0 - 7f47a9e9b000]    2.473667 {5}{legate}: Legate called abort in /pscratch/sd/d/dmargala/work/legate.core/src/core/mapping/base_mapper.cc at line 804 in function report_failed_mapping
Signal 6 received by node 0, process 626675 (thread 7f47a9e9b000) - obtaining backtrace
[1 - 7fdd7c0b1000]    2.473897 {5}{cunumeric.mapper}: Mapper cunumeric on Node 1 failed to allocate 3600000000 bytes on memory 1e00010000000004 (of kind GPU_FB_MEM: Framebuffer memory for one GPU and all its SMs) for region requirement 1 of Task cunumeric::MatMulTask[examples/gemm.py:52] (UID 189).
This means Legate was unable to reserve ouf of its memory pool the full amount required for the above operation. Here are some things to try:
* Make sure your code is not impeding the garbage collection of Legate-backed objects, e.g. by storing references in caches, or creating reference cycles.
* Ask Legate to reserve more space on the above memory, using the appropriate --*mem legate flag.
* Assign less memory to the eager pool, by reducing --eager-alloc-percentage.
* If running on multiple nodes, increase how often distributed garbage collection runs, by reducing LEGATE_FIELD_REUSE_FREQ (default: 32, warning: may incur overhead).
* Adapt your code to reduce temporary storage requirements, e.g. by breaking up larger operations into batches.
* If the previous steps don't help, and you are confident Legate should be able to handle your code's working set, please open an issue on Legate's bug tracker.
[1 - 7fdd7c0b1000]    2.473923 {5}{legate}: Legate called abort in /pscratch/sd/d/dmargala/work/legate.core/src/core/mapping/base_mapper.cc at line 804 in function report_failed_mapping
Signal 6 received by node 1, process 1705264 (thread 7fdd7c0b1000) - obtaining backtrace
Signal 6 received by process 626675 (thread 7f47a9e9b000) at: stack trace: 17 frames
  [0] = raise at unknown file:0 [00007f47ba306d2b]
  [1] = abort at unknown file:0 [00007f47ba3083e4]
  [2] = legate::mapping::BaseMapper::report_failed_mapping(Legion::Mappable const&, unsigned int, Realm::Memory, int, unsigned long) [clone .cold] at unknown file:0 [00007f474db5d69e]
  [3] = legate::mapping::BaseMapper::map_legate_store(Legion::Internal::MappingCallInfo*, Legion::Mappable const&, legate::mapping::StoreMapping const&, std::set<Legion::RegionRequirement const*, std::less<Legion::RegionRequirement const*>, std::allocator<Legion::RegionRequirement const*> > const&, Realm::Processor, Legion::Mapping::PhysicalInstance&, bool) at unknown file:0 [00007f474db75d30]
  [4] = legate::mapping::BaseMapper::map_legate_stores(Legion::Internal::MappingCallInfo*, Legion::Mappable const&, std::vector<legate::mapping::StoreMapping, std::allocator<legate::mapping::StoreMapping> >&, Realm::Processor, std::map<Legion::RegionRequirement const*, std::vector<Legion::Mapping::PhysicalInstance, std::allocator<Legion::Mapping::PhysicalInstance> >*, std::less<Legion::RegionRequirement const*>, std::allocator<std::pair<Legion::RegionRequirement const* const, std::vector<Legion::Mapping::PhysicalInstance, std::allocator<Legion::Mapping::PhysicalInstance> >*> > >&)::{lambda(bool)#1}::operator()(bool) const at unknown file:0 [00007f474db7605f]
  [5] = legate::mapping::BaseMapper::map_legate_stores(Legion::Internal::MappingCallInfo*, Legion::Mappable const&, std::vector<legate::mapping::StoreMapping, std::allocator<legate::mapping::StoreMapping> >&, Realm::Processor, std::map<Legion::RegionRequirement const*, std::vector<Legion::Mapping::PhysicalInstance, std::allocator<Legion::Mapping::PhysicalInstance> >*, std::less<Legion::RegionRequirement const*>, std::allocator<std::pair<Legion::RegionRequirement const* const, std::vector<Legion::Mapping::PhysicalInstance, std::allocator<Legion::Mapping::PhysicalInstance> >*> > >&) at unknown file:0 [00007f474db766f5]
  [6] = legate::mapping::BaseMapper::map_task(Legion::Internal::MappingCallInfo*, Legion::Task const&, Legion::Mapping::Mapper::MapTaskInput const&, Legion::Mapping::Mapper::MapTaskOutput&) at unknown file:0 [00007f474db7722c]
  [7] = Legion::Internal::MapperManager::invoke_map_task(Legion::Internal::TaskOp*, Legion::Mapping::Mapper::MapTaskInput*, Legion::Mapping::Mapper::MapTaskOutput*, Legion::Internal::MappingCallInfo*) at unknown file:0 [00007f47bf1c5be7]
  [8] = Legion::Internal::SingleTask::invoke_mapper(Legion::Internal::MustEpochOp*) at unknown file:0 [00007f47bf12509e]
  [9] = Legion::Internal::SingleTask::map_all_regions(Legion::Internal::MustEpochOp*, Legion::Internal::TaskOp::DeferMappingArgs const*) at unknown file:0 [00007f47bf12bbbc]
  [10] = Legion::Internal::PointTask::perform_mapping(Legion::Internal::MustEpochOp*, Legion::Internal::TaskOp::DeferMappingArgs const*) at unknown file:0 [00007f47bf12c0d3]
  [11] = Legion::Internal::Runtime::legion_runtime_task(void const*, unsigned long, void const*, unsigned long, Realm::Processor) at unknown file:0 [00007f47bf2d2e96]
  [12] = Realm::Task::execute_on_processor(Realm::Processor) at unknown file:0 [00007f47bd944cb8]
  [13] = Realm::UserThreadTaskScheduler::execute_task(Realm::Task*) at unknown file:0 [00007f47bd944d55]
  [14] = Realm::ThreadedTaskScheduler::scheduler_loop() at unknown file:0 [00007f47bd9432d6]
  [15] = Realm::UserThread::uthread_entry() at unknown file:0 [00007f47bd949ae9]
  [16] = unknown symbol at unknown file:0 [00007f47ba31d73d]
Signal 6 received by process 1705264 (thread 7fdd7c0b1000) at: stack trace: 17 frames
  [0] = raise at unknown file:0 [00007fdd8bd06d2b]
  [1] = abort at unknown file:0 [00007fdd8bd083e4]
  [2] = legate::mapping::BaseMapper::report_failed_mapping(Legion::Mappable const&, unsigned int, Realm::Memory, int, unsigned long) [clone .cold] at unknown file:0 [00007fdd3d57569e]
  [3] = legate::mapping::BaseMapper::map_legate_store(Legion::Internal::MappingCallInfo*, Legion::Mappable const&, legate::mapping::StoreMapping const&, std::set<Legion::RegionRequirement const*, std::less<Legion::RegionRequirement const*>, std::allocator<Legion::RegionRequirement const*> > const&, Realm::Processor, Legion::Mapping::PhysicalInstance&, bool) at unknown file:0 [00007fdd3d58dd30]
  [4] = legate::mapping::BaseMapper::map_legate_stores(Legion::Internal::MappingCallInfo*, Legion::Mappable const&, std::vector<legate::mapping::StoreMapping, std::allocator<legate::mapping::StoreMapping> >&, Realm::Processor, std::map<Legion::RegionRequirement const*, std::vector<Legion::Mapping::PhysicalInstance, std::allocator<Legion::Mapping::PhysicalInstance> >*, std::less<Legion::RegionRequirement const*>, std::allocator<std::pair<Legion::RegionRequirement const* const, std::vector<Legion::Mapping::PhysicalInstance, std::allocator<Legion::Mapping::PhysicalInstance> >*> > >&)::{lambda(bool)#1}::operator()(bool) const at unknown file:0 [00007fdd3d58e05f]
  [5] = legate::mapping::BaseMapper::map_legate_stores(Legion::Internal::MappingCallInfo*, Legion::Mappable const&, std::vector<legate::mapping::StoreMapping, std::allocator<legate::mapping::StoreMapping> >&, Realm::Processor, std::map<Legion::RegionRequirement const*, std::vector<Legion::Mapping::PhysicalInstance, std::allocator<Legion::Mapping::PhysicalInstance> >*, std::less<Legion::RegionRequirement const*>, std::allocator<std::pair<Legion::RegionRequirement const* const, std::vector<Legion::Mapping::PhysicalInstance, std::allocator<Legion::Mapping::PhysicalInstance> >*> > >&) at unknown file:0 [00007fdd3d58e6f5]
  [6] = legate::mapping::BaseMapper::map_task(Legion::Internal::MappingCallInfo*, Legion::Task const&, Legion::Mapping::Mapper::MapTaskInput const&, Legion::Mapping::Mapper::MapTaskOutput&) at unknown file:0 [00007fdd3d58f22c]
  [7] = Legion::Internal::MapperManager::invoke_map_task(Legion::Internal::TaskOp*, Legion::Mapping::Mapper::MapTaskInput*, Legion::Mapping::Mapper::MapTaskOutput*, Legion::Internal::MappingCallInfo*) at unknown file:0 [00007fdd90b7abe7]
  [8] = Legion::Internal::SingleTask::invoke_mapper(Legion::Internal::MustEpochOp*) at unknown file:0 [00007fdd90ada09e]
  [9] = Legion::Internal::SingleTask::map_all_regions(Legion::Internal::MustEpochOp*, Legion::Internal::TaskOp::DeferMappingArgs const*) at unknown file:0 [00007fdd90ae0bbc]
  [10] = Legion::Internal::PointTask::perform_mapping(Legion::Internal::MustEpochOp*, Legion::Internal::TaskOp::DeferMappingArgs const*) at unknown file:0 [00007fdd90ae10d3]
  [11] = Legion::Internal::Runtime::legion_runtime_task(void const*, unsigned long, void const*, unsigned long, Realm::Processor) at unknown file:0 [00007fdd90c87e96]
  [12] = Realm::Task::execute_on_processor(Realm::Processor) at unknown file:0 [00007fdd8f2f9cb8]
  [13] = Realm::UserThreadTaskScheduler::execute_task(Realm::Task*) at unknown file:0 [00007fdd8f2f9d55]
  [14] = Realm::ThreadedTaskScheduler::scheduler_loop() at unknown file:0 [00007fdd8f2f82d6]
  [15] = Realm::UserThread::uthread_entry() at unknown file:0 [00007fdd8f2feae9]
  [16] = unknown symbol at unknown file:0 [00007fdd8bd1d73d]
srun: error: nid200413: task 0: Exited with exit code 1
srun: Terminating StepId=21765749.0
srun: error: nid200416: task 1: Exited with exit code 1
Command completed on: Thu 15 Feb 2024 10:34:44 AM PST
Job finished: Thu 15 Feb 2024 10:34:44 AM PST
salloc: Relinquishing job allocation 21765749
salloc: Job allocation 21765749 has been revoked.