Expected behavior
I'm trying to run a cunumeric example that uses arrays that do not fit into the memory of a single GPU. I'm starting with the gemm example, which runs without error for problem sizes below the memory limit of a single GPU but fails when I try to scale beyond a single GPU. For example, this works fine (framebuffer memory size of 36250 MB per GPU):
> INTERACTIVE=1 ../quickstart/run.sh 2 examples/gemm.py -n 50000
...
Problem Size: M=50000 N=50000 K=50000
Total Iterations: 100
Total Flops: 249997.5 GFLOPS/iter
Total Size: 30000.0 MB
Elapsed Time: 203771.428 ms
Average GEMM: 2037.7142800000001 ms
FLOPS/s: 122685.25693405847 GFLOPS/s
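For reference, here is a rough sketch of the footprint arithmetic behind the "Total Size" line above. It assumes three dense n x n matrices of 4-byte (float32) elements, which is inferred from the reported 30000.0 MB rather than read from gemm.py itself; the function name `gemm_footprint_mb` is made up for illustration.

```python
def gemm_footprint_mb(n, bytes_per_element=4, num_matrices=3):
    """Size in MB (10**6 bytes) of num_matrices dense n x n arrays.

    Assumption: float32 elements and three matrices (A, B, C), inferred
    from the 'Total Size: 30000.0 MB' line in the output above.
    """
    return num_matrices * n * n * bytes_per_element / 1e6


FBMEM_MB = 36250  # per-GPU framebuffer size used in the run above

for n in (50000, 60000):
    total = gemm_footprint_mb(n)
    ratio = total / FBMEM_MB
    print(f"n={n}: {total:.1f} MB total, {ratio:.2f}x one GPU's framebuffer")
```

Under these assumptions, n=50000 still fits within a single 36250 MB framebuffer, while n=60000 needs more than one GPU's worth of memory, which is the case I'm trying to exercise.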
Observed behavior
Increasing the problem size to n=60000 results in an error such as:
(legate) dmargala@perlmutter:login40:/pscratch/sd/d/dmargala/work/cunumeric> INTERACTIVE=1 ../quickstart/run.sh 2 examples/gemm.py -n 60000
...
[0 - 7f47a9e9b000] 2.473642 {5}{cunumeric.mapper}: Mapper cunumeric on Node 0 failed to allocate 3600000000 bytes on memory 1e00000000000003 (of kind GPU_FB_MEM: Framebuffer memory for one GPU and all its SMs) for region requirement 1 of Task cunumeric::MatMulTask[examples/gemm.py:52] (UID 188).
This means Legate was unable to reserve ouf of its memory pool the full amount required for the above operation. Here are some things to try:
* Make sure your code is not impeding the garbage collection of Legate-backed objects, e.g. by storing references in caches, or creating reference cycles.
* Ask Legate to reserve more space on the above memory, using the appropriate --*mem legate flag.
* Assign less memory to the eager pool, by reducing --eager-alloc-percentage.
* If running on multiple nodes, increase how often distributed garbage collection runs, by reducing LEGATE_FIELD_REUSE_FREQ (default: 32, warning: may incur overhead).
* Adapt your code to reduce temporary storage requirements, e.g. by breaking up larger operations into batches.
* If the previous steps don't help, and you are confident Legate should be able to handle your code's working set, please open an issue on Legate's bug tracker.
[0 - 7f47a9e9b000] 2.473667 {5}{legate}: Legate called abort in /pscratch/sd/d/dmargala/work/legate.core/src/core/mapping/base_mapper.cc at line 804 in function report_failed_mapping
Signal 6 received by node 0, process 626675 (thread 7f47a9e9b000) - obtaining backtrace
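To put the eager-pool hint in context, here is a back-of-the-envelope sketch of the per-GPU budget. The launch command recorded in the full log passes --fbmem 36250 and -lg:eager_alloc_percentage 50 on 2 nodes with 4 GPUs each. The model below, where the eager pool takes that percentage of each framebuffer and the remainder is available for ordinary array instances, is my assumption about how the flag behaves, and `deferred_pool_mb` is a made-up helper name.

```python
def deferred_pool_mb(fbmem_mb, eager_alloc_percentage):
    """MB left per GPU once the eager pool is carved out.

    Assumption: the eager pool reserves eager_alloc_percentage percent
    of each framebuffer; this is a simplified model, not Legate code.
    """
    return fbmem_mb * (100 - eager_alloc_percentage) / 100


FBMEM_MB = 36250                  # --fbmem from the launch command
EAGER_PCT = 50                    # -lg:eager_alloc_percentage from the launch command
NUM_GPUS = 2 * 4                  # --nodes 2, --gpus 4
FAILED_ALLOC_BYTES = 3600000000   # from the mapper error message

per_gpu = deferred_pool_mb(FBMEM_MB, EAGER_PCT)
print(f"per-GPU budget outside the eager pool: {per_gpu} MB")
print(f"aggregate over {NUM_GPUS} GPUs: {per_gpu * NUM_GPUS} MB")
print(f"failed allocation: {FAILED_ALLOC_BYTES / 1e6} MB")
```

If this model is roughly right, the aggregate budget across all 8 GPUs is well above the 43200 MB that three 60000 x 60000 float32 matrices would need, so the failure does not look like a simple shortage of total memory. The failed 3600 MB request equals one quarter of a single 60000 x 60000 float32 matrix.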
Example code or instructions
I set up my environment using the nv-legate/quickstart recipe for Perlmutter, and I launch jobs with the quickstart run script. For example:
(legate) dmargala@perlmutter:login40:/pscratch/sd/d/dmargala/work/cunumeric> INTERACTIVE=1 ../quickstart/run.sh 2 examples/gemm.py -n 60000
Redirecting stdout, stderr and logs to /pscratch/sd/d/dmargala/2024/02/15/103417
Submitted: salloc -q interactive_ss11 -C gpu --gpus-per-node 4 --ntasks-per-node 1 -c 128 -J legate -A nstaff -t 60 -N 2 /pscratch/sd/d/dmargala/work/quickstart/legate.slurm legate --launcher srun --cpus 1 --sysmem 4000 --gpus 4 --fbmem 36250 --verbose --log-to-file --nodes 2 --ranks-per-node 1 examples/gemm.py -n 60000
salloc: Granted job allocation 21765749
salloc: Waiting for resource configuration
salloc: Nodes nid[200413,200416] are ready for job
Job ID: 21765749
Submitted from: /pscratch/sd/d/dmargala/work/cunumeric
Started on: Thu 15 Feb 2024 10:34:24 AM PST
Running on: nid[200413,200416]
Command: legate --logdir /pscratch/sd/d/dmargala/2024/02/15/103417 --launcher srun --cpus 1 --sysmem 4000 --gpus 4 --fbmem 36250 --verbose --log-to-file --nodes 2 --ranks-per-node 1 examples/gemm.py -n 60000
--- Legion Python Configuration ------------------------------------------------
Legate paths:
legate_dir : /pscratch/sd/d/dmargala/legate/lib/python3.11/site-packages
legate_build_dir : None
bind_sh_path : /pscratch/sd/d/dmargala/legate/bin/bind.sh
legate_lib_path : /pscratch/sd/d/dmargala/legate/lib
Legion paths:
legion_bin_path : /pscratch/sd/d/dmargala/legate/bin
legion_lib_path : /pscratch/sd/d/dmargala/legate/lib
realm_defines_h : /pscratch/sd/d/dmargala/legate/include/realm_defines.h
legion_defines_h : /pscratch/sd/d/dmargala/legate/include/legion_defines.h
legion_spy_py : /pscratch/sd/d/dmargala/legate/bin/legion_spy.py
legion_python : /pscratch/sd/d/dmargala/legate/bin/legion_python
legion_prof : /pscratch/sd/d/dmargala/legate/bin/legion_prof
legion_module : /pscratch/sd/d/dmargala/legate/lib/python3.11/site-packages
legion_jupyter_module : /pscratch/sd/d/dmargala/legate/lib/python3.11/site-packages
Versions:
legate_version : 24.01.00.dev+33.g1d0265c
Command:
srun -n 2 --ntasks-per-node 1 /pscratch/sd/d/dmargala/legate/bin/bind.sh --launcher srun -- /pscratch/sd/d/dmargala/legate/bin/legion_python -ll:py 1 -ll:gpu 4 -cuda:skipbusy -ll:util 2 -ll:bgwork 2 -ll:csize 4000 -ll:fsize 36250 -ll:zsize 32 -level openmp=5,gpu=5 -logfile /pscratch/sd/d/dmargala/2024/02/15/103417/legate_%.log -errlevel 4 -lg:eager_alloc_percentage 50 examples/gemm.py -n 60000
Customized Environment:
CUTENSOR_LOG_LEVEL=1
GASNET_MPI_THREAD=MPI_THREAD_MULTIPLE
LEGATE_MAX_DIM=4
LEGATE_MAX_FIELDS=256
LEGATE_NEED_CUDA=1
LEGATE_NEED_NETWORK=1
NCCL_LAUNCH_MODE=PARALLEL
PYTHONDONTWRITEBYTECODE=1
PYTHONPATH=/pscratch/sd/d/dmargala/legate/lib/python3.11/site-packages:/pscratch/sd/d/dmargala/legate/lib/python3.11/site-packages
REALM_BACKTRACE=1
--------------------------------------------------------------------------------
[0 - 7f47a9e9b000] 2.473642 {5}{cunumeric.mapper}: Mapper cunumeric on Node 0 failed to allocate 3600000000 bytes on memory 1e00000000000003 (of kind GPU_FB_MEM: Framebuffer memory for one GPU and all its SMs) for region requirement 1 of Task cunumeric::MatMulTask[examples/gemm.py:52] (UID 188).
This means Legate was unable to reserve ouf of its memory pool the full amount required for the above operation. Here are some things to try:
* Make sure your code is not impeding the garbage collection of Legate-backed objects, e.g. by storing references in caches, or creating reference cycles.
* Ask Legate to reserve more space on the above memory, using the appropriate --*mem legate flag.
* Assign less memory to the eager pool, by reducing --eager-alloc-percentage.
* If running on multiple nodes, increase how often distributed garbage collection runs, by reducing LEGATE_FIELD_REUSE_FREQ (default: 32, warning: may incur overhead).
* Adapt your code to reduce temporary storage requirements, e.g. by breaking up larger operations into batches.
* If the previous steps don't help, and you are confident Legate should be able to handle your code's working set, please open an issue on Legate's bug tracker.
[0 - 7f47a9e9b000] 2.473667 {5}{legate}: Legate called abort in /pscratch/sd/d/dmargala/work/legate.core/src/core/mapping/base_mapper.cc at line 804 in function report_failed_mapping
Signal 6 received by node 0, process 626675 (thread 7f47a9e9b000) - obtaining backtrace
[1 - 7fdd7c0b1000] 2.473897 {5}{cunumeric.mapper}: Mapper cunumeric on Node 1 failed to allocate 3600000000 bytes on memory 1e00010000000004 (of kind GPU_FB_MEM: Framebuffer memory for one GPU and all its SMs) for region requirement 1 of Task cunumeric::MatMulTask[examples/gemm.py:52] (UID 189).
This means Legate was unable to reserve ouf of its memory pool the full amount required for the above operation. Here are some things to try:
* Make sure your code is not impeding the garbage collection of Legate-backed objects, e.g. by storing references in caches, or creating reference cycles.
* Ask Legate to reserve more space on the above memory, using the appropriate --*mem legate flag.
* Assign less memory to the eager pool, by reducing --eager-alloc-percentage.
* If running on multiple nodes, increase how often distributed garbage collection runs, by reducing LEGATE_FIELD_REUSE_FREQ (default: 32, warning: may incur overhead).
* Adapt your code to reduce temporary storage requirements, e.g. by breaking up larger operations into batches.
* If the previous steps don't help, and you are confident Legate should be able to handle your code's working set, please open an issue on Legate's bug tracker.
[1 - 7fdd7c0b1000] 2.473923 {5}{legate}: Legate called abort in /pscratch/sd/d/dmargala/work/legate.core/src/core/mapping/base_mapper.cc at line 804 in function report_failed_mapping
Signal 6 received by node 1, process 1705264 (thread 7fdd7c0b1000) - obtaining backtrace
Signal 6 received by process 626675 (thread 7f47a9e9b000) at: stack trace: 17 frames
[0] = raise at unknown file:0 [00007f47ba306d2b]
[1] = abort at unknown file:0 [00007f47ba3083e4]
[2] = legate::mapping::BaseMapper::report_failed_mapping(Legion::Mappable const&, unsigned int, Realm::Memory, int, unsigned long) [clone .cold] at unknown file:0 [00007f474db5d69e]
[3] = legate::mapping::BaseMapper::map_legate_store(Legion::Internal::MappingCallInfo*, Legion::Mappable const&, legate::mapping::StoreMapping const&, std::set<Legion::RegionRequirement const*, std::less<Legion::RegionRequirement const*>, std::allocator<Legion::RegionRequirement const*> > const&, Realm::Processor, Legion::Mapping::PhysicalInstance&, bool) at unknown file:0 [00007f474db75d30]
[4] = legate::mapping::BaseMapper::map_legate_stores(Legion::Internal::MappingCallInfo*, Legion::Mappable const&, std::vector<legate::mapping::StoreMapping, std::allocator<legate::mapping::StoreMapping> >&, Realm::Processor, std::map<Legion::RegionRequirement const*, std::vector<Legion::Mapping::PhysicalInstance, std::allocator<Legion::Mapping::PhysicalInstance> >*, std::less<Legion::RegionRequirement const*>, std::allocator<std::pair<Legion::RegionRequirement const* const, std::vector<Legion::Mapping::PhysicalInstance, std::allocator<Legion::Mapping::PhysicalInstance> >*> > >&)::{lambda(bool)#1}::operator()(bool) const at unknown file:0 [00007f474db7605f]
[5] = legate::mapping::BaseMapper::map_legate_stores(Legion::Internal::MappingCallInfo*, Legion::Mappable const&, std::vector<legate::mapping::StoreMapping, std::allocator<legate::mapping::StoreMapping> >&, Realm::Processor, std::map<Legion::RegionRequirement const*, std::vector<Legion::Mapping::PhysicalInstance, std::allocator<Legion::Mapping::PhysicalInstance> >*, std::less<Legion::RegionRequirement const*>, std::allocator<std::pair<Legion::RegionRequirement const* const, std::vector<Legion::Mapping::PhysicalInstance, std::allocator<Legion::Mapping::PhysicalInstance> >*> > >&) at unknown file:0 [00007f474db766f5]
[6] = legate::mapping::BaseMapper::map_task(Legion::Internal::MappingCallInfo*, Legion::Task const&, Legion::Mapping::Mapper::MapTaskInput const&, Legion::Mapping::Mapper::MapTaskOutput&) at unknown file:0 [00007f474db7722c]
[7] = Legion::Internal::MapperManager::invoke_map_task(Legion::Internal::TaskOp*, Legion::Mapping::Mapper::MapTaskInput*, Legion::Mapping::Mapper::MapTaskOutput*, Legion::Internal::MappingCallInfo*) at unknown file:0 [00007f47bf1c5be7]
[8] = Legion::Internal::SingleTask::invoke_mapper(Legion::Internal::MustEpochOp*) at unknown file:0 [00007f47bf12509e]
[9] = Legion::Internal::SingleTask::map_all_regions(Legion::Internal::MustEpochOp*, Legion::Internal::TaskOp::DeferMappingArgs const*) at unknown file:0 [00007f47bf12bbbc]
[10] = Legion::Internal::PointTask::perform_mapping(Legion::Internal::MustEpochOp*, Legion::Internal::TaskOp::DeferMappingArgs const*) at unknown file:0 [00007f47bf12c0d3]
[11] = Legion::Internal::Runtime::legion_runtime_task(void const*, unsigned long, void const*, unsigned long, Realm::Processor) at unknown file:0 [00007f47bf2d2e96]
[12] = Realm::Task::execute_on_processor(Realm::Processor) at unknown file:0 [00007f47bd944cb8]
[13] = Realm::UserThreadTaskScheduler::execute_task(Realm::Task*) at unknown file:0 [00007f47bd944d55]
[14] = Realm::ThreadedTaskScheduler::scheduler_loop() at unknown file:0 [00007f47bd9432d6]
[15] = Realm::UserThread::uthread_entry() at unknown file:0 [00007f47bd949ae9]
[16] = unknown symbol at unknown file:0 [00007f47ba31d73d]
Signal 6 received by process 1705264 (thread 7fdd7c0b1000) at: stack trace: 17 frames
[0] = raise at unknown file:0 [00007fdd8bd06d2b]
[1] = abort at unknown file:0 [00007fdd8bd083e4]
[2] = legate::mapping::BaseMapper::report_failed_mapping(Legion::Mappable const&, unsigned int, Realm::Memory, int, unsigned long) [clone .cold] at unknown file:0 [00007fdd3d57569e]
[3] = legate::mapping::BaseMapper::map_legate_store(Legion::Internal::MappingCallInfo*, Legion::Mappable const&, legate::mapping::StoreMapping const&, std::set<Legion::RegionRequirement const*, std::less<Legion::RegionRequirement const*>, std::allocator<Legion::RegionRequirement const*> > const&, Realm::Processor, Legion::Mapping::PhysicalInstance&, bool) at unknown file:0 [00007fdd3d58dd30]
[4] = legate::mapping::BaseMapper::map_legate_stores(Legion::Internal::MappingCallInfo*, Legion::Mappable const&, std::vector<legate::mapping::StoreMapping, std::allocator<legate::mapping::StoreMapping> >&, Realm::Processor, std::map<Legion::RegionRequirement const*, std::vector<Legion::Mapping::PhysicalInstance, std::allocator<Legion::Mapping::PhysicalInstance> >*, std::less<Legion::RegionRequirement const*>, std::allocator<std::pair<Legion::RegionRequirement const* const, std::vector<Legion::Mapping::PhysicalInstance, std::allocator<Legion::Mapping::PhysicalInstance> >*> > >&)::{lambda(bool)#1}::operator()(bool) const at unknown file:0 [00007fdd3d58e05f]
[5] = legate::mapping::BaseMapper::map_legate_stores(Legion::Internal::MappingCallInfo*, Legion::Mappable const&, std::vector<legate::mapping::StoreMapping, std::allocator<legate::mapping::StoreMapping> >&, Realm::Processor, std::map<Legion::RegionRequirement const*, std::vector<Legion::Mapping::PhysicalInstance, std::allocator<Legion::Mapping::PhysicalInstance> >*, std::less<Legion::RegionRequirement const*>, std::allocator<std::pair<Legion::RegionRequirement const* const, std::vector<Legion::Mapping::PhysicalInstance, std::allocator<Legion::Mapping::PhysicalInstance> >*> > >&) at unknown file:0 [00007fdd3d58e6f5]
[6] = legate::mapping::BaseMapper::map_task(Legion::Internal::MappingCallInfo*, Legion::Task const&, Legion::Mapping::Mapper::MapTaskInput const&, Legion::Mapping::Mapper::MapTaskOutput&) at unknown file:0 [00007fdd3d58f22c]
[7] = Legion::Internal::MapperManager::invoke_map_task(Legion::Internal::TaskOp*, Legion::Mapping::Mapper::MapTaskInput*, Legion::Mapping::Mapper::MapTaskOutput*, Legion::Internal::MappingCallInfo*) at unknown file:0 [00007fdd90b7abe7]
[8] = Legion::Internal::SingleTask::invoke_mapper(Legion::Internal::MustEpochOp*) at unknown file:0 [00007fdd90ada09e]
[9] = Legion::Internal::SingleTask::map_all_regions(Legion::Internal::MustEpochOp*, Legion::Internal::TaskOp::DeferMappingArgs const*) at unknown file:0 [00007fdd90ae0bbc]
[10] = Legion::Internal::PointTask::perform_mapping(Legion::Internal::MustEpochOp*, Legion::Internal::TaskOp::DeferMappingArgs const*) at unknown file:0 [00007fdd90ae10d3]
[11] = Legion::Internal::Runtime::legion_runtime_task(void const*, unsigned long, void const*, unsigned long, Realm::Processor) at unknown file:0 [00007fdd90c87e96]
[12] = Realm::Task::execute_on_processor(Realm::Processor) at unknown file:0 [00007fdd8f2f9cb8]
[13] = Realm::UserThreadTaskScheduler::execute_task(Realm::Task*) at unknown file:0 [00007fdd8f2f9d55]
[14] = Realm::ThreadedTaskScheduler::scheduler_loop() at unknown file:0 [00007fdd8f2f82d6]
[15] = Realm::UserThread::uthread_entry() at unknown file:0 [00007fdd8f2feae9]
[16] = unknown symbol at unknown file:0 [00007fdd8bd1d73d]
srun: error: nid200413: task 0: Exited with exit code 1
srun: Terminating StepId=21765749.0
srun: error: nid200416: task 1: Exited with exit code 1
Command completed on: Thu 15 Feb 2024 10:34:44 AM PST
Job finished: Thu 15 Feb 2024 10:34:44 AM PST
salloc: Relinquishing job allocation 21765749
salloc: Job allocation 21765749 has been revoked.