nv-legate / cunumeric

An Aspiring Drop-In Replacement for NumPy at Scale
https://docs.nvidia.com/cunumeric/24.06/
Apache License 2.0

More doc on cpus and mem #1039

Open luweizheng opened 1 year ago

luweizheng commented 1 year ago

Hi,

I currently use Legate and am running some stress tests to see how well it scales. I find the doc (https://nv-legate.github.io/cunumeric/23.07/user/configuration.html#resource-allocation) is not very clear on how to specify CPUs and memory.

I am using 23.03.00+3.g5de57a8 installed by conda. I guess it is pre-built by the dev team.

Is --cpus the number of CPU cores? What does "per rank" mean? Is it the same as an MPI program that has multiple ranks?

If I increase the input size of a cholesky decomposition program, I get the following error:

[0 - 14d70fe9c700]    3.285578 {5}{cunumeric.mapper}: Mapper cunumeric on Node 0 failed to allocate 100000000 bytes on memory 1e00000000000000 (of kind SYSTEM_MEM: Visible to all processors on a node) for region requirement 1 of Task cunumeric::MatMulTask[leg.py:17] (UID 436).
This means Legate was unable to reserve out of its memory pool the full amount required for the above operation. Here are some things to try:
* Make sure your code is not impeding the garbage collection of Legate-backed objects, e.g. by storing references in caches, or creating reference cycles.
* Ask Legate to reserve more space on the above memory, using the appropriate --*mem legate flag.
* Assign less memory to the eager pool, by reducing --eager-alloc-percentage.
* If running on multiple nodes, increase how often distributed garbage collection runs, by reducing LEGATE_FIELD_REUSE_FREQ (default: 32, warning: may incur overhead).
* Adapt your code to reduce temporary storage requirements, e.g. by breaking up larger operations into batches.
* If the previous steps don't help, and you are confident Legate should be able to handle your code's working set, please open an issue on Legate's bug tracker.
[0 - 14d70fe9c700]    3.286919 {5}{legate}: Legate called abort in /opt/conda/conda-bld/legate-core_1678881258206/work/src/core/mapping/base_mapper.cc at line 925 in function report_failed_mapping
Signal 6 received by node 0, process 52160 (thread 14d70fe9c700) - obtaining backtrace
Signal 6 received by process 52160 (thread 14d70fe9c700) at: stack trace: 18 frames

The equivalent NumPy program can handle input of the same size. How should I set --eager-alloc-percentage, 10 or 10%? I guess I should use --numamem? There is little information on how to configure these parameters or what they mean.

It would be great if more docs were added. Thanks a lot.

luweizheng commented 1 year ago

Update: I increased --sysmem and the program now runs successfully.
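
For reference, this was just a larger --sysmem value on the legate command line, along these lines (the exact value here is illustrative and depends on the node's memory):

legate --cpus 32 --sysmem 160000 leg.py --n 20000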

luweizheng commented 1 year ago

I want to run on 6 nodes, each with 64 cores, using Intel MPI. In the following example, how should I set --cpus when I launch an mpirun job? Is --cpus 32 correct, or should I set --cpus 1?

mpirun -np 384 singularity run --bind /usr/local/nvidia --nv legate.simg legate --cpus 32 --sysmem 160000 leg.py --n 20000

manopapad commented 1 year ago

Thank you for the comments. We are in the process of cleaning up our documentation, and your feedback is very informative.

I am using 23.03.00+3.g5de57a8 installed by conda. I guess it is pre-built by the dev team.

I suggest using the latest released version, 23.07.

Is --cpus the number of CPU cores?

It is the number of "physical" CPU cores to reserve on each rank for Legate's use. "Physical" means ignoring SMT (simultaneous multi-threading). Currently most systems use SMT=2, so the number of "virtual" cores (which is what most tools report, e.g. htop) will be double the number of physical cores. This is a quirk of the underlying Realm system.
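
For example, one way to check the physical core count on a Linux node (a sketch; the counts in the comment below are hypothetical):

lscpu | grep -E 'Thread\(s\) per core|Core\(s\) per socket|Socket\(s\)'
# e.g. 2 sockets x 32 cores/socket with 2 threads/core = 64 physical cores,
# even though htop reports 128 "CPUs"; base --cpus on the 64 figure.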

What does "per rank" mean? Is it the same as an MPI program that has multiple ranks?

That is correct. We could rephrase the documentation for these options to talk about "CPUs per process" rather than "CPUs per rank", which is a more esoteric term. We typically suggest using the default mode of 1 rank per node.

How should I set --eager-alloc-percentage, 10 or 10%?

Just 10, the % is implied.
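
For example, combined with the flags used earlier in this thread:

legate --cpus 32 --sysmem 160000 --eager-alloc-percentage 10 leg.py --n 20000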

I guess I should use --numamem

--numamem is relevant only if you're using OpenMP (through --omps and --ompthreads). For --cpus the relevant setting is --sysmem. You may want to try using OpenMP to see if it improves performance over using CPUs separately.
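
A rough sketch of what an OpenMP-based configuration could look like (the numbers here are purely illustrative, not a recommendation):

legate --omps 2 --ompthreads 16 --numamem 160000 leg.py --n 20000
# --omps: OpenMP groups per rank, --ompthreads: threads per group,
# --numamem: memory reserved for the OpenMP groups (analogous to --sysmem for --cpus)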

I want to run on 6 nodes, each with 64 cores, using Intel MPI. In the following example, how should I set --cpus when I launch an mpirun job? Is --cpus 32 correct, or should I set --cpus 1?

You're doing it correctly (including the quirk with physical vs virtual cores), but you need to tell Legate this is a multi-node execution, otherwise you'll just get an independent copy of the program running on each node (instead of all nodes cooperating in a single run):

mpirun -np 384 singularity run --bind /usr/local/nvidia --nv legate.simg legate --nodes 6 --cpus 32 --sysmem 160000 leg.py --n 20000

I will open a discussion on automatically detecting that Legate was launched externally, like in your example, so you don't have to set --nodes manually in this scenario.

Even better, let Legate deal with the parallel launch automatically (you can also add --verbose to confirm that the final command is correct).

legate --nodes 6 --cpus 32 --sysmem 160000 --launcher mpirun --wrapper="singularity run --bind /usr/local/nvidia --nv legate.simg" leg.py --n 20000

luweizheng commented 1 year ago

Thank you so much! @manopapad

Another question.

As shown in my example, I use a Singularity container. Is this the right way to start a multi-node program, given that legate is inside the Singularity container? I built the container using https://github.com/nv-legate/quickstart. It seems that this container uses Open MPI.

mpirun -np 384 singularity run --bind /usr/local/nvidia --nv legate.simg legate --nodes 6 --cpus 32 --sysmem 160000 leg.py --n 20000

It would be really helpful if there were an official Docker or Singularity container. It is quite hard for people outside the dev team to build this project.

Alternatively, the conda channel could provide pre-built multi-node packages.

manopapad commented 1 year ago

Is this the right way to start a multi-node program?

I suggest you start with a 1-rank-per-node configuration. Therefore try either:

mpirun -n 6 --npernode 1 singularity run --bind /usr/local/nvidia --nv legate.simg legate --nodes 6 --cpus 32 --sysmem 160000 leg.py --n 20000

or:

legate --nodes 6 --cpus 32 --sysmem 160000 --launcher mpirun --wrapper "singularity run --bind /usr/local/nvidia --nv legate.simg" leg.py --n 20000
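
To make the launch geometry explicit (using the numbers from this thread):

# mpirun -np 384           -> 384 ranks in total (64 per node x 6 nodes), i.e. 64 Legate processes per node
# mpirun -n 6 --npernode 1 -> 6 ranks in total, one Legate process per node, each using --cpus 32 / --sysmem 160000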

It would be really helpful if there were an official Docker or Singularity container. It is quite hard for people outside the dev team to build this project. Alternatively, the conda channel could provide pre-built multi-node packages.

The current priority is to have multi-node-capable conda packages for one of the upcoming releases (using the ucx package from conda-forge for networking).

luweizheng commented 1 year ago

@manopapad Hi

Thanks a lot!

The following command can start a multi-node program.

mpirun -n 6 --npernode 1 singularity run --bind /usr/local/nvidia --nv legate.simg legate --nodes 6 --cpus 32 --sysmem 160000 leg.py --n 20000

But I get the following error.

Signal 6 received by process 20448 (thread 7fde78f12700) at: stack trace: 17 frames
  [0] = raise at unknown file:0 [00007fde8c2fbe87]
  [1] = abort at unknown file:0 [00007fde8c2fd7f0]
  [2] = Legion::Internal::Runtime::report_error_message(int, char const*, int, char const*) at unknown file:0 [00007fde8e3416c5]
  [3] = Legion::Internal::ShardingFunction::find_owner(Legion::DomainPoint const&, Legion::Domain const&) at unknown file:0 [00007fde8e34553c]
  [4] = Legion::Internal::IndexSpaceNodeT<2, long long>::create_shard_space(Legion::Internal::ShardingFunction*, unsigned int, Legion::IndexSpace, Legion::Domain const&, std::vector<Legion::DomainPoint, std::allocator<Legion::DomainPoint> > const&, Legion::Internal::Provenance*) at unknown file:0 [00007fde8e595440]
  [5] = Legion::Internal::ShardingFunction::find_shard_space(unsigned int, Legion::Internal::IndexSpaceNode*, Legion::IndexSpace, Legion::Internal::Provenance*) at unknown file:0 [00007fde8e37a29a]
  [6] = Legion::Internal::ReplIndexTask::trigger_ready() at unknown file:0 [00007fde8e1be312]
  [7] = Legion::Internal::Memoizable<Legion::Internal::ReplIndexTask>::trigger_ready() at unknown file:0 [00007fde8e4015b1]
  [8] = Legion::Internal::InnerContext::process_ready_queue() at unknown file:0 [00007fde8e096368]
  [9] = Legion::Internal::InnerContext::handle_ready_queue(void const*) at unknown file:0 [00007fde8e09640c]
  [10] = Legion::Internal::Runtime::legion_runtime_task(void const*, unsigned long, void const*, unsigned long, Realm::Processor) at unknown file:0 [00007fde8e3ca0df]
  [11] = Realm::LocalTaskProcessor::execute_task(unsigned int, Realm::ByteArrayRef const&) at unknown file:0 [00007fde8cb2f600]
  [12] = Realm::Task::execute_on_processor(Realm::Processor) at unknown file:0 [00007fde8cb7bcc2]
  [13] = Realm::UserThreadTaskScheduler::execute_task(Realm::Task*) at unknown file:0 [00007fde8cb7bec5]
  [14] = Realm::ThreadedTaskScheduler::scheduler_loop() at unknown file:0 [00007fde8cb7a319]
  [15] = Realm::UserThread::uthread_entry() at unknown file:0 [00007fde8cb84111]
  [16] = unknown symbol at unknown file:0 [00007fde8c31567f]

My python code:

import argparse
import time

import cunumeric as np
from legate.timing import time as leg_time

def main():
    parser = argparse.ArgumentParser(description="cholesky benchmark")
    parser.add_argument("--n", type=int, default=10000, help="n")
    args = parser.parse_args()
    n = args.n

    start = time.time()        # wall-clock timer
    start_leg = leg_time('s')  # timer from legate.timing ('s' = seconds)
    a = np.random.rand(n, n)
    a = a.T @ a + n * np.eye(n)  # make a symmetric positive-definite

    l = np.linalg.cholesky(a=a)

    end = time.time()
    end_leg = leg_time('s')
    print(f"time: {end - start}")
    print(f"leg time: {end_leg - start_leg}")

if __name__ == "__main__":
    main()

manopapad commented 1 year ago

Let me see if we can reproduce this. Can you report the versions of relevant repositories that you're using (legate.core, cunumeric, legion if applicable)?

Also, can you share the full output? I see a frame in report_error_message, therefore I would expect an error message to have been printed above this stacktrace.

luweizheng commented 1 year ago

@manopapad

I built the container using the quickstart dockerfile by:

CUDA_VER=11.7.1 PYTHON_VER=3.10 LINUX_VER=ubuntu18.04 NOPULL=1 ./make_image.sh 

Then I converted the Docker container into a Singularity container. Although it was built with CUDA, I am running on CPU-only nodes, using the command line from my previous comments. Also, the host machine's OFED version is 5.6. That said, I can successfully run single-node, or multi-node where each node runs its own independent copy of the program (no data partitioning across nodes).
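
For anyone trying to reproduce this, a typical Docker-to-Singularity conversion looks roughly like the following (the local Docker image name is just a placeholder):

singularity build legate.simg docker-daemon://<local-image-name>:latest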

The full log is here: legate.log.

manopapad commented 1 year ago

I was able to reproduce this. It looks like it is a known issue, and should be fixed with https://github.com/nv-legate/legate.core/pull/819.

luweizheng commented 1 year ago

Hi @manopapad Is there any progress on multi-node conda packages in recent releases?

manopapad commented 1 year ago

I expect multi-node conda packages to be ready around Q1 2024.