luweizheng opened 1 year ago
Update: I increased the sysmem and the program now runs successfully.
I want to start on 6 nodes, each with 64 cores, and I use Intel MPI. In the following example, how should I set the CPUs if I launch an mpirun job? Is --cpus 32 in the following command correct, or should I set --cpus 1?
mpirun -np 384 singularity run --bind /usr/local/nvidia --nv legate.simg legate --cpus 32 --sysmem 160000 leg.py --n 20000
Thank you for the comments. We are in the process of cleaning up our documentation, and your feedback is very informative.
I am using 23.03.00+3.g5de57a8 installed via conda. I guess it is pre-built by the dev team.
I suggest using the latest released version, 23.07.
Is --cpus the number of CPU cores?
It is the number of "physical" CPU cores to reserve on each rank for Legate's use. "Physical" means ignoring SMT (simultaneous multi-threading). Currently most systems use SMT=2, so the number of "virtual" cores (which is what most tools report, e.g. htop) will be double the number of physical cores. This is a quirk of the underlying Realm system.
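For example, a quick way to check the physical vs. virtual core count on a node is the short Python sketch below (this is just an illustration, not part of Legate, and it assumes the third-party psutil package is installed):

# Minimal sketch: compare physical vs. virtual (SMT) core counts on this node.
# os.cpu_count() reports virtual cores (including SMT threads), while psutil
# can report the physical count that --cpus refers to.
import os
import psutil

virtual_cores = os.cpu_count()                    # counts SMT threads
physical_cores = psutil.cpu_count(logical=False)  # ignores SMT
print(f"virtual: {virtual_cores}, physical: {physical_cores}")

On a 64-core node with SMT=2 this would report 128 virtual and 64 physical cores, and --cpus counts against the 64 physical ones.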
What does "per rank" mean? Is it the same as an MPI program that has multiple ranks?
That is correct. We could rephrase the documentation for these options to talk about "CPUs per process" rather than "CPUs per rank", which is a more esoteric term. We typically suggest using the default mode of 1 rank per node.
How do I set --eager-alloc-percentage, as 10 or as 10%?
Just 10, the % is implied.
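For example (the other flags and sizes below are just illustrative, carried over from your command):
legate --cpus 32 --sysmem 160000 --eager-alloc-percentage 10 leg.py --n 20000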
I guess I should use --numamem?
--numamem is relevant only if you're using OpenMP (through --omps and --ompthreads). For --cpus the relevant setting is --sysmem. You may want to try using OpenMP to see if it improves performance over using CPUs separately.
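For illustration, an OpenMP configuration on your 64-core nodes might look something like the following (the group, thread, and memory counts are placeholders to tune for your machines, not recommended values):
legate --omps 2 --ompthreads 16 --numamem 80000 leg.py --n 20000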
I want to start on 6 nodes, each with 64 cores, and I use Intel MPI. In the following example, how should I set the CPUs if I launch an mpirun job? Is --cpus 32 in the following command correct, or should I set --cpus 1?
You're doing it correctly (including the quirkiness with physical vs virtual cores), but you need to tell Legate this is a multi-node execution, otherwise you'll just get one copy of the program executing independently on each node (instead of all executing as part of the same execution):
mpirun -np 384 singularity run --bind /usr/local/nvidia --nv legate.simg legate --nodes 6 --cpus 32 --sysmem 160000 leg.py --n 20000
I will open a discussion on automatically detecting that Legate was launched externally, like in your example, so you don't have to set --nodes manually in this scenario.
Even better, let Legate deal with the parallel launch automatically (you can also add --verbose to confirm that the final command is correct):
legate --nodes 6 --cpus 32 --sysmem 160000 --launcher mpirun --wrapper="singularity run --bind /usr/local/nvidia --nv legate.simg" leg.py --n 20000
Thank you so much! @manopapad
Another question.
As shown in my example, I use a Singularity container. Is this the right way to start a multi-node program, given that legate is inside the Singularity container? I built the container using https://github.com/nv-legate/quickstart. It seems that this container is using OpenMPI.
mpirun -np 384 singularity run --bind /usr/local/nvidia --nv legate.simg legate --nodes 6 --cpus 32 --sysmem 160000 leg.py --n 20000
It would be really helpful if there were an official Docker or Singularity container; it is indeed hard for people outside the dev team to build this project. Alternatively, the conda channel could provide multi-node pre-built packages.
Is this the right way to start a multi-node program?
I suggest you start with a 1-rank-per-node configuration. Therefore try either:
mpirun -n 6 --npernode 1 singularity run --bind /usr/local/nvidia --nv legate.simg legate --nodes 6 --cpus 32 --sysmem 160000 leg.py --n 20000
or:
legate --nodes 6 --cpus 32 --sysmem 160000 --launcher mpirun --wrapper "singularity run --bind /usr/local/nvidia --nv legate.simg" leg.py --n 20000
It would be really helpful if there were an official Docker or Singularity container; it is indeed hard for people outside the dev team to build this project. Alternatively, the conda channel could provide multi-node pre-built packages.
The current priority is to have multi-node-capable conda packages for one of the upcoming releases (using the ucx package from conda-forge for networking).
Hi @manopapad, thanks a lot!
The following command can start a multi-node program.
mpirun -n 6 --npernode 1 singularity run --bind /usr/local/nvidia --nv legate.simg legate --nodes 6 --cpus 32 --sysmem 160000 leg.py --n 20000
But I get the following error.
Signal 6 received by process 20448 (thread 7fde78f12700) at: stack trace: 17 frames
[0] = raise at unknown file:0 [00007fde8c2fbe87]
[1] = abort at unknown file:0 [00007fde8c2fd7f0]
[2] = Legion::Internal::Runtime::report_error_message(int, char const*, int, char const*) at unknown file:0 [00007fde8e3416c5]
[3] = Legion::Internal::ShardingFunction::find_owner(Legion::DomainPoint const&, Legion::Domain const&) at unknown file:0 [00007fde8e34553c]
[4] = Legion::Internal::IndexSpaceNodeT<2, long long>::create_shard_space(Legion::Internal::ShardingFunction*, unsigned int, Legion::IndexSpace, Legion::Domain const&, std::vector<Legion::DomainPoint, std::allocator<Legion::DomainPoint> > const&, Legion::Internal::Provenance*) at unknown file:0 [00007fde8e595440]
[5] = Legion::Internal::ShardingFunction::find_shard_space(unsigned int, Legion::Internal::IndexSpaceNode*, Legion::IndexSpace, Legion::Internal::Provenance*) at unknown file:0 [00007fde8e37a29a]
[6] = Legion::Internal::ReplIndexTask::trigger_ready() at unknown file:0 [00007fde8e1be312]
[7] = Legion::Internal::Memoizable<Legion::Internal::ReplIndexTask>::trigger_ready() at unknown file:0 [00007fde8e4015b1]
[8] = Legion::Internal::InnerContext::process_ready_queue() at unknown file:0 [00007fde8e096368]
[9] = Legion::Internal::InnerContext::handle_ready_queue(void const*) at unknown file:0 [00007fde8e09640c]
[10] = Legion::Internal::Runtime::legion_runtime_task(void const*, unsigned long, void const*, unsigned long, Realm::Processor) at unknown file:0 [00007fde8e3ca0df]
[11] = Realm::LocalTaskProcessor::execute_task(unsigned int, Realm::ByteArrayRef const&) at unknown file:0 [00007fde8cb2f600]
[12] = Realm::Task::execute_on_processor(Realm::Processor) at unknown file:0 [00007fde8cb7bcc2]
[13] = Realm::UserThreadTaskScheduler::execute_task(Realm::Task*) at unknown file:0 [00007fde8cb7bec5]
[14] = Realm::ThreadedTaskScheduler::scheduler_loop() at unknown file:0 [00007fde8cb7a319]
[15] = Realm::UserThread::uthread_entry() at unknown file:0 [00007fde8cb84111]
[16] = unknown symbol at unknown file:0 [00007fde8c31567f]
My Python code:
import argparse
import time

import cunumeric as np
from legate.timing import time as leg_time


def main():
    parser = argparse.ArgumentParser(description="cholesky benchmark")
    parser.add_argument("--n", type=int, default=10000, help="n")
    args = parser.parse_args()
    n = args.n

    start = time.time()
    start_leg = leg_time('s')
    # Build a symmetric positive-definite matrix so the Cholesky factorization is well defined.
    a = np.random.rand(n, n)
    a = a.T @ a + n * np.eye(n)
    l = np.linalg.cholesky(a=a)
    end = time.time()
    end_leg = leg_time('s')
    print(f"time: {end - start}")
    print(f"leg time: {end_leg - start_leg}")


if __name__ == "__main__":
    main()
Let me see if we can reproduce this. Can you report the versions of relevant repositories that you're using (legate.core, cunumeric, legion if applicable)?
Also, can you share the full output? I see a frame in report_error_message, therefore I would expect an error message to have been printed above this stack trace.
@manopapad I built the container using the quickstart Dockerfile by running:
CUDA_VER=11.7.1 PYTHON_VER=3.10 LINUX_VER=ubuntu18.04 NOPULL=1 ./make_image.sh
Then I converted the Docker container into a Singularity container. Although it was built with CUDA, I run on a CPU node, just as in the command lines I mentioned in the previous comments. Also, the host machine's OFED version is 5.6. It seems that I can successfully run single-node, or multi-node where each node runs the same program independently (no data chunking).
The full log is here: legate.log.
I was able to reproduce this. It looks like it is a known issue, and should be fixed with https://github.com/nv-legate/legate.core/pull/819.
Hi @manopapad, is there any progress on a multi-node conda package in recent releases?
I expect multi-node conda packages to be ready around Q1 2024.
Hi,
I am currently using Legate and running some stress tests to see how it scales. I find the docs (https://nv-legate.github.io/cunumeric/23.07/user/configuration.html#resource-allocation) are not very clear on how to specify CPUs and memory.
I am using 23.03.00+3.g5de57a8 installed via conda. I guess it is pre-built by the dev team.
Is --cpus the number of CPU cores? What does "per rank" mean? Is it the same as an MPI program that has multiple ranks? If I increase the input size of a Cholesky decomposition program, I get an error, while the NumPy program can handle the same size. How do I set --eager-alloc-percentage, as 10 or as 10%? I guess I should use --numamem? There is little information on how to configure these parameters or what they mean. Thanks a lot if more docs are added.