nv-legate / legate.core

The Foundation for All Legate Libraries
https://docs.nvidia.com/legate/24.06/
Apache License 2.0

Legate doesn't recognize/use GPUs on other nodes #949

Open s769 opened 2 months ago

s769 commented 2 months ago

I am trying to run the cunumeric cholesky.py example on multiple nodes. Each node has 3 A100 40GB GPUs. I was running into some out-of-memory errors, so I first tried the following test script (call it memtest.py) to see how memory was being allocated.

import cunumeric as np
import sys

n = int(sys.argv[1])
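# allocate an n x n identity matrix (float64 by default, i.e. 8 bytes per element)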
A = np.eye(n)

I ran this script with legate --nodes <num_nodes> --gpus 3 --fbmem 38000 --eager-alloc-percentage 1 --mem-usage ./memtest.py <n>. Here is the output for n = 121000 and num_nodes = 1:

(legate-ucx) c309-004.ls6(1055)$ legate --nodes 1 --gpus 3 --fbmem 38000 --eager-alloc-percentage 1 --mem-usage ./memtest.py 121000
--------------------------------------------------------------------------
WARNING: No preset parameters were found for the device that Open MPI
detected:

  Local host:            c309-004
  Device name:           mlx5_0
  Device vendor ID:      0x02c9
  Device vendor part ID: 41686

Default device parameters will be used, which may result in lower
performance.  You can edit any of the files specified by the
btl_openib_device_param_files MCA parameter to set values for your
device.

NOTE: You can turn off this warning by setting the MCA parameter
      btl_openib_warn_no_device_params_found to 0.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   c309-004
  Local device: mlx5_0
--------------------------------------------------------------------------
[0 - 14acf432e000]    1.077806 {3}{legate.core.mapper}: legate.core used 39043312000 bytes of Framebuffer memory for one GPU and all its SMs memory 1e00000000000002 with 39845888000 total bytes (98%)
[0 - 14acf432e000]    1.077848 {3}{legate.core.mapper}: legate.core used 39043312000 bytes of Framebuffer memory for one GPU and all its SMs memory 1e00000000000003 with 39845888000 total bytes (98%)
[0 - 14acf432e000]    1.077852 {3}{legate.core.mapper}: legate.core used 39041376000 bytes of Framebuffer memory for one GPU and all its SMs memory 1e00000000000004 with 39845888000 total bytes (98%)
[0 - 14acf432e000]    1.077858 {3}{cunumeric.mapper}: cunumeric used 39043312000 bytes of Framebuffer memory for one GPU and all its SMs memory 1e00000000000002 with 39845888000 total bytes (98%)
[0 - 14acf432e000]    1.077861 {3}{cunumeric.mapper}: cunumeric used 39043312000 bytes of Framebuffer memory for one GPU and all its SMs memory 1e00000000000003 with 39845888000 total bytes (98%)
[0 - 14acf432e000]    1.077863 {3}{cunumeric.mapper}: cunumeric used 39041376000 bytes of Framebuffer memory for one GPU and all its SMs memory 1e00000000000004 with 39845888000 total bytes (98%)

This makes sense; the total memory usage corresponds to the size of the matrix. Now, if I run it with num_nodes=2, I get the same output. I would assume that if the number of nodes (and hence GPUs) were doubled, the memory usage on each GPU would be halved.
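For reference, a rough back-of-the-envelope check of the single-node numbers (assuming the float64 default, 8 bytes per element):

n = 121000
total_bytes = n * n * 8      # ~117.1 GB for the full matrix
per_gpu = total_bytes / 3    # ~39.04 GB per GPU on one 3-GPU node,
                             # close to the ~39043312000 bytes in the log above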

Is there something wrong with how I am running the program?

Also, what does --eager-alloc-percentage actually do? I observed that the higher I make it, the smaller the matrix size at which the program throws an out-of-memory error. Is it OK to always keep this value at 1? Is 0 an allowed value?

Any help is appreciated.

Legate version and info:

(legate-ucx) c309-004.ls6(1058)$ legate --info
Legate build configuration:
  build_type : Release
  use_openmp : True
  use_cuda   : True
  networks   : ucx
  conduit    :

(legate-ucx) c309-004.ls6(1059)$ legate --version
24.01.00.dev+38.g90944d7
manopapad commented 1 month ago

Now, if I run it with num_nodes=2, I get the same output. I would assume that if the number of nodes (hence GPUs) was doubled, the memory usage of each GPU would be halved.

Running on the latest 24.06 release, with some additional debugging output, I can confirm that this is indeed the case (I'm testing on a machine with 2 GPUs, running 1 vs 2 ranks, with 1 GPU per rank):

$ LEGATE_LOG_MAPPING=1 legate --logging cunumeric.mapper=1 --gpus 1 --fbmem 20000 949.py 10000
...
[0 - 7f158243d000]    0.824425 {2}{cunumeric.mapper}: MAP_TASK for cunumeric::EyeTask<12> @ 949.py:5
[0 - 7f158243d000]    0.824446 {2}{cunumeric.mapper}:   TARGET PROCS: 1d00000000000007
[0 - 7f158243d000]    0.824453 {2}{cunumeric.mapper}:   CHOSEN INSTANCES:
[0 - 7f158243d000]    0.824459 {2}{cunumeric.mapper}:     Requirement[0](privilege=READ_WRITE,region=(1,(1,1),1),domain=<0,0>..<9999,9999>,fields=10000)
[0 - 7f158243d000]    0.824465 {2}{cunumeric.mapper}:       Instance[4000000000800001](region=(1,*,1),memory=1e00000000000002,domain=<0,0>..<9999,9999>,fields=10000,layout=SoA:XY)
...
$ LEGATE_LOG_MAPPING=1 legate --ranks-per-node 2 --logging cunumeric.mapper=1 --gpus 1 --fbmem 20000 --gpu-bind 0/1 --launcher mpirun 949.py 10000
...
[0 - 7dca0008d000]    4.764809 {2}{cunumeric.mapper}: MAP_TASK for cunumeric::EyeTask(index_point=(0,0))<40> @ 949.py:5
[0 - 7dca0008d000]    4.764829 {2}{cunumeric.mapper}:   TARGET PROCS: 1d00000000000007
[0 - 7dca0008d000]    4.764836 {2}{cunumeric.mapper}:   CHOSEN INSTANCES:
[0 - 7dca0008d000]    4.764842 {2}{cunumeric.mapper}:     Requirement[0](privilege=READ_WRITE,region=(2,(8,2),2),domain=<0,0>..<9999,4999>,fields=10000)
[0 - 7dca0008d000]    4.764848 {2}{cunumeric.mapper}:       Instance[4000000000800001](region=(2,*,2),memory=1e00000000000002,domain=<0,0>..<9999,4999>,fields=10000,layout=SoA:XY)
[1 - 7dddd0016000]    4.841141 {2}{cunumeric.mapper}: MAP_TASK for cunumeric::EyeTask(index_point=(0,1))<37> @ 949.py:5
[1 - 7dddd0016000]    4.841162 {2}{cunumeric.mapper}:   TARGET PROCS: 1d00010000000007
[1 - 7dddd0016000]    4.841169 {2}{cunumeric.mapper}:   CHOSEN INSTANCES:
[1 - 7dddd0016000]    4.841176 {2}{cunumeric.mapper}:     Requirement[0](privilege=READ_WRITE,region=(2,(5,2),2),domain=<0,5000>..<9999,9999>,fields=10000)
[1 - 7dddd0016000]    4.841182 {2}{cunumeric.mapper}:       Instance[4000400040800001](region=(2,*,2),memory=1e00010000000002,domain=<0,5000>..<9999,9999>,fields=10000,layout=SoA:XY)
...

In the first case the entire domain <0,0>..<9999,9999> is instantiated on one GPU; in the second case each GPU gets one half of the domain.

I'm pretty sure --mem-usage would show the same information ... if it were working. I put an item on our roadmap to fix this flag for an upcoming weekly release.

what does the --eager-alloc-percentage actually do

Currently, Legion (the underlying technology that Legate is built on) splits its memory reservation into two pools: the "deferred" pool, used for allocating objects whose size is known ahead of time, e.g. most cuNumeric ndarrays (their shape is known at creation time, even if their contents are the result of some computation), and the "eager" pool, used for allocating scratch memory for tasks, and for arrays produced by operations whose output size is data-dependent, e.g. np.unique.

--eager-alloc-percentage controls the split between these two pools. A value of 1 (1% of the reservation going to the eager pool) is safe as long as the code isn't using operations like np.unique. Setting it to 0 isn't going to make a real difference compared to 1, and might cause operations that need even a small eager allocation to fail.
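To make the distinction concrete, a small illustrative sketch of which pool each kind of allocation would typically come from:

import cunumeric as np

a = np.zeros((1000, 1000))   # shape known at creation time -> backed by the deferred pool
u = np.unique(a)             # output size depends on the data -> served from the eager pool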

We are working to remove this separation of pools, so hopefully this flag won't be relevant in the near future.

s769 commented 1 month ago

Thank you for your reply. After some additional testing, I found something interesting:

Earlier, I was not passing the --launcher mpirun option explicitly, as I thought it was the default. When I added it to my run command, the memory usage was split across nodes as expected, and the --mem-usage flag reported this as well.
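For reference, the full run command now looks roughly like this (the same flags as before, plus the launcher):

legate --launcher mpirun --nodes 2 --gpus 3 --fbmem 38000 --eager-alloc-percentage 1 --mem-usage ./memtest.py 121000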

I am trying out the cholesky.py example code for cunumeric. When I run it for bigger matrices, I get this error:

numpy.linalg.LinAlgError: Matrix is not positive definite

which I found odd, since I believe the example uses the identity matrix as the test case. I'm not sure whether this is related to my value for --eager-alloc-percentage.

As another note, the cholesky.py example shows the factorization but not the triangular solve. Is the np.linalg.solve method of cunumeric configured to use the factored matrix automatically?
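To illustrate, this is roughly the kind of follow-up I have in mind after the factorization (a hypothetical sketch; whether cunumeric's solve exploits the triangular structure here is exactly my question):

import cunumeric as np

n = 10000
A = np.eye(n)               # SPD test matrix, as in the cholesky.py example
b = np.ones(n)              # hypothetical right-hand side
L = np.linalg.cholesky(A)   # the factorization the example already performs
y = np.linalg.solve(L, b)   # forward substitution: L y = b
x = np.linalg.solve(L.T, y) # back substitution: L^T x = y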

(If the follow-up questions regarding cholesky.py are off-topic, I can open another thread for them.)