Open hickscorp opened 7 months ago
The GPU has 6GB, but the OOM happens at a few hundred MB. Something is off, maybe reboot? Your screenshot here also shows 5GB taken by something else.
As soon as XLA allocates, it goes to 5GB upfront. I'll reboot and try again. I thought something was off, yes. Thanks!
@cheshire confirmed - after a reboot, same symptoms. As soon as I start a notebook with XLA-related code, the GPU reserves 5406MB of memory.
I'm aware that this might very well not be XLA related... But given the knowledge I've seen around here, I'd still like to ask :)
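For what it's worth, one knob that may be relevant to that upfront 5406MB reservation: by default EXLA's CUDA client preallocates most of the GPU memory at startup. If I understand the EXLA client options correctly, preallocation can be turned off or capped - a sketch only, with option names to verify against the EXLA version in use:

```elixir
# config/config.exs - hedged sketch of EXLA client options; check
# :preallocate / :memory_fraction against your EXLA version's docs.
import Config

config :exla, :clients,
  cuda: [
    platform: :cuda,
    # Don't grab nearly all GPU memory at startup...
    preallocate: false,
    # ...or cap the fraction of GPU memory XLA may claim.
    memory_fraction: 0.5
  ]

config :nx, :default_backend, {EXLA.Backend, client: :cuda}
```

This wouldn't fix a genuine leak, but it makes the allocator's behavior visible: with preallocation off, nvidia-smi should show memory growing with actual use instead of jumping to ~5.4GB immediately.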
|============================================================| 100% (3438.35 MB)
20:42:13.460 [warning] Allocator (GPU_0_bfc) ran out of memory trying to allocate 6.25MiB (rounded to 6553600) requested by op
20:42:13.463 [info] BFCAllocator dump for GPU_0_bfc
20:42:13.463 [info] Bin (256): Total Chunks: 1, Chunks in use: 1. 256B allocated for chunks. 256B in use in bin. 16B client-requested in use in bin.
20:42:13.463 [info] Bin (512): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
20:42:13.463 [info] Bin (1024): Total Chunks: 59, Chunks in use: 59. 73.8KiB allocated for chunks. 73.8KiB in use in bin. 73.8KiB client-requested in use in bin.
20:42:13.463 [info] Bin (2048): Total Chunks: 176, Chunks in use: 176. 500.0KiB allocated for chunks. 500.0KiB in use in bin. 500.0KiB client-requested in use in bin.
** (RuntimeError) Out of memory while trying to allocate 6553600 bytes.
(exla 0.6.4) lib/exla/device_buffer.ex:55: EXLA.DeviceBuffer.unwrap!/1
(exla 0.6.4) lib/exla/device_buffer.ex:22: EXLA.DeviceBuffer.place_on_device/4
(exla 0.6.4) lib/exla/backend.ex:46: EXLA.Backend.from_binary/3
(bumblebee 0.4.2) lib/bumblebee/conversion/pytorch/loader.ex:79: Bumblebee.Conversion.PyTorch.Loader.object_resolver/1
(unpickler 0.1.0) lib/unpickler.ex:828: Unpickler.resolve_object/2
(unpickler 0.1.0) lib/unpickler.ex:818: anonymous fn/2 in Unpickler.finalize_stack_items/2
(elixir 1.15.7) lib/map.ex:957: Map.get_and_update/3
#cell:htvvgf7xqnlxekxj:6: (file)
20:42:13.770 [info] Total bytes in pool: 5566404864 memory_limit_: 5566405017 available bytes: 153 curr_region_allocation_bytes_: 11132810240
20:42:13.770 [info] Stats:
Limit: 5566405017
InUse: 5486330368
MaxInUse: 5486330368
NumAllocs: 991
MaxAllocSize: 151781376
Reserved: 0
PeakReserved: 0
LargestFreeBlock: 0
An interesting bit here...
In particular, if I play with defn_options: [compiler: EXLA, lazy_transfers: :always], my computer runs out of RAM upfront - and we're talking nearly 16GB of RAM + 50GB of swap.
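For context, this is roughly how those options get wired into the serving. A sketch only - the Bumblebee calls are from memory and should be checked against Bumblebee 0.4.2:

```elixir
# Hedged sketch: a BART summarization/generation serving with
# lazy transfers. Verify function names/options against Bumblebee 0.4.2.
{:ok, model_info} = Bumblebee.load_model({:hf, "facebook/bart-large-cnn"})
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "facebook/bart-large-cnn"})
{:ok, generation_config} =
  Bumblebee.load_generation_config({:hf, "facebook/bart-large-cnn"})

serving =
  Bumblebee.Text.generation(model_info, tokenizer, generation_config,
    # lazy_transfers: :always keeps parameters on the host and streams
    # them to the device per computation - trading GPU memory for host
    # RAM, which would explain the host-side RAM pressure.
    defn_options: [compiler: EXLA, lazy_transfers: :always]
  )
```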
@cheshire ok so I rebooted and tried again. It behaves the same on the host (Fedora) as it does in a CUDA-enabled Docker image.
I'll put a screenshot here, because it's from the Livebook and I can't copy-paste... What's really weird is the upfront allocation of this particular size:
Do you think the drivers on the host are misbehaving?
Well actually - it seems to be behaving slightly differently now... I'm getting:
10:42:32.799 [error] Process #PID<0.3493.0> on node :"dl53oscu-livebook_server@41cae5e4832c" raised an exception
** (Axon.CompileError) exception found when compiling layer Axon.Layers.embedding/3 named decoder_embedder.position_embedding:
** (ArgumentError) indices must be an integer tensor, got {:f, 32}
(nx 0.6.4) lib/nx.ex:14150: Nx.take/3
(pass debug: true to build/compile see where the layer was defined)
Compiling of the model was initiated at:
(bumblebee 0.4.2) lib/bumblebee/text/generation.ex:488: Bumblebee.Text.Generation."__defn:greedy_step__"/10
(bumblebee 0.4.2) lib/bumblebee/text/generation.ex:434: anonymous fn/9 in Bumblebee.Text.Generation."__defn:greedy__"/7
(nx 0.6.4) lib/nx/defn/expr.ex:517: Nx.Defn.Expr.while_vectorized/7
(bumblebee 0.4.2) lib/bumblebee/text/generation.ex:431: Bumblebee.Text.Generation."__defn:greedy__"/7
(nx 0.6.4) lib/nx/defn/compiler.ex:158: Nx.Defn.Compiler.runtime_fun/3
(exla 0.6.4) lib/exla/defn.ex:387: anonymous fn/4 in EXLA.Defn.compile/8
Trying a text summary with BART.
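A side note on that last error: it looks separate from the OOM. Nx.take/3 requires integer indices, so a float position tensor produces exactly that message. A minimal sketch of the distinction, assuming a recent Nx:

```elixir
t = Nx.iota({4, 3})

# OK: integer (s64) index tensor.
Nx.take(t, Nx.tensor([0, 2]))

# Raises ArgumentError - "indices must be an integer tensor, got {:f, 32}" -
# matching the error in the trace above.
Nx.take(t, Nx.tensor([0.0, 2.0]))
```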
As soon as XLA allocates, it goes to 5GB upfront.
Yes, it's expected - the BFC allocator grabs all the memory upfront. But then it's failing to allocate a few hundred MB, so something is interfering. Try running with TF_CPP_VMODULE=bfc_allocator=5 TF_CPP_MIN_LOG_LEVEL=0 to see what's going on. Maybe you're running two XLA processes, and the previous one got all the memory?
Ok so I'm trying now.
TF_CPP_VMODULE=bfc_allocator=5 TF_CPP_MIN_LOG_LEVEL=0 iex -S mix phx.server
will do?
No, no duplicate XLA processes... Well, I don't know. Unless VSCode plugins / code servers start my application, but I doubt it.
It still crashes - but it's no longer crashing my terminal and killing some of my apps. It just fails somewhere, and my supervision tree kicks in and restarts it.
The logs are too long to attach here... @cheshire, could you, off the top of your head, recommend a model that would be "known to work" with this kind of GPU? For example, a small flan_t5 works fine - but it really should. What would be something somewhat "beefier" that I could try, to see if something is wrong?
Thanks a lot!
This issue might stem from this one - https://github.com/elixir-nx/xla/issues/80 - where the lower-level setup isn't right.
It seems to me that something is misbehaving. Following advice I received, I thought I'd see how Livebook behaves on the GPU. I configured the container spec for NVIDIA along with Docker, and am running something like this:
I'm trying to use "out of the box" Smart Cells that involve Nx / Bumblebee / XLA. But as soon as I try to run one, the GPU goes OOM, even though it seems to me it should be able to hold much more (it has around 6GB of memory). I've tried various options, e.g. the model backend: {EXLA.Backend, client: :host} along with different combinations of defn_options: [compiler: EXLA, lazy_transfers: :always], but they just seem to postpone the crash (it happens when the inference runs instead of when the model loads). The error looks like this:
The GPU in this laptop is as follows (definitely not a desktop GPU, but still high end for a laptop):
Is there any way for me to reliably know whether my setup is right (either confirm that the GPU is indeed undersized, or see whether something is inherently wrong with the lower-level setup)? For example, a Livebook notebook with parameters known to work within the spec of this GPU?
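To make the backend option above concrete, this is the shape of what I've been trying - a sketch; Nx.global_default_backend/1 is the standard way to set the default, and the per-tensor override mirrors it:

```elixir
# Hedged sketch: keep tensors on the host-memory client by default so
# that loading model parameters doesn't consume GPU memory; only the
# EXLA-compiled computations then touch the GPU.
Nx.global_default_backend({EXLA.Backend, client: :host})

# A per-tensor override onto the CUDA client is also possible:
t = Nx.tensor([1, 2, 3], backend: {EXLA.Backend, client: :cuda})
```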