Open hickscorp opened 7 months ago
The GPU has 6GB, but the OOM happens at a few hundred MB. Something is off, maybe reboot? Your screenshot here also shows 5GB taken by something else.
As soon as XLA allocates, it goes to 5GB upfront. I'll reboot and try again. I thought something was off, yes. Thanks!
@cheshire confirmed - after a reboot, same symptoms. As soon as I start a notebook with XLA-related code, the GPU reserves 5406MB of memory.
I'm aware that this might very well not be XLA related... But given the knowledge I've seen around here, I'd still like to ask :)
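For what it's worth, one knob that may be relevant to that upfront 5406MB reservation: by default EXLA's CUDA client preallocates most of the GPU memory at startup. If I understand the EXLA client options correctly, preallocation can be turned off or capped - a sketch only, with option names to verify against the EXLA version in use:

```elixir
# config/config.exs - hedged sketch of EXLA client options; check
# :preallocate / :memory_fraction against your EXLA version's docs.
import Config

config :exla, :clients,
  cuda: [
    platform: :cuda,
    # Don't grab nearly all GPU memory at startup...
    preallocate: false,
    # ...or cap the fraction of GPU memory XLA may claim.
    memory_fraction: 0.5
  ]

config :nx, :default_backend, {EXLA.Backend, client: :cuda}
```

This wouldn't fix a genuine leak, but it makes the allocator's behavior visible: with preallocation off, nvidia-smi should show memory growing with actual use instead of jumping to ~5.4GB immediately.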
|============================================================| 100% (3438.35 MB)
20:42:13.460 [warning] Allocator (GPU_0_bfc) ran out of memory trying to allocate 6.25MiB (rounded to 6553600) requested by op
20:42:13.463 [info] BFCAllocator dump for GPU_0_bfc
20:42:13.463 [info] Bin (256): Total Chunks: 1, Chunks in use: 1. 256B allocated for chunks. 256B in use in bin. 16B client-requested in use in bin.
20:42:13.463 [info] Bin (512): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
20:42:13.463 [info] Bin (1024): Total Chunks: 59, Chunks in use: 59. 73.8KiB allocated for chunks. 73.8KiB in use in bin. 73.8KiB client-requested in use in bin.
20:42:13.463 [info] Bin (2048): Total Chunks: 176, Chunks in use: 176. 500.0KiB allocated for chunks. 500.0KiB in use in bin. 500.0KiB client-requested in use in bin.
** (RuntimeError) Out of memory while trying to allocate 6553600 bytes.
(exla 0.6.4) lib/exla/device_buffer.ex:55: EXLA.DeviceBuffer.unwrap!/1
(exla 0.6.4) lib/exla/device_buffer.ex:22: EXLA.DeviceBuffer.place_on_device/4
(exla 0.6.4) lib/exla/backend.ex:46: EXLA.Backend.from_binary/3
(bumblebee 0.4.2) lib/bumblebee/conversion/pytorch/loader.ex:79: Bumblebee.Conversion.PyTorch.Loader.object_resolver/1
(unpickler 0.1.0) lib/unpickler.ex:828: Unpickler.resolve_object/2
(unpickler 0.1.0) lib/unpickler.ex:818: anonymous fn/2 in Unpickler.finalize_stack_items/2
(elixir 1.15.7) lib/map.ex:957: Map.get_and_update/3
#cell:htvvgf7xqnlxekxj:6: (file)
20:42:13.770 [info] Total bytes in pool: 5566404864 memory_limit_: 5566405017 available bytes: 153 curr_region_allocation_bytes_: 11132810240
20:42:13.770 [info] Stats:
Limit: 5566405017
InUse: 5486330368
MaxInUse: 5486330368
NumAllocs: 991
MaxAllocSize: 151781376
Reserved: 0
PeakReserved: 0
LargestFreeBlock: 0
An interesting bit here...
In particular, if I play with defn_options: [compiler: EXLA, lazy_transfers: :always], my computer runs out of RAM upfront - and we're talking nearly 16GB of RAM + 50GB of swap.
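For context, this is roughly how those options get wired into the serving. A sketch only - the Bumblebee calls are from memory and should be checked against Bumblebee 0.4.2:

```elixir
# Hedged sketch: a BART summarization/generation serving with
# lazy transfers. Verify function names/options against Bumblebee 0.4.2.
{:ok, model_info} = Bumblebee.load_model({:hf, "facebook/bart-large-cnn"})
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "facebook/bart-large-cnn"})
{:ok, generation_config} =
  Bumblebee.load_generation_config({:hf, "facebook/bart-large-cnn"})

serving =
  Bumblebee.Text.generation(model_info, tokenizer, generation_config,
    # lazy_transfers: :always keeps parameters on the host and streams
    # them to the device per computation - trading GPU memory for host
    # RAM, which would explain the host-side RAM pressure.
    defn_options: [compiler: EXLA, lazy_transfers: :always]
  )
```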
@cheshire ok so I rebooted and tried again. It behaves the same on the host (Fedora) as it does in a CUDA-enabled Docker image.
I'll put a screenshot here, because it's from the Livebook and I can't copy-paste... What's really weird is the upfront allocation of this particular size:
Do you think the drivers on the host are misbehaving?
Well actually - it seems to be behaving slightly differently now... I'm getting:
10:42:32.799 [error] Process #PID<0.3493.0> on node :"dl53oscu-livebook_server@41cae5e4832c" raised an exception
** (Axon.CompileError) exception found when compiling layer Axon.Layers.embedding/3 named decoder_embedder.position_embedding:
** (ArgumentError) indices must be an integer tensor, got {:f, 32}
(nx 0.6.4) lib/nx.ex:14150: Nx.take/3
(pass debug: true to build/compile see where the layer was defined)
Compiling of the model was initiated at:
(bumblebee 0.4.2) lib/bumblebee/text/generation.ex:488: Bumblebee.Text.Generation."__defn:greedy_step__"/10
(bumblebee 0.4.2) lib/bumblebee/text/generation.ex:434: anonymous fn/9 in Bumblebee.Text.Generation."__defn:greedy__"/7
(nx 0.6.4) lib/nx/defn/expr.ex:517: Nx.Defn.Expr.while_vectorized/7
(bumblebee 0.4.2) lib/bumblebee/text/generation.ex:431: Bumblebee.Text.Generation."__defn:greedy__"/7
(nx 0.6.4) lib/nx/defn/compiler.ex:158: Nx.Defn.Compiler.runtime_fun/3
(exla 0.6.4) lib/exla/defn.ex:387: anonymous fn/4 in EXLA.Defn.compile/8
Trying a text summary with BART.
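A side note on that last error: it looks separate from the OOM. Nx.take/3 requires integer indices, so a float position tensor produces exactly that message. A minimal sketch of the distinction, assuming a recent Nx:

```elixir
t = Nx.iota({4, 3})

# OK: integer (s64) index tensor.
Nx.take(t, Nx.tensor([0, 2]))

# Raises ArgumentError - "indices must be an integer tensor, got {:f, 32}" -
# matching the error in the trace above.
Nx.take(t, Nx.tensor([0.0, 2.0]))
```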
As soon as XLA allocates, it goes to 5GB upfront.
Yes, it's expected - the BFC allocator grabs all the memory upfront. But then it's failing to allocate a few hundred MB, so something is interfering. Try running with TF_CPP_VMODULE=bfc_allocator=5 TF_CPP_MIN_LOG_LEVEL=0 to see what's going on. Maybe you're running two XLA processes, and the previous one got all the memory?
Ok so I'm trying now.
TF_CPP_VMODULE=bfc_allocator=5 TF_CPP_MIN_LOG_LEVEL=0 iex -S mix phx.server
will do?
No, no duplicate XLA processes... Well, I don't know. Unless VSCode plugins / code servers start my application, but I doubt it.
It still crashes - but it's no longer crashing my terminal and killing some of my apps. It just fails somewhere, and my supervision tree kicks in and restarts it.
The logs are too long to attach here... @cheshire, could you, off the top of your head, recommend a model that would be "known to work" with this kind of GPU? For example, a small flan_t5 works fine - but it really should. What would be something somewhat "beefier" that I could try, to see if something is wrong?
Thanks a lot!
This issue might stem from this one - https://github.com/elixir-nx/xla/issues/80 - where the lower-level setup isn't right.
It seems to me that something is misbehaving. Following advice I received, I thought I'd see how Livebook behaves on the GPU. I configured the container spec for NVIDIA along with Docker, and am running something like this:
I'm trying to use "out of the box" Smart Cells that involve Nx / Bumblebee / XLA. But as soon as I try to run one, the GPU goes OOM, even though it seems to me it should be able to hold much more (it has around 6GB of memory). I've tried various options, e.g. the model backend: {EXLA.Backend, client: :host} along with different combinations of defn_options: [compiler: EXLA, lazy_transfers: :always], but they just seem to postpone the crash (it happens when the inference runs instead of when the model loads). The error looks like this:
The GPU in this laptop is as follows (definitely not a desktop GPU, but still high end for a laptop):
Is there any way for me to reliably know whether my setup is right (either confirm that the GPU is indeed undersized, or see whether something is inherently wrong with the lower-level setup)? For example, a Livebook notebook with parameters known to work within the spec of this GPU?
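To make the backend option above concrete, this is the shape of what I've been trying - a sketch; Nx.global_default_backend/1 is the standard way to set the default, and the per-tensor override mirrors it:

```elixir
# Hedged sketch: keep tensors on the host-memory client by default so
# that loading model parameters doesn't consume GPU memory; only the
# EXLA-compiled computations then touch the GPU.
Nx.global_default_backend({EXLA.Backend, client: :host})

# A per-tensor override onto the CUDA client is also possible:
t = Nx.tensor([1, 2, 3], backend: {EXLA.Backend, client: :cuda})
```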