tracel-ai / cubecl

Multi-platform high-performance compute language extension for Rust.
https://burn.dev
Apache License 2.0
696 stars 36 forks source link

`MemoryManagement<Storage>::reserve()` can panic with index out of bounds #243

Open Quba1 opened 3 weeks ago

Quba1 commented 3 weeks ago

cubecl_runtime::memory_management::memory_manage::MemoryManagement<Storage>::reserve() (below) can panic when size is bigger than any available pool.

https://github.com/tracel-ai/cubecl/blob/3882ed25b47506d49562c501a179b7468e61702e/crates/cubecl-runtime/src/memory_management/memory_manage.rs#L275-L288

In my case (running burn) my program attempted to allocate 2_925_527_040 bytes with pools max sizes: [0] 4_028_128 B [1] 4_028_128 B [2] 32_225_024 [3] [4] 2_062_401_536 B. So pool_ind was computed as 5 and &mut self.pools[pool_ind] panicked.

The function can panic in edge-cases because per docs partition_point() requires that:

(...) all elements for which the predicate returns true are at the start of the slice and all elements for which the predicate returns false are at the end (...) If this slice is not partitioned, the returned result is unspecified and meaningless, as this method performs a kind of binary search.

and the condition seems to not be checked at any point.

The same issue also occurs a few lines lower in alloc().

The solution is probably a simple bounds check or capping pool_ind to pools.len() - 1 or some more complex error handling.

Happy to help with solving the problem.

nathanielsimard commented 2 weeks ago

Sure would love the help :)

ragyabraham commented 1 week ago

Yup, seeing this in actions when running training

=== PANIC ===
A fatal error happened, you can check the experiment logs here => '/tmp/guide/experiment.log'
=============
thread 'main' panicked at /Users/ragy/.cargo/registry/src/index.crates.io-6f17d22bba15001f/cubecl-runtime-0.3.0/src/memory_management/memory_manage.rs:283:35:
index out of bounds: the len is 6 but the index is 6