dcaballe opened 7 months ago
A small note: `_initializer_57_dispatch_0` packs `tensor<256000x3072xf32>` into `tensor<32000x3072x8x1xf32>`. This already takes ~6 GB of RAM by itself during packing.
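For reference, here is a minimal sketch of a pack with these shapes using the upstream `tensor.pack` op (illustrative only, not the actual dispatch IR from the module; the function name is made up and the 8x1 inner tiles are inferred from the destination type above). The f32 source and the packed destination are each about 3.1 GB, so having both live at once during packing lines up with the ~6 GB figure:

```mlir
// Illustrative sketch, not the real _initializer_57_dispatch_0 IR.
// Packs the 256000x3072 f32 tensor into 8x1 inner tiles:
//   source:      256000 x 3072 x 4 B         ~= 3.1 GB
//   destination: 32000 x 3072 x 8 x 1 x 4 B  ~= 3.1 GB
// Both buffers are live while the pack runs, hence ~6 GB peak.
func.func @pack_example(%src: tensor<256000x3072xf32>) -> tensor<32000x3072x8x1xf32> {
  %dest = tensor.empty() : tensor<32000x3072x8x1xf32>
  %packed = tensor.pack %src
      inner_dims_pos = [0, 1] inner_tiles = [8, 1]
      into %dest : tensor<256000x3072xf32> -> tensor<32000x3072x8x1xf32>
  return %packed : tensor<32000x3072x8x1xf32>
}
```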
Yeah, it's critical that we get either compile-time or deploy-time packing implemented. We may be able to skate by with small models (inefficiently), but it doesn't work on big ones.
It looks like Gemma can only run on Pixel 8 if we use the non-DT path for now. Using DT or DT+UK leads to a SIGABRT signal, which is probably due to running out of memory. This is the info I can see in the tombstone:
The backtrace points to `_initializer_57_dispatch_0_pack_f32`, which looks like this (`%5` looks like a huge `f32` allocation):

Just a few ideas: we may want to look at what we are packing here and whether we end up duplicating the same tensors with different layouts. The fact that the dispatch is `dispatch_0` suggests that we might be packing constants/inputs, so perhaps there is a missing hoisting opportunity or some input preprocessing we could do to reduce the memory footprint.

To repro: https://discord.com/channels/689900678990135345/1146173056537079919/1212949110718730260.