Gives the user the option to offload hidden/target layers to the GPU for a speed-up. For the 70B case, I OOM'd at 70 layers. At 60 layers it occasionally used the 2nd GPU ("!! Out of memory (H), moving to device 1"), i.e. you will need more than 24 GB of VRAM to offload 60 layers at 8192 context size.
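For context, a minimal sketch of what per-layer offloading with an OOM spill-over could look like, assuming a PyTorch model whose transformer blocks live in `model.layers`. The names (`offload_layers`, `n_gpu_layers`, `devices`) are illustrative, not this PR's actual API:

```python
import torch

def offload_layers(model, n_gpu_layers: int, devices=("cuda:0", "cuda:1")):
    """Move the first n_gpu_layers blocks onto the GPU; spill to the
    next device when the current one runs out of memory."""
    dev_idx = 0
    for i, layer in enumerate(model.layers):
        if i >= n_gpu_layers:
            break  # remaining layers stay on the CPU
        while True:
            try:
                layer.to(devices[dev_idx])
                break
            except torch.cuda.OutOfMemoryError:
                # free cached blocks, then try the next GPU
                torch.cuda.empty_cache()
                dev_idx += 1
                if dev_idx >= len(devices):
                    raise  # no more devices to spill onto
                print(f"!! Out of memory (H), moving to device {dev_idx}")
    return model
```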
This also includes the `.item()` move commit from #455. This approach is simpler than #456, but has a constant VRAM overhead. Offloading defaults to 0 layers, though, so if people don't touch it, nothing should change.
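As an aside, the usual motivation for moving `.item()` calls is to avoid a GPU-to-CPU synchronization on every step; the sketch below shows that general pattern (accumulate on-device, sync once per interval). This is an assumption about what the #455 commit does, not a reading of its diff:

```python
import torch

def compute_loss(batch: torch.Tensor) -> torch.Tensor:
    # placeholder loss; a real model's forward pass would go here
    return batch.mean()

device = "cuda" if torch.cuda.is_available() else "cpu"
log_interval = 10
total_loss = torch.zeros((), device=device)

for step in range(100):
    batch = torch.randn(8, 16, device=device)
    total_loss += compute_loss(batch).detach()  # stays on the GPU, no host sync
    if (step + 1) % log_interval == 0:
        # a single .item() per interval forces one GPU->CPU sync
        print(f"avg loss: {total_loss.item() / log_interval:.4f}")
        total_loss.zero_()
```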