Gives the user the option to offload hidden/target layers to the GPU for a speed-up. For the 70B case, I OOM'd at 70 layers. At 60 layers it occasionally used the 2nd GPU ("!! Out of memory (H), moving to device 1"), i.e. you will need more than 24 GB of VRAM to offload 60 layers at 8192 context size.
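For context, a minimal sketch of what per-layer offloading with an OOM spill-over could look like, assuming a PyTorch model whose transformer blocks live in `model.layers`. The names (`offload_layers`, `n_gpu_layers`, `devices`) are illustrative, not this PR's actual API:

```python
import torch

def offload_layers(model, n_gpu_layers: int, devices=("cuda:0", "cuda:1")):
    """Move the first n_gpu_layers blocks onto the GPU; spill to the
    next device when the current one runs out of memory."""
    dev_idx = 0
    for i, layer in enumerate(model.layers):
        if i >= n_gpu_layers:
            break  # remaining layers stay on the CPU
        while True:
            try:
                layer.to(devices[dev_idx])
                break
            except torch.cuda.OutOfMemoryError:
                # free cached blocks, then try the next GPU
                torch.cuda.empty_cache()
                dev_idx += 1
                if dev_idx >= len(devices):
                    raise  # no more devices to spill onto
                print(f"!! Out of memory (H), moving to device {dev_idx}")
    return model
```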
This also includes the `.item()` move commit from #455. This approach is simpler than #456, but has a constant VRAM overhead. Offloading defaults to 0 layers, though, so if people don't touch it, nothing should change.
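As an aside, the usual motivation for moving `.item()` calls is to avoid a GPU-to-CPU synchronization on every step; the sketch below shows that general pattern (accumulate on-device, sync once per interval). This is an assumption about what the #455 commit does, not a reading of its diff:

```python
import torch

def compute_loss(batch: torch.Tensor) -> torch.Tensor:
    # placeholder loss; a real model's forward pass would go here
    return batch.mean()

device = "cuda" if torch.cuda.is_available() else "cpu"
log_interval = 10
total_loss = torch.zeros((), device=device)

for step in range(100):
    batch = torch.randn(8, 16, device=device)
    total_loss += compute_loss(batch).detach()  # stays on the GPU, no host sync
    if (step + 1) % log_interval == 0:
        # a single .item() per interval forces one GPU->CPU sync
        print(f"avg loss: {total_loss.item() / log_interval:.4f}")
        total_loss.zero_()
```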