qwopqwop200 / GPTQ-for-LLaMa

4-bit quantization of LLaMA using GPTQ
Apache License 2.0

Fix multigpu #251

Closed · LQ1234 closed this 1 year ago

LQ1234 commented 1 year ago

Fixes multi-GPU inference (splitting the model's layers across devices via the --layers-dist flag) and adds a generation test.
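
For reference, here is a minimal sketch of how a --layers-dist spec such as 24:24:10:12 could assign contiguous blocks of decoder layers to GPUs. The helper names (parse_layers_dist, distribute_layers) are illustrative, not the functions this PR actually adds:

```python
# Illustrative only: parse_layers_dist / distribute_layers are hypothetical
# helpers sketching the technique, not code taken from this PR.
import torch

def parse_layers_dist(spec: str) -> list[int]:
    # "24:24:10:12" -> [24, 24, 10, 12]: layer count per visible GPU
    return [int(n) for n in spec.split(":")]

def distribute_layers(layers, dist):
    # Place each decoder layer on its assigned device, one contiguous
    # block of layers per GPU, and record the resulting device map.
    assert sum(dist) == len(layers), "counts must sum to the model depth"
    device_of = {}
    idx = 0
    for gpu, count in enumerate(dist):
        for _ in range(count):
            layers[idx] = layers[idx].to(torch.device("cuda", gpu))
            device_of[idx] = gpu
            idx += 1
    return device_of
```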

Expected:

```
CUDA_VISIBLE_DEVICES=0,1,2,3 python llama.py /media/ssd/LLaMA/65B-hf/llama-65b --wbits 4 --groupsize 128 --load /media/ssd/LLaMA/65B-quantized/llama65b-4bit-128g-new.safetensors --layers-dist 24:24:10:12 --test-generation wikitext2
...
The capital of New Mexico is Santa Fe.
```
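
The --test-generation path presumably runs a short greedy decode as a smoke test after the quantized model is loaded and split. A hedged sketch of what such a test could look like, assuming a Hugging Face-style model and tokenizer (the prompt, max_new_tokens, and the generate() call are assumptions, not the PR's exact test):

```python
# Assumed shape of a generation smoke test; the prompt and the use of
# HF generate() are illustrative, not taken from this PR.
import torch
from transformers import AutoTokenizer

def test_generation(model, tokenizer_path: str,
                    prompt: str = "The capital of New Mexico is"):
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
    # Inputs go to the first device; the model then forwards activations
    # across GPUs according to the layer split.
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda:0")
    with torch.no_grad():
        out = model.generate(input_ids, max_new_tokens=8, do_sample=False)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
```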