turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

How much time should it take to convert Llama-2 70B to 2-bit on a single V100 32GB? #139

Closed zlh1992 closed 9 months ago

zlh1992 commented 11 months ago

python convert.py \
    -i ./Llama-2-70b-hf/ \
    -o ./Llama-2-70b-hf/temp/ \
    -c test.parquet \
    -cf ./Llama-2-70b-hf/2bpw/ \
    -b 2

It has already been running for 10 hours:

-- rfn max:1600 bpw:2.17263

turboderp commented 11 months ago

Yeah, sorry, it's a pending todo item to add some sanity checks on the arguments.

It can't produce a 2-bit model, so it fails to find a solution that averages exactly two bits per weight. The minimum it can reach is 2.173, by the looks of it. The issue is that the bitrate it targets is the actual average number of bits used per weight, including the overhead of scales and other parameters.
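(As a rough illustration of that overhead, not the exact storage format exllamav2 uses: if a group of 32 weights were stored at 2 bits each plus one 16-bit scale, the effective rate would already be (32 × 2 + 16) / 32 = 2.5 bits per weight.)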

So you'd have to stop it and try again with a higher bitrate (2.173 or more). At least the measurement it's already done is valid for any target bitrate, so as long as you save the measurement.json file in the work directory, you can pass it to the quantizer with the -m argument and it will resume from the optimizing step.

Also, 2.17 is still a very low bitrate. I haven't had much coherent output from 70B models below 2.35 bpw, but you can try, of course.
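For example, a resumed run might look something like the command below. This assumes the measurement.json from the first run was written to the -o working directory; the 2.4 bpw target and the 2.4bpw output directory name are just illustrative choices for "something above 2.35":

python convert.py \
    -i ./Llama-2-70b-hf/ \
    -o ./Llama-2-70b-hf/temp/ \
    -c test.parquet \
    -cf ./Llama-2-70b-hf/2.4bpw/ \
    -m ./Llama-2-70b-hf/temp/measurement.json \
    -b 2.4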

zlh1992 commented 11 months ago


thx! I got it.