Same question here. Also, does it matter that a sequence length of 512 is used? If we have a RAM constraint, will it work the same way with 128 or 256 tokens?
Hi, we have not extensively ablated the use of calibration sets. Feel free to try it out and compare the performance!
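For what it's worth, the calibration sequence length is mostly a trade-off between memory and context coverage per sample. Here is a minimal sketch (illustrative only, not the exact `calib_data.py` API) of how the block size enters when building the calibration set:

```python
import torch

def build_calib_blocks(texts, tokenizer, n_samples=128, block_size=512):
    """Tokenize, concatenate, and slice a text corpus into fixed-length blocks."""
    ids = []
    for t in texts:
        ids.extend(tokenizer(t).input_ids)
        if len(ids) >= n_samples * block_size:
            break
    # Keep only whole blocks; each row is one calibration sample of exactly
    # block_size tokens, so no padding or attention mask is needed.
    n_blocks = min(n_samples, len(ids) // block_size)
    ids = torch.tensor(ids[: n_blocks * block_size])
    return ids.view(n_blocks, block_size)
```

With `block_size=128` or `256`, the activation tensors collected during calibration shrink proportionally, at the cost of shorter context per sample; whether that matches the quality of 512 is exactly the kind of ablation mentioned above.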
For a small LM (7B) or NMT encoder-decoder models (< 1B), I found that scaling/clipping was unnecessary; quantizing directly worked just as well. Does that make sense?
The scaling/clipping works for Llama-7B models. I am not sure about the smaller ones or enc-dec models.
I am not saying it does not work :) I am saying that WITHOUT it, it works too ...
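For readers following the thread: "quantizing directly" here means plain round-to-nearest (RTN) group quantization with no activation-aware scaling. A rough sketch of the two variants in pseudo-quantization form (illustrative only, not the repo's implementation; the real AWQ folds the inverse scale into the preceding operator rather than back into the weight):

```python
import torch

def rtn_quantize(w, n_bits=4, group_size=128):
    """Plain round-to-nearest asymmetric quantization per group of input channels.
    Assumes w has shape (out_features, in_features) with in_features divisible
    by group_size."""
    out_f, in_f = w.shape
    wg = w.reshape(out_f, in_f // group_size, group_size)
    w_max = wg.amax(dim=-1, keepdim=True)
    w_min = wg.amin(dim=-1, keepdim=True)
    scale = (w_max - w_min).clamp(min=1e-5) / (2 ** n_bits - 1)
    zero = (-w_min / scale).round()
    q = ((wg / scale).round() + zero).clamp(0, 2 ** n_bits - 1)
    return ((q - zero) * scale).reshape(out_f, in_f)

def awq_style_quantize(w, act_scale, n_bits=4, group_size=128, alpha=0.5):
    """Scale salient input channels up before quantization, then fold the
    inverse scale back so the layer output stays (approximately) unchanged.
    act_scale is a per-input-channel activation magnitude of shape (in_features,)."""
    s = act_scale.clamp(min=1e-5) ** alpha
    w_q = rtn_quantize(w * s, n_bits, group_size)  # quantize the scaled weight
    return w_q / s                                 # fold 1/s back in
```

The observation above is that, for these smaller models, the first variant was already good enough on its own.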
Just curious: how small are the NMT encoder-decoder models that you tried? Do they have transformers in either encoder or decoder?
It was a base Transformer.
Hi, is there any new progress on this issue? I also encountered the same problem.
Hi, I have a question about the calibration data. In calib_data.py, you reorganize the calibration data so that every batch has the same sequence length and no padding is needed. Will this affect the positional embeddings, and in turn the data distribution during calibration? And if I do have to pad the data to a common length, will the padding tokens (with an attention mask) affect the calibration process?
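Not one of the authors, but on the padding part of the question: if you do have to pad, one option is to exclude the padded positions (via the attention mask) when collecting the per-channel activation statistics, so that they do not skew the computed scales. A rough sketch (hypothetical helper, not the repo's code):

```python
import torch

def masked_channel_absmean(hidden, attention_mask):
    """Mean absolute activation per channel, ignoring padded positions.

    hidden:          (batch, seq_len, hidden_dim) activations captured by a hook
    attention_mask:  (batch, seq_len) with 1 for real tokens, 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).to(hidden.dtype)   # (B, T, 1)
    total = (hidden.abs() * mask).sum(dim=(0, 1))          # sum over real tokens only
    count = mask.sum().clamp(min=1)                        # number of real tokens
    return total / count                                   # (hidden_dim,)
```

Whether the concatenated fixed-length blocks shift the positional distribution relative to your deployment inputs is harder to answer in general; it seems like something worth ablating on your own task.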