l-bat closed this pull request 1 week ago
@l-bat are you sure that it gives any speedup? Because Florence-2 is a small model (<1B parameters), I do not think it can get much benefit from weight compression.
@eaidova, you're right; the speedup is quite small at 1.07x. However, we can still benefit from compressing the weights to 4 bits:

Model | FP16, MB | U4, MB | Compression rate |
---|---|---|---|
decoder | 185 | 67 | 2.8 |
decoder_with_past | 172 | 64 | 2.7 |
encoder | 83 | 24 | 3.5 |
image_embedding | 175 | 50 | 3.5 |
text_embedding | 76 | 38 | 2 |
With PTQ, we can achieve a 1.13x speedup, but I don't think this is sufficient to justify adding quantization to this notebook.
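As a side note, the per-model compression rates above (2x–3.5x) fall short of the theoretical 4x for FP16 → INT4. A back-of-envelope sketch of why: group-wise 4-bit quantization stores an extra scale per group, and some layers are typically kept at higher precision. The `group_size=128` and FP16 scales below are illustrative assumptions, not values taken from this PR:

```python
# Rough estimate of the FP16 -> INT4 size ratio for group-wise
# weight quantization. Assumptions (not from the PR): one FP16
# scale per group of 128 weights; zero-points and metadata ignored.

def compressed_ratio(bits: int = 4, group_size: int = 128, scale_bits: int = 16) -> float:
    """Approximate size ratio of FP16 weights vs. packed INTn + per-group scales."""
    fp16_bits = 16
    packed_bits_per_weight = bits + scale_bits / group_size  # payload + scale overhead
    return fp16_bits / packed_bits_per_weight

# Even before keeping any layers in FP16, the ideal ratio is a bit under 4x.
print(round(compressed_ratio(), 2))
```

Layers left uncompressed (e.g. small or accuracy-sensitive ones) pull the observed ratio further down, which matches the 2x figure for `text_embedding`.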