microsoft / TransformerCompression

For releasing code related to compression methods for transformers, accompanying our publications
MIT License

Question about perplexity results shown in the paper #118

Closed: moonlightian closed this issue 7 months ago

moonlightian commented 7 months ago

(Image attached.) Hi! Thank you for your excellent work! SparseGPT 2:4 sets 50% of the parameters to zero, which makes the sparsity 50%. Would it be more persuasive to compare against SliceGPT with a 50% slicing rate here instead?
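
For context, a minimal sketch of what 2:4 sparsity means, assuming PyTorch; the helper below is purely illustrative (not part of this repository) and uses a simple magnitude criterion, whereas SparseGPT itself selects which entries to prune with a second-order rule:

```python
import torch

def apply_2_4_sparsity(weight: torch.Tensor) -> torch.Tensor:
    # Group the weights in fours and zero the 2 smallest-magnitude entries in
    # each group, so exactly 50% of parameters become zero. (Magnitude is used
    # here only for illustration; SparseGPT chooses the pruned entries with a
    # Hessian-based criterion.)
    flat = weight.reshape(-1, 4)
    _, idx = flat.abs().topk(2, dim=1, largest=False)  # 2 smallest per group of 4
    mask = torch.ones_like(flat)
    mask.scatter_(1, idx, 0.0)                         # zero those positions
    return (flat * mask).reshape(weight.shape)

w = torch.randn(8, 8)
w_24 = apply_2_4_sparsity(w)
print((w_24 == 0).float().mean().item())  # 0.5, i.e. 50% sparsity
```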

nailimixaM commented 7 months ago

Thanks @moonlightian, great observation - here's what we said in the main text:

We note that the WikiText2 perplexity of SliceGPT at 50% is worse than SparseGPT 2:4, but the throughput is much higher than could be achieved with a sparse method that does not slice X. (X = the activations flowing through the transformer)

Structured sparsity methods (like SliceGPT) cannot outperform unstructured sparsity, all else being equal, but they trade off some of that performance for other benefits, such as token throughput and memory footprint, that are not easily achieved with unstructured sparsity methods.
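
To make that trade-off concrete, here is a small, hypothetical PyTorch sketch (not the repository's code) contrasting the two approaches: a sparsity mask keeps the dense matrix shapes, so a standard matmul does the same amount of work, whereas slicing shrinks both the weights and the activations X, which is where the throughput and memory gains come from.

```python
import torch

d, d_sliced, batch = 4096, 2048, 16        # 50% slicing rate; sizes are arbitrary
X = torch.randn(batch, d)                  # activations flowing through the model
W = torch.randn(d, d)                      # a dense weight matrix

# Unstructured / 2:4-style sparsity: the zeros live inside a full-size matrix,
# so a standard dense matmul still multiplies the same-shaped tensors.
mask = (torch.rand_like(W) > 0.5).float()
y_sparse = X @ (W * mask)                  # still a (16, 4096) @ (4096, 4096) product

# Slicing: whole rows/columns are removed (SliceGPT chooses which ones after
# applying orthogonal rotations; taking the leading ones here only illustrates
# the shapes). Both the weights and the activations X get smaller.
W_small = W[:d_sliced, :d_sliced]
X_small = X[:, :d_sliced]
y_sliced = X_small @ W_small               # a genuinely smaller (16, 2048) @ (2048, 2048) product

print(y_sparse.shape, y_sliced.shape)      # torch.Size([16, 4096]) torch.Size([16, 2048])
```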