sands-lab / grace

GRACE - GRAdient ComprEssion for distributed deep learning
https://sands.kaust.edu.sa/project/grace/
BSD 2-Clause "Simplified" License

Global or Local compression #9

Closed ABukharev closed 4 years ago

ABukharev commented 4 years ago

Dear authors,

I used the issue form to submit a general question about implementation details. The framework works just fine. Thanks, great job! :)

You utilized Horovod's API to construct the compression (Compressor) and error-feedback (ResidualMemory) classes. As far as I know, the key methods (compress/decompress) are applied to each tensor (each layer's parameters) separately (please correct me if I am wrong). In some cases (e.g., top-k compression) statistics gathered from all the layers are required (see https://arxiv.org/pdf/1712.01887.pdf for an example).

My question is this: is it correct to apply compression methods that were designed to compress the entire update (like top-k) to each layer separately?
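To make sure I am describing the same thing, here is a minimal sketch of the per-tensor pattern I mean. It is illustrative only, not your actual implementation; the class name and the compress_ratio argument are just placeholders:

```python
import torch

# Illustrative sketch only (not the actual GRACE implementation). compress/decompress
# operate on a single gradient tensor, i.e. one layer at a time, so top-k picks the
# k largest-magnitude entries within that tensor and never sees cross-layer statistics.
class LayerwiseTopKCompressor:
    def __init__(self, compress_ratio=0.01):
        self.compress_ratio = compress_ratio

    def compress(self, tensor):
        flat = tensor.flatten()
        k = max(1, int(flat.numel() * self.compress_ratio))
        _, indices = torch.topk(flat.abs(), k)      # top-k within this layer only
        values = flat[indices]
        return (values, indices), tensor.shape      # shape is needed to decompress

    def decompress(self, compressed, shape):
        values, indices = compressed
        flat = torch.zeros(shape, dtype=values.dtype).flatten()
        flat[indices] = values                      # scatter the kept entries back
        return flat.reshape(shape)
```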

If possible, please share your standpoint!

ABukharev commented 4 years ago

Please let me know if my question is not clear! I will try to provide a simple example.

aritra-dutta commented 4 years ago

Hi Aleksandr,

Thank you for your question! It was very clear and you are due for a treat!

Indeed, we can point you to a paper. Please check out our recent AAAI 2020 paper, where we answered exactly the question you have in mind; that is, we provided a detailed insight into the difference between layer-wise and entire-model compression. In that paper, we argued that, in light of the present theoretical analysis of compressed SGD in a distributed setup, the noise incurred by layer-wise compression is bounded above by the noise incurred when the same compression is applied to the entire DNN model. That is, one can derive a sharper noise bound if the compression is applied in a layer-wise fashion. However, this observation does not always translate into practice: for example, for top-k compression with a small sparsification ratio k and a small model, entire-model compression outperforms layer-wise compression.
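To make the distinction concrete, here is a small sketch (hypothetical tensors and helper names, not taken from the paper's experiments or from GRACE) contrasting top-k applied per layer with top-k applied once to the concatenated gradient:

```python
import torch

# Hedged illustration: layer-wise top-k vs. top-k applied once to the
# concatenated ("entire model") gradient. Tensors and sizes are made up.
def topk_sparsify(flat, ratio):
    k = max(1, int(flat.numel() * ratio))
    mask = torch.zeros_like(flat)
    _, idx = torch.topk(flat.abs(), k)
    mask[idx] = 1.0
    return flat * mask

grads = [torch.randn(1000), torch.randn(10)]   # two "layers" of very different size
ratio = 0.01

# Layer-wise: each layer keeps its own top ratio * numel entries (at least one per layer).
layerwise = torch.cat([topk_sparsify(g, ratio) for g in grads])

# Entire model: concatenate first, pick the global top entries, split back afterwards.
flat = torch.cat(grads)
entire = topk_sparsify(flat, ratio)

# The two schemes generally keep different coordinates, so the incurred compression
# noise ||g - C(g)|| differs; this is the quantity whose bounds are compared in the
# AAAI 2020 analysis.
print("layer-wise noise:  ", torch.norm(flat - layerwise).item())
print("entire-model noise:", torch.norm(flat - entire).item())
```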

As you have noticed, the present theoretical analyses of compressed, communication-efficient distributed SGD do not consider the layer-wise artefact when analyzing the performance of the implemented compressor(s). We were the first to point out this discrepancy, irrespective of the nature of the compressor used---biased or unbiased! For the full version of our paper with proofs, please check our technical report.

Again many thanks for your interest in our work. Stay in touch.

Best, Aritra