gau-nernst opened 1 day ago
Had this idea and discussed briefly with @andrewor14.
Conceptually, the current QAT + FSDP flow looks like this:
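A minimal single-process sketch of that ordering, with a hypothetical `fake_quantize_int8` helper standing in for torchao's fake-quant op (the per-tensor int8 scheme is just for illustration):

```python
import torch

def fake_quantize_int8(w: torch.Tensor) -> torch.Tensor:
    # per-tensor symmetric int8 fake-quant: quantize then immediately
    # dequantize, so values snap to the int8 grid but stay in high precision
    scale = w.abs().amax().clamp(min=1e-12) / 127.0
    return (w / scale).round().clamp(-128, 127) * scale

# current ordering: all-gather first, fake-quantize after
shards = [torch.randn(4, 8) for _ in range(2)]  # one high-precision shard per rank
w_full = torch.cat(shards)          # FSDP all-gather moves 2 bytes/elt at bf16
w_fq = fake_quantize_int8(w_full)   # QAT fake-quant on the gathered weight
y = torch.randn(3, 8) @ w_fq.t()    # matmul still in high precision
```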
However, we can do a low-bit all-gather instead, since the weight can be quantized before the all-gather:
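Reordering the same sketch (reusing `shards` from above; `quantize_int8` is again a hypothetical helper):

```python
def quantize_int8(w: torch.Tensor):
    # real int8 quantization: returns int8 data plus its scale
    scale = w.abs().amax().clamp(min=1e-12) / 127.0
    return (w / scale).round().clamp(-128, 127).to(torch.int8), scale

# proposed ordering: quantize each local shard first, all-gather in int8
q_shards = [quantize_int8(s) for s in shards]
int8_full = torch.cat([q for q, _ in q_shards])  # collective moves 1 byte/elt
# dequantize after the gather, shard by shard with each shard's own scale
w_deq = torch.cat([q.float() * s for q, s in q_shards])
y = torch.randn(3, 8) @ w_deq.t()
```

One subtlety: quantizing per shard uses shard-local statistics, so a per-tensor scheme won't bit-match the current flow; a per-row (channel-wise) scheme with row-sharded weights should.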
In terms of perf, we are basically comparing (ignoring potential fusion surrounding this):
- current: all-gather in bf16, then fake-quantize the full weight on every rank
- proposed: quantize the local shard, all-gather in low-bit (1 byte/elt for int8, 0.5 for packed int4, vs 2 for bf16), then dequantize after the gather
This might be a small perf win, especially when distributed communication is the bottleneck. It might also be useful for the QAT recipes in torchtune.
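To make the bandwidth saving concrete, a rough back-of-envelope (the 7B size is just an example; scales add negligible extra bytes):

```python
# bytes moved per full weight all-gather, for a hypothetical 7B-param model
params = 7e9
print(f"bf16:          {params * 2.0 / 1e9:.1f} GB")  # 14.0 GB
print(f"int8:          {params * 1.0 / 1e9:.1f} GB")  #  7.0 GB
print(f"int4 (packed): {params * 0.5 / 1e9:.1f} GB")  #  3.5 GB
```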
This is probably low priority, so I'm just leaving it here in case anyone is interested in implementing it. The speedup, if any, still needs to be quantified.
In terms of implementation, we can follow the float8 design (https://github.com/pytorch/ao/blob/000a49026459dd1dadf5ca34322d98e7b1680250/torchao/float8/fsdp_utils.py).
This would chain nicely with also doing the matrix multiply in low precision.
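A minimal sketch of that direction, loosely following `WeightWithDynamicFloat8CastTensor` from the linked fsdp_utils.py: a weight subclass implementing FSDP2's `fsdp_pre_all_gather` / `fsdp_post_all_gather` extension hooks. The class name and the per-tensor int8 scheme are assumptions, and the hook signatures have varied across PyTorch versions:

```python
import torch

class Int8AllGatherWeight(torch.Tensor):
    """Hypothetical weight subclass: quantize to int8 before FSDP2's
    all-gather, dequantize after, mirroring the float8 fsdp_utils design."""

    @staticmethod
    def __new__(cls, weight: torch.Tensor):
        # plain subclass sharing storage with `weight` (the float8 version
        # uses a wrapper subclass; this is simplified for the sketch)
        return torch.Tensor._make_subclass(cls, weight, weight.requires_grad)

    # FSDP2 extension hook: called on each rank's local shard before the
    # collective, so the all-gather moves int8 instead of bf16
    def fsdp_pre_all_gather(self, mesh):
        scale = self.detach().abs().amax().clamp(min=1e-12) / 127.0
        int8_shard = (self.detach() / scale).round().clamp(-128, 127).to(torch.int8)
        return (int8_shard,), (scale,)

    # FSDP2 extension hook: called with the gathered int8 data; dequantizing
    # lands the weight on the quantized grid, which is what QAT's fake-quant
    # produces anyway
    def fsdp_post_all_gather(self, all_gather_outputs, metadata, param_dtype, *, out=None):
        (int8_data,) = all_gather_outputs
        (scale,) = metadata
        if out is not None:
            out.copy_(int8_data.to(param_dtype) * scale)
            return
        weight = int8_data.to(param_dtype) * scale
        # also return the gathered inner tensors so FSDP can free them
        return weight, (int8_data,)
```

Following the float8 path further, `fsdp_post_all_gather` could instead return a quantized tensor subclass directly (the float8 code returns a `Float8Tensor`), which is where the low-precision matmul chaining above would come in.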