google / jetstream-pytorch

PyTorch/XLA integration with JetStream (https://github.com/google/JetStream) for LLM inference
Apache License 2.0

Enable Blockwise Int4 quantized linear layer #84

Closed lsy323 closed 3 months ago

lsy323 commented 3 months ago

User journey

Different quantization methods can be selected with --quantize_type="int8_per_channel"/"int4_per_channel"/"int8_blockwise"/"int4_blockwise" when running run_server.py, run_offline.py, or run_interactive.py. (The README is updated accordingly.)
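
For illustration, a hypothetical invocation (any other flags the scripts require, such as checkpoint and tokenizer paths, are omitted here):

```bash
# Serve with int4 blockwise weight-only quantization.
python run_server.py --quantize_type="int4_blockwise"
```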

Quantization config workflow

Quantization settings are stored in a QuantizationConfig dataclass, and the Environment holds a QuantizationConfig instance. The model instantiates quantized layers based on the quantization config in its environment; a minimal sketch follows.
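
Here is a minimal sketch of that flow, with assumed field names and a hypothetical parsing helper (not this PR's exact API):

```python
# Sketch only: the field names and parse_quantize_type are assumptions for
# illustration, not the exact definitions in this PR.
from dataclasses import dataclass

@dataclass
class QuantizationConfig:
    enable_weight_quantization: bool = False
    num_bits: int = 8           # 8 or 4
    is_blockwise: bool = False  # per-channel when False
    block_size: int = 128       # used only by blockwise quantization
    is_symmetric: bool = True   # asymmetric support is experimental

def parse_quantize_type(quantize_type: str) -> QuantizationConfig:
    """Map a --quantize_type value such as 'int4_blockwise' to a config."""
    bits, mode = quantize_type.split("_", 1)
    return QuantizationConfig(
        enable_weight_quantization=True,
        num_bits=4 if bits == "int4" else 8,
        is_blockwise=(mode == "blockwise"),
    )

# The Environment would hold one QuantizationConfig, and model construction
# would branch on it, e.g. blockwise vs. per-channel quantized linear layers.
```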

Int4 weight loading workflow:

  1. There is no torch.int4 dtype, so convert_checkpoints stores the int4 weights in an int8 container.
  2. When the JAX state_dict is extracted from the checkpoint in engine.py, the int8 tensors are cast to int4 JAX tensors (see the sketch after this list).
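
A sketch of the cast in step 2, assuming a JAX version that ships the jnp.int4 dtype:

```python
# Sketch: int4 values were saved in an int8 container by convert_checkpoints;
# when building the JAX state_dict, we cast them back to int4.
import numpy as np
import jax.numpy as jnp

int8_container = np.array([[-8, 7, 3], [1, -2, 5]], dtype=np.int8)  # int4 range
int4_weight = jnp.asarray(int8_container).astype(jnp.int4)
assert int4_weight.dtype == jnp.int4
```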

New quantization support

  1. Added {int8, int4} x {per_channel, blockwise} quantized linear layers.
  2. Added asymmetric quantization support to the quant/dequant functions and quantized layers; it is experimental and not exposed through the command-line config. A sketch of blockwise quant/dequant with optional asymmetry follows this list.
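
Below is a minimal sketch of blockwise quantization with optional asymmetry, under the assumption that each block of block_size input channels shares one scale and zero point; the function names are illustrative, not this repo's API.

```python
import jax.numpy as jnp

def quantize_blockwise(w, block_size=128, n_bits=8, symmetric=True):
    """Quantize (out_features, in_features) weights per block of input dims.

    Assumes in_features % block_size == 0. Returns (q, scale, zero_point).
    """
    out_f, in_f = w.shape
    wb = w.reshape(out_f, in_f // block_size, block_size)
    qmin, qmax = -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1
    if symmetric:
        scale = jnp.maximum(jnp.max(jnp.abs(wb), axis=-1, keepdims=True) / qmax,
                            1e-8)  # guard all-zero blocks
        zp = jnp.zeros_like(scale)
    else:
        w_min = jnp.min(wb, axis=-1, keepdims=True)
        w_max = jnp.max(wb, axis=-1, keepdims=True)
        scale = jnp.maximum((w_max - w_min) / (qmax - qmin), 1e-8)
        zp = qmin - w_min / scale  # float zero point, kept simple here
    q = jnp.clip(jnp.round(wb / scale + zp), qmin, qmax)
    return q.astype(jnp.int8), scale, zp  # int4 values also ride in int8

def dequantize_blockwise(q, scale, zp):
    """Reconstruct float weights: (q - zp) * scale, blocks flattened back."""
    return ((q - zp) * scale).reshape(q.shape[0], -1)
```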

Changes:

Test: Correctness
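
Not the PR's actual test, but an illustrative correctness check of this kind quantizes, dequantizes, and compares against the float weights; the tolerance below is a loose assumption.

```python
# Illustrative int8 symmetric per-channel round trip.
import jax
import jax.numpy as jnp

w = jax.random.normal(jax.random.PRNGKey(0), (32, 256))
scale = jnp.max(jnp.abs(w), axis=1, keepdims=True) / 127.0
q = jnp.round(w / scale).astype(jnp.int8)
w_hat = q * scale
assert jnp.mean(jnp.abs(w - w_hat)) < 0.02  # loose, illustrative tolerance
```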

lsy323 commented 3 months ago

This needs https://github.com/pytorch/xla/pull/7071 to land on the PyTorch/XLA side; will update the pin once it lands.