pytorch / ao

PyTorch native quantization and sparsity for training and inference
BSD 3-Clause "New" or "Revised" License

[feature request] np.packbits / np.unpackbits, general BitTensors (maybe can be just tensors with dtype torch.bits8 or have a new dtype torch.bits introduced) and bit packed tensors utilities for saving memory / accesses, support for BitTensors wherever BoolTensors are used #292

Open vadimkantorov opened 4 years ago

vadimkantorov commented 4 years ago

A use case: storing a full backtracking pointer matrix for Needleman-Wunsch / CTC alignment becomes affordable if a 2-bit data type is used (a 4x memory saving compared to a uint8 representation). Currently it is possible to do this with bit-manipulation magic, but probably not very efficiently (stores and loads require masking and shifting, and are not fused).
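
For illustration, a minimal sketch of that bit-manipulation approach for 2-bit values (the helper names pack2bit/unpack2bit are hypothetical; values are assumed to lie in [0, 3] and the last dim to be divisible by 4):

import torch

def pack2bit(x):   # x: uint8 tensor with values in [0, 3], last dim divisible by 4
    x = x.view(*x.shape[:-1], -1, 4)
    # four 2-bit values per output byte, first value in the lowest bits
    return x[..., 0] | (x[..., 1] << 2) | (x[..., 2] << 4) | (x[..., 3] << 6)

def unpack2bit(p):  # inverse: recover the four 2-bit values from each byte
    return torch.stack([(p >> s) & 0b11 for s in (0, 2, 4, 6)], dim=-1).flatten(-2)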

Another use case: compressed BoolTensors for binary neural networks

Another use case: extremely low-bit quantized representations.

Is something like this already implemented for quantization? A simple version of this feature could be to provide some explicit utility functions: calculating the size of the holder uint8 tensor, plus fused store and load functions (potentially explicitly batched, i.e. the actual store is delayed until some aligned number of memory lines has arrived).

In NumPy the related functionality is np.packbits and np.unpackbits; however, these are designed to work only with a 1-bit contained type. 2-bit and 4-bit support would be nice as well.
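
For reference, the NumPy behaviour (1-bit only, packing along the bits of uint8, MSB-first by default):

import numpy as np

bits = np.array([1, 0, 1, 1, 0, 0, 1, 0], dtype=np.uint8)
packed = np.packbits(bits)        # array([178], dtype=uint8), i.e. 0b10110010
restored = np.unpackbits(packed)  # back to the original 8 bits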

On the 1-bit side, another related project is RoaringBitmap https://github.com/RoaringBitmap/RoaringBitmap (http://roaringbitmap.org/) - compressed bitsets for set operations.

cc @ezyang @gchanan @zou3519 @bdhirsh @jbschlosser @anjali411 @izdeby

ezyang commented 4 years ago

One difficulty is that in many points of our code we assume all tensor elements are addressable (for example, for views), and this would not be the case with bit-packed tensors.

vadimkantorov commented 4 years ago

I wonder if we could design some explicit pack/unpack/load/store/index utility methods that would be enough for basic usage (like NumPy does with packbits/unpackbits).

Maybe we could have an unpack method that is optimal when the user themselves provides dtype-aligned indexes.

These new BitTensors wouldn't be first-class objects, but utility methods could still be enough for first experimentation.

Maybe the unpack method could be some variant of narrow that performs a single memory access when the index/length are aligned with the container dtype. NestedTensor/vmap could be used to represent the returned list of unpacked byte tensors.

vadimkantorov commented 4 years ago

A simple interface could be packbits / unpackbits like in NumPy, with an additional bitness argument (to support 1-bit, 2-bit and 4-bit) and a dim argument. It should probably also support an out argument for the unpacked uint8 tensor. The unpacked dimension could always be a new zeroth dimension.

vadimkantorov commented 4 years ago

https://www.microsoft.com/en-us/research/uploads/prod/2018/02/KoPhiliposeTashevZarar_ICASSP_2018.pdf suggests that XNOR and POPCNT functionality is useful for 1-bit networks
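
A minimal sketch of what an XNOR + popcount inner product could look like on bit-packed uint8 tensors (the names are hypothetical; the 256-entry lookup table stands in for a native POPCNT op, which PyTorch does not expose):

import torch

_POPCOUNT_LUT = torch.tensor([bin(i).count('1') for i in range(256)], dtype=torch.uint8)

def xnor_popcount_dot(a_packed, w_packed):
    # XNOR marks the bit positions where the packed operands agree, popcount counts them per byte
    matches = torch.bitwise_not(torch.bitwise_xor(a_packed, w_packed))
    n_match = _POPCOUNT_LUT[matches.long()].sum(dim=-1)
    # for a +-1 binarization, the dot product is 2 * n_match - total_bits
    return n_match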

vadimkantorov commented 4 years ago

Arguments that could be helpful for packbits/unpackbits:

1) mask - a bit-mask integer specifying the pack/unpack compress mask, like in compress/expand instructions -> this is slightly more flexible than a single nbits=1|2|4 arg

2) dim -> packing/unpacking along a given dim (during unpacking it can then only be done across an already existing dim; maybe that's fine for dense dimensions)

3) target dim size - may be needed for unpacking to undo the padding

4) out argument

I guess on CPU packbits/unpackbits can be implemented with those compress/expand SIMD instructions, if the op is performed across a contiguous dimension (actually contiguity doesn't matter once the load into a vector register is already done).

vadimkantorov commented 4 years ago

In my code I'd do something like: torch.packbits(something.argmax(dim = -1), mask = 0b11, dim = -1, out = my_uint8_array[k])

vadimkantorov commented 4 years ago

I think one can view this feature request as surfacing compress/expand SIMD functionality to user land and reimplementing it on GPU.

gchanan commented 4 years ago

pack and unpack seem worth doing. The other parts (i.e. compress/expand) could be useful, but I'm not sure it's worth doing -- it seems like at that point you'd be writing specialized ops in C++ anyway.

vadimkantorov commented 4 years ago

I thought that given a mask pack/unpack are precisely equivalent to SIMD compress/expand? Aren’t they?

ezyang commented 4 years ago

We haven't looked! You're probably right :)

vadimkantorov commented 4 years ago

General int4 support request in https://github.com/pytorch/pytorch/issues/33859 is also related

vadimkantorov commented 4 years ago

Another useful feature would be scatter/gather-like functionality for compressing index tensors.

In a practical use case it can help compress the indices of hybrid sparse+dense tensors by a lot: https://discuss.pytorch.org/t/sparse-torch-topk/71832/4

vadimkantorov commented 4 years ago

It seems that <8-bit quantization is starting to appear: https://github.com/pytorch/pytorch/pull/34783 - it seems somewhat related to this discussion

ezyang commented 4 years ago

cc @jspark1105

vadimkantorov commented 4 years ago

related: https://github.com/pytorch/pytorch/issues/36380

vadimkantorov commented 4 years ago

I renamed the issue to enlarge the scope a little bit :) BitTensors could be very helpful for binary neural networks. Even if only a few operators on them are supported (such as packbits/unpackbits, binary operations, popcnt), they are already useful for reducing memory footprint, e.g. for storing masks instead of full inputs when that is sufficient for the backward ops. E.g. in https://github.com/pytorch/pytorch/pull/41034, if a mask is stored, the silu/swish operation becomes bijective provided an additional bit is stored to represent the direction (half-space) away from the function minimum.

vadimkantorov commented 4 years ago

I made a draft (https://gist.github.com/vadimkantorov/30ea6d278bc492abf6ad328c6965613a):

import math
import torch

# slice a tensor along dimension `dim` with slice object `s`
def tensor_dim_slice(tensor, dim, s):
    return tensor[(slice(None),) * (dim if dim >= 0 else dim + tensor.dim()) + (s, )]

# compute the packed shape, the number of values packed per container element, and the bit width
def packshape(shape, dim, mask, dtype):
    nbits_element = torch.iinfo(dtype).bits
    nbits = 1 if mask == 0b00000001 else 2 if mask == 0b00000011 else 4 if mask == 0b00001111 else 8 if mask == 0b11111111 else None
    assert nbits is not None and nbits <= nbits_element and nbits_element % nbits == 0
    packed_size = nbits_element // nbits
    shape = list(shape)
    shape[dim] = int(math.ceil(shape[dim] / packed_size))
    return shape, packed_size, nbits

def packbits(tensor, dim = -1, mask = 0b00000001, out = None, dtype = torch.uint8):
    shape, packed_size, nbits = packshape(tensor.shape, dim = dim, mask = mask, dtype = dtype)
    out = out.zero_() if out is not None else torch.zeros(shape, device = tensor.device, dtype = dtype)
    assert tuple(out.shape) == tuple(shape)
    # every packed_size-th input element lands in the same output element, shifted into its bit slot
    for e in range(packed_size):
        sliced_input = tensor_dim_slice(tensor, dim, slice(e, None, packed_size))
        compress = (sliced_input << (nbits * (packed_size - e - 1)))
        sliced_output = out.narrow(dim, 0, sliced_input.shape[dim])
        sliced_output |= compress
    return out

def unpackbits(tensor, shape, dim = -1, mask = 0b00000001, out = None, dtype = torch.uint8):
    _, packed_size, nbits = packshape(shape, dim = dim, mask = mask, dtype = tensor.dtype)
    out = out.zero_() if out is not None else torch.zeros(shape, device = tensor.device, dtype = dtype)
    assert tuple(out.shape) == tuple(shape)
    # shift each bit field back down and mask it off into its own output element
    for e in range(packed_size):
        sliced_output = tensor_dim_slice(out, dim, slice(e, None, packed_size))
        expand = (tensor >> (nbits * (packed_size - e - 1))) & ((1 << nbits) - 1)
        sliced_input = expand.narrow(dim, 0, sliced_output.shape[dim])
        sliced_output.copy_(sliced_input)
    return out

if __name__ == '__main__':
    shape = (10, 17)
    K = 10
    for nbits in [1, 2, 4, 8]:
        mask = (1 << nbits) - 1
        for dtype in [torch.uint8, torch.int32, torch.int64]:
            for k in range(K):
                x = torch.randint(0, 1 << nbits, shape, dtype = dtype)
                y = packbits(x, mask = mask)
                z = unpackbits(y, mask = mask, dtype = x.dtype, shape = x.shape)
                assert torch.allclose(x, z)

The discussion about including tensor_slice in core is at https://discuss.pytorch.org/t/use-python-like-slice-indexing-across-a-given-dimension/89606/7

vadimkantorov commented 4 years ago

Any advice on how to fuse this properly, and for CUDA? Would torch.jit.script just work?

vadimkantorov commented 4 years ago

An efficient version for padded tensors, where the relevant dim is a multiple of 8 for BoolTensors (a multiple of 4 for 2-bit-valued tensors):

def packbits_padded(tensor, dim = -1, mask = 0b1, out = None, dtype = torch.uint8):
    dim = dim if dim >= 0 else dim + tensor.dim()
    nbits_element, nbits = 8, (1 if mask == 0b00000001 else 2 if mask == 0b00000011 else 4 if mask == 0b00001111 else 8 if mask == 0b11111111 else None)
    nibbles = nbits_element // nbits

    assert tensor.shape[dim] % nibbles == 0

    out = out if out is not None else torch.empty(*tensor.shape[:dim], tensor.shape[dim] // nibbles, *tensor.shape[1 + dim:], dtype = dtype, device = tensor.device)
    shift = torch.arange(nbits_element - nbits, -1, -nbits, dtype = torch.uint8, device = tensor.device)
    shift = shift.view(nibbles, *((1,) * (tensor.dim() - dim - 1)))
    # shift each group of `nibbles` consecutive elements into its bit slot and combine them via a sum (equivalent to OR, since the slots are disjoint)
    torch.sum(tensor.view(*tensor.shape[:dim], -1, nibbles, *tensor.shape[1 + dim:]) << shift, dim = 1 + dim, out = out)
    return out

def unpackbits_padded(tensor, dim = -1, mask = 0b1, out = None):
    dim = dim if dim >= 0 else dim + tensor.dim()
    nbits_element, nbits = 8, (1 if mask == 0b00000001 else 2 if mask == 0b00000011 else 4 if mask == 0b00001111 else 8 if mask == 0b11111111 else None)
    nibbles = nbits_element // nbits

    out = out if out is not None else torch.empty(*tensor.shape[:dim], tensor.shape[dim] * nibbles, *tensor.shape[1 + dim:], dtype = torch.uint8, device = tensor.device)
    shift = torch.arange(nbits_element - nbits, -1, -nbits, dtype = torch.uint8, device = tensor.device)
    shift = shift.view(nibbles, *((1,) * (tensor.dim() - dim - 1)))
    # shift each bit field back down and mask off the low `nbits` bits
    torch.bitwise_and((tensor.unsqueeze(1 + dim) >> shift).view_as(out), mask, out = out)
    return out
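
A quick roundtrip check of these padded variants (a sketch; assumes the packed dim is a multiple of 8 for the default 1-bit mask):

x = torch.randint(0, 2, (4, 64), dtype = torch.uint8)
packed = packbits_padded(x)                       # shape (4, 8), uint8
assert torch.equal(unpackbits_padded(packed), x)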

A couple of issues that would bring further speed-ups:

ezyang commented 4 years ago

If the bitwise ops you need are supported by the fuser, then yes, the JIT fuser should be able to speed some of this up. I didn't read your code, so I don't know for certain whether that's the case.

vadimkantorov commented 4 years ago

Is it possible to ask the fuser to do loop unrolling? It could hopefully exploit op reordering / asynchrony / parallelism. In this case the number of loop iterations is known semi-statically.

Practically speaking, I'm using the padded version now, which runs much faster.

Some problems encountered with JIT: shape = list(out.shape) didn't work; out.shape[:-1] + (1,) didn't work either; iinfo didn't work; slice(None) didn't work; |= and >>= didn't work; typing the shape argument also didn't work (maybe it was possible somehow, but I couldn't quickly figure it out).

ezyang commented 4 years ago

^ @suo

vadimkantorov commented 4 years ago

(code in https://gist.github.com/vadimkantorov/30ea6d278bc492abf6ad328c6965613a)

About shapes: (shape[:dim] + (int(math.ceil(shape[dim] / nibbles)), ) + shape[1 + dim:]) does not JIT-compile:

  File "packbits.py", line 10, in <module>
    def packshape(shape, dim : int = -1, mask : int = 0b00000001, dtype = torch.uint8):
  File "/miniconda/lib/python3.7/site-packages/torch/jit/__init__.py", line 1290, in script
    fn = torch._C._jit_script_compile(qualified_name, ast, _rcb, get_default_args(obj))
RuntimeError:
Arguments for call are not valid.
The following variants are available:

  aten::add.Tensor(Tensor self, Tensor other, *, Scalar alpha=1) -> (Tensor):
  Expected a value of type 'Tensor' for argument 'other' but instead found type 'Tuple[int]'.

  aten::add.Scalar(Tensor self, Scalar other, Scalar alpha=1) -> (Tensor):
  Expected a value of type 'number' for argument 'other' but instead found type 'Tuple[int]'.

  aten::add.out(Tensor self, Tensor other, *, Scalar alpha=1, Tensor(a!) out) -> (Tensor(a!)):
  Expected a value of type 'Tensor' for argument 'other' but instead found type 'Tuple[int]'.

  aten::add(str a, str b) -> (str):
  Expected a value of type 'str' for argument 'a' but instead found type 'Tensor'.

  aten::add.t(t[] a, t[] b) -> (t[]):
  Could not match type Tensor to List[t] in argument 'a': Cannot match List[t] to Tensor.

  aten::add.int(int a, int b) -> (int):
  Expected a value of type 'int' for argument 'b' but instead found type 'Tuple[int]'.

  aten::add.float(float a, float b) -> (float):
  Expected a value of type 'float' for argument 'b' but instead found type 'Tuple[int]'.

  aten::add.int_float(int a, float b) -> (float):
  Expected a value of type 'float' for argument 'b' but instead found type 'Tuple[int]'.

  aten::add.float_int(float a, int b) -> (float):
  Expected a value of type 'int' for argument 'b' but instead found type 'Tuple[int]'.

  aten::add(Scalar a, Scalar b) -> (Scalar):
  Expected a value of type 'number' for argument 'b' but instead found type 'Tuple[int]'.

  add(float a, Tensor b) -> (Tensor):
  Expected a value of type 'Tensor' for argument 'b' but instead found type 'Tuple[int]'.

  add(int a, Tensor b) -> (Tensor):
  Expected a value of type 'Tensor' for argument 'b' but instead found type 'Tuple[int]'.

The original call is:
  File "packbits.py", line 15
    assert bits is not None and nibble is not None and nibble <= bits and bits % nibble == 0
    nibbles = bits // nibble
    return (shape[:dim] + (int(math.ceil(shape[dim] / nibbles)), ) + shape[1 + dim:]), nibbles, nibble
            ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    out = out if out is not None else torch.empty(*tensor.shape[:dim], tensor.shape[dim] // nibbles, *tensor.shape[1 + dim:], dtype = dtype, device = tensor.device)
                                                  ~~~~~~~~~~~~~~~~~~ <--- HERE
    shift = torch.arange(bits - nibble, -1, -nibble, dtype = torch.uint8, device = tensor.device)
    shift = shift.view(nibbles, *((1, ) * (tensor.dim() - dim - 1)))

vadimkantorov commented 4 years ago

Bit packing could also be used for storing masks for backward (dropout, relu, leaky_relu) instead of the full input, for many operations.

vadimkantorov commented 4 years ago

@aayn NumPy's packbits is quite restricted and supports packing only bool arrays. I propose here to expand its functionality to custom masks and arbitrary data types.

The need for fully-fledged BitTensors also deserves some discussion. Same for the relation to existing 2-bit quantization and quantization in general, since BitTensors and low-precision quantization could both be used for binary neural networks.

vadimkantorov commented 4 years ago

More 2-bit packing related to quantization: https://github.com/pytorch/pytorch/pull/43077

vadimkantorov commented 4 years ago

4-bit tensor dtype: https://github.com/pytorch/pytorch/pull/44574

vadimkantorov commented 3 years ago

torch.sum and torch.count_nonzero for bool types should be equivalent - is that actually the case? Do they allocate any intermediate tensors and upcast? Ideally they should also not upcast too much if the maximum possible sum fits into uint16 or uint32.

POPCNT would be the best though :)

vak commented 3 years ago

Yet another case where packed booleans (rather than the byte-sized bool/int8 representation) would have made better use of the very valuable GPU RAM:

import numpy as np
import torch
from torch import from_numpy

sz = 2900*1024**2
print(f"sz={sz/1024**2:.1f}MB") # ==> sz=2900.0MB

tensor_vcf = np.ones(sz, dtype=np.bool_) # just 2.9 GB, but you can't sum it into int32 straightforwardly with the 15 GB of GPU RAM on Google Colab

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
cuda1_vcf = from_numpy(tensor_vcf).to(device)

cuda1_vcf.sum(dtype=torch.int32) # ==>  RuntimeError: CUDA out of memory. Tried to allocate 11.33 GiB (GPU 0; 14.73 GiB total capacity; 2.83 GiB already allocated; 10.96 GiB free; 2.83 GiB reserved in total by PyTorch)

It looks like an int32 copy of the original int8-backed tensor is created to perform .sum(dtype=torch.int32).
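
If that upcast copy is indeed the culprit, one possible workaround (a sketch; the chunk size is arbitrary) is to reduce in chunks so that any temporary stays small:

total = sum(chunk.sum(dtype=torch.int32) for chunk in cuda1_vcf.split(64 * 1024**2))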

vadimkantorov commented 3 years ago

@vak maybe file a separate issue about this

blueardour commented 3 years ago

Mark. Have been searching for such a function for several months. The code above (https://github.com/pytorch/ao/issues/292) works. Expecting a more efficient implementation.

ezyang commented 3 years ago

bumping prio based on user activity

vadimkantorov commented 3 years ago

I think there are several topics here: packing/unpacking; low-bit types in general (and ops for them - maybe different low-bit type representations could exist as well); and more support for bit operations (such as popcnt and maybe some others required for binary neural nets).

vadimkantorov commented 3 years ago

May also be relevant to 1-bit Adam in DeepSpeed: https://www.deepspeed.ai/tutorials/onebit-adam/

Having native support for low-bit tensors could help with prototyping new compressed distributed-optimization techniques and quantization: https://www.deepspeed.ai/news/2020/09/08/onebit-adam-blog-post.html

vadimkantorov commented 3 years ago

This is also useful for image region superpixel representations, especially if indexing is done fast and without allocations

vadimkantorov commented 3 years ago

Probably detectron2 would also make use of this in BitMasks (https://detectron2.readthedocs.io/en/latest/modules/structures.html#detectron2.structures.BitMasks)

ezyang commented 3 years ago

I'm working on a feature that would make it easier for people to develop new dtypes out of tree.

vadimkantorov commented 3 years ago

Related: https://github.com/pytorch/pytorch/pull/65545

vadimkantorov commented 2 years ago

Another useful piece of functionality is setting bits by integer indices and getting the set bits back as integer indices (i.e. representing sets / bitsets).
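
A sketch of what that could look like on top of the packbits/unpackbits draft from the gist above (going through a BoolTensor intermediate; a native BitTensor could do this without materializing the full mask):

import torch

def bitset_from_indices(indices, size):
    mask = torch.zeros(size, dtype = torch.uint8)
    mask[indices] = 1
    return packbits(mask)                      # packbits from the gist above, 1-bit mask

def indices_from_bitset(packed, size):
    return unpackbits(packed, shape = (size,)).nonzero().squeeze(-1)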

vadimkantorov commented 2 years ago

Another useful binary-related function: run-length encoding (RLE), used in segmentation for compressed storage of instance masks (https://github.com/cocodataset/cocoapi/blob/master/PythonAPI/pycocotools/mask.py)
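
A rough sketch of COCO-style RLE for a 1-D BoolTensor (run lengths alternate between 0s and 1s, starting with the count of zeros; the function name is hypothetical, and a packed BitTensor input would make this even cheaper):

import torch

def rle_encode(mask):  # mask: non-empty 1-D BoolTensor
    x = mask.to(torch.int8)
    change = torch.nonzero(x[1:] != x[:-1]).squeeze(-1) + 1
    boundaries = torch.cat([torch.tensor([0], device = x.device), change, torch.tensor([x.numel()], device = x.device)])
    lengths = boundaries[1:] - boundaries[:-1]
    if bool(mask[0]):  # COCO RLE starts with the count of zeros, which may be zero
        lengths = torch.cat([torch.tensor([0], device = x.device), lengths])
    return lengths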

ezyang commented 2 years ago

Quantization has added handling for 1/2/4 bit dtypes, but they didn't handle general striding for these cases. Here is their size formula:

int64_t get_sub_byte_tensor_size(IntArrayRef sizes, size_t dtype_itemsize, at::ScalarType t) {
  // NOLINTNEXTLINE(cppcoreguidelines-init-variables)
  int64_t element_per_byte;
  switch(t) {
    case at::ScalarType::QUInt4x2:
      element_per_byte = 2;
      break;
    case at::ScalarType::QUInt2x4:
      element_per_byte = 4;
      break;
    default:
      element_per_byte = 1;
  }
  // zero dim tensor
  if (sizes.size() == 0) {
    return c10::multiply_integers(sizes) * dtype_itemsize;
  }
  // Consider most inner dim as cols
  int64_t cols = sizes.at(sizes.size()-1);
  int64_t bytes_per_row = cols * dtype_itemsize;
  // align qtensor most inner dim, compute ceil (bytes_per_row / element_per_byte)
  return c10::multiply_integers(IntArrayRef(sizes.data(), sizes.size() - 1)) * at::ceil_div(bytes_per_row, element_per_byte);
}

The assumption seems to be that the innermost dim gets packed and rounded and then the outer dims are normal. This means certain restridings are invalid, in particular all strides must maintain the innermost dimension being contiguous.

vadimkantorov commented 2 years ago

These low-bit dtypes are also useful outside of "quantization" per se.

It would be nice to have the most basic functions: compress a BoolTensor into a BitTensor and decompress it back (or, more generically, into 1/2/4-bit nibble-based containers). A pilot case could be saving the dropout mask as a BitTensor for memory savings. Support for binary comparison ops returning a BitTensor instead of a BoolTensor might be good too (not in operator form). E.g. a case in relu/dropout fusion: https://gist.github.com/vadimkantorov/360ece06de4fd2641fa9ed1085f76d48
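
A sketch of that dropout pilot case, reusing the packbits_padded/unpackbits_padded helpers from the earlier comment (BitPackedDropout is a hypothetical name; assumes x is floating point, x.numel() is a multiple of 8, and p is the drop probability):

import torch

class BitPackedDropout(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, p = 0.5):
        mask = torch.rand_like(x) >= p
        ctx.p, ctx.shape = p, x.shape
        # save 1 bit per element instead of a byte-sized bool mask
        ctx.save_for_backward(packbits_padded(mask.to(torch.uint8).view(-1)))
        return x * mask / (1 - p)

    @staticmethod
    def backward(ctx, grad_out):
        packed, = ctx.saved_tensors
        mask = unpackbits_padded(packed).view(ctx.shape).to(grad_out.dtype)
        return grad_out * mask / (1 - ctx.p), None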

vadimkantorov commented 2 years ago

case at::ScalarType::QUInt4x2: element_per_byte = 2; break;
case at::ScalarType::QUInt2x4: element_per_byte = 4; break;
default: element_per_byte = 1;

It seems that 1-bit isn't supported here - it would correspond to element_per_byte = 8, right?
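
For illustration, a Python mirror of the size formula above with a hypothetical 1-bit entry added ('bits1x8' is a made-up name; it only stands for element_per_byte = 8):

import math

ELEMENT_PER_BYTE = {'QUInt4x2': 2, 'QUInt2x4': 4, 'bits1x8': 8}

def sub_byte_tensor_size(sizes, dtype_itemsize, scalar_type):
    element_per_byte = ELEMENT_PER_BYTE.get(scalar_type, 1)
    if len(sizes) == 0:                           # zero-dim tensor
        return dtype_itemsize
    bytes_per_row = sizes[-1] * dtype_itemsize    # innermost dim is packed and rounded up
    return math.prod(sizes[:-1]) * math.ceil(bytes_per_row / element_per_byte)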

ezyang commented 2 years ago

Oops, you're right!

vadimkantorov commented 2 years ago

Related: https://github.com/pytorch/pytorch/issues/74627

vadimkantorov commented 2 years ago

Another real-world use case: packbits was useful for the InstantNGP implementation by @kwea123: https://github.com/kwea123/ngp_pl/blob/05c96e314175d1de75a468cddbc264202cef5b81/models/csrc/packbits.cu (a fused gt+packbits kernel - so more like 1-bit quantization) - so maybe this could be bound to torch.gt(..., ..., dtype = torch.bit) if BitTensors are implemented

vadimkantorov commented 2 years ago

This would also be useful as an output dtype for torch.bernoulli bool tensor generation, e.g. for dropout masks: https://gist.github.com/vadimkantorov/360ece06de4fd2641fa9ed1085f76d48#file-reludropoutinplace-py-L13

Ideally, BitTensors should be supported in all places where BoolTensors are currently supported (including masked_fill).

ezyang commented 2 years ago

I think the most practical route to BitTensors is to add enough bitwise primitives on int tensors (of some size) as necessary, implement operations as Python decompositions, and then rely on TorchInductor to compile them into reasonable kernels.
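
A minimal sketch of that route (the function name is hypothetical): express a popcount over bit-packed uint8 tensors as plain tensor ops — here the standard byte-wise shift-and-mask popcount trick — and let torch.compile/TorchInductor fuse them into one kernel.

import torch

@torch.compile
def packed_and_popcount(a_packed, w_packed):
    x = a_packed & w_packed
    # byte-wise popcount via shifts and masks, since there is no native popcnt op
    x = (x & 0x55) + ((x >> 1) & 0x55)
    x = (x & 0x33) + ((x >> 2) & 0x33)
    x = (x & 0x0F) + ((x >> 4) & 0x0F)
    return x.sum(dim = -1)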

vadimkantorov commented 2 years ago

There is also the aspect that, for binary ops operating on a float tensor and a BitTensor, some sort of cached reads might need to be implemented because of the different granularity: a single read of the underlying memory element of the BitTensor can correspond to many elements of the float tensor. (This unpacking can be done in multiple ways - the simplest on CUDA is probably to do nothing special and perform the memory read multiple times, hoping it gets cached; on CPU, vectorized reads can be done using masked AVX ops.)

vadimkantorov commented 1 year ago

Related: https://github.com/pytorch/vision/issues/6818#issuecomment-1295163095 on missing vectorization for bitwise shift ops on CPU