tensorflow / compression

Data compression in TensorFlow
Apache License 2.0

Return num of read bits on `tfc.range_decode` #114

Closed bugerry87 closed 2 years ago

bugerry87 commented 2 years ago

Is your feature request related to a problem? Please describe. In my experiment I need to decode the range-encoded string in an exploratory manner, because the shape of my CDF varies from iteration to iteration.

Describe the solution you'd like. I would like tfc.range_decode to return how many bits have been read, perhaps behind an optional flag. Based on that, I can trim the encoded string for the next iteration and decode the next few symbols.

Describe alternatives you've considered. I considered using tfc.unbound_range_en/decode; however, this function is a bit bulky and I don't have overflow symbols.

Additional context. I am wondering why tfc.range_encode cannot handle CDF bins with zero range (that is, with probability 0.0). Supporting them would also simplify my problem, because I could then input CDFs with fixed shapes but masked bins. I have my own range encoder implemented in Python, and it has no problem with 0.0 probabilities, as long as a symbol with probability 0.0 never actually occurs.
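To make the request concrete, here is a minimal sketch of a toy arithmetic coder whose decoder also reports how many bits it consumed. It is not the `tfc` implementation: it uses exact fractions instead of fixed-precision integer arithmetic, and the names `narrow`, `shortest_code`, `encode`, and `decode` are hypothetical. Each symbol may use a CDF of a different shape, and zero-width bins are simply never selected by the decoder.

```python
from fractions import Fraction
from math import ceil


def narrow(low, width, cdf, symbol):
    """Shrink [low, low + width) to the sub-interval owned by `symbol`."""
    total = cdf[-1]
    new_low = low + width * Fraction(cdf[symbol], total)
    new_width = width * Fraction(cdf[symbol + 1] - cdf[symbol], total)
    return new_low, new_width


def shortest_code(low, width):
    """Shortest bit string whose dyadic interval lies in [low, low + width)."""
    m = 0
    while True:
        b = ceil(low * 2**m)
        if Fraction(b + 1, 2**m) <= low + width:
            return format(b, "b").zfill(m) if m else ""
        m += 1


def encode(symbols, cdfs):
    """Encode one symbol per CDF into a bit string."""
    low, width = Fraction(0), Fraction(1)
    for symbol, cdf in zip(symbols, cdfs):
        low, width = narrow(low, width, cdf, symbol)
        if width == 0:
            raise ValueError("symbol with zero probability cannot be encoded")
    return shortest_code(low, width)


def decode(bits, cdfs):
    """Decode one symbol per CDF; return (symbols, number of bits consumed)."""
    value = Fraction(int(bits, 2), 2 ** len(bits)) if bits else Fraction(0)
    low, width = Fraction(0), Fraction(1)
    symbols = []
    for cdf in cdfs:
        for symbol in range(len(cdf) - 1):
            lo, w = narrow(low, width, cdf, symbol)
            if lo <= value < lo + w:  # a zero-width bin can never match
                low, width = lo, w
                symbols.append(symbol)
                break
    # The encoder's choice of code is deterministic given the final interval,
    # so the decoder can recompute its length even with trailing bits present.
    return symbols, len(shortest_code(low, width))


# Two chunks with differently shaped CDFs, concatenated into one stream.
cdf_a, cdf_b = [0, 1, 4], [0, 2, 3, 4]
stream = encode([1, 0], [cdf_a, cdf_b]) + encode([2], [cdf_b])
head, used = decode(stream, [cdf_a, cdf_b])  # decode the first chunk
tail, _ = decode(stream[used:], [cdf_b])     # trim, then decode the next
```

Trailing bits from the next chunk do not disturb decoding, because the code chosen by `shortest_code` pins the value inside the final interval regardless of what follows it.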

jonycgn commented 2 years ago

Hi bugerry87, this sounds like a nice set of additional features that could be useful.

As it happens, we just published a more general range coder in commit 61e7977a6e084fc60359cdb2e3f1005b475c7f1f. It supports en/decoding multiple tensors into the same bit stream. (And we're preparing to make a new release with that soon.) Can you check if this would solve your use case?

I will double check about 0-likelihood symbols. Maybe we can support that in the future.

ssjhv commented 2 years ago

Hi bugerry87,

Sorry for the delayed response. Let me answer on the "additional context" part.

We disabled support for CDFs containing zero-probability bins because of the most common failure case: (1) some values are never observed during training, so their probability collapses to zero, but (2) these values are then encountered during inference, and the only thing we can do is raise an error.

Therefore we designed the op so that, by default, the CDF has a contiguous domain, say [-A, B], and all values in the domain have non-zero probability, i.e., if x \in [-A, B] then Pr(x) \neq 0. For values outside the interval, we require users either to clip the input or to use UnboundedIndexRangeEn[De]code, which falls back to non-entropy coding for outliers.
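As an illustration of that default, the sketch below quantizes a PMF into an integer CDF in which every symbol keeps at least one count, and clips inputs to the covered domain. This is only one possible scheme under stated assumptions, not the routine the library uses: `quantize_pmf` and `clip_symbols` are hypothetical helper names, and the CDF is assumed to end at `1 << precision`.

```python
from fractions import Fraction


def quantize_pmf(pmf, precision):
    """Integer CDF ending at 1 << precision with no zero-width bin."""
    total = 1 << precision
    n = len(pmf)
    if n > total:
        raise ValueError("precision too small to give every symbol a count")
    mass = sum(Fraction(p) for p in pmf)
    # Reserve one count per symbol, then share the rest proportionally.
    spare = total - n
    counts = [1 + (Fraction(p) * spare) // mass for p in pmf]
    # Flooring leaves a few counts over; give them to the likeliest symbol.
    likeliest = max(range(n), key=lambda i: pmf[i])
    counts[likeliest] += total - sum(counts)
    cdf = [0]
    for c in counts:
        cdf.append(cdf[-1] + c)
    return cdf


def clip_symbols(values, lo, hi):
    """Clip values into the contiguous domain [lo, hi] covered by the CDF."""
    return [min(max(v, lo), hi) for v in values]


# A symbol never seen in training (probability 0.0) still gets a small bin.
cdf = quantize_pmf([0.5, 0.0, 0.5], precision=4)
clipped = clip_symbols([-5, 0, 7], lo=-2, hi=3)
```

The cost of the reserved count is a slightly longer code for the likely symbols, in exchange for never hitting an unencodable value inside the domain.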

However, if this is getting in the way of your research, there are options.

  1. Pass debug_level=0 to both RangeEncode and RangeDecode. This option disables the check that the CDF is strictly monotonic, so be careful when disabling it. For reference, please see https://github.com/tensorflow/compression/blob/369d398be937983b3abb7c5445400a6f5d55ffc9/tensorflow_compression/cc/ops/range_coding_ops.cc#L36

  2. As Johannes suggested, using the new range coder ops could solve your problem, because we haven't added those runtime checks to those ops yet. For reference, please see https://github.com/tensorflow/compression/blob/369d398be937983b3abb7c5445400a6f5d55ffc9/tensorflow_compression/cc/ops/range_coder_ops.cc
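For readers unfamiliar with what a strict-monotonicity check rules out, here is a rough Python sketch. It is not the op's C++ code: `check_cdf` is a hypothetical helper, its `strict` flag only mimics the effect described for debug_level=0, and the requirement that the CDF ends at `1 << precision` is an assumption.

```python
def check_cdf(cdf, precision, strict=True):
    """Raise ValueError if `cdf` is not a usable quantized CDF."""
    if cdf[0] != 0 or cdf[-1] != 1 << precision:
        raise ValueError("CDF must start at 0 and end at 1 << precision")
    for left, right in zip(cdf, cdf[1:]):
        if right < left:
            raise ValueError("CDF must never decrease")
        if strict and right == left:
            raise ValueError("zero-width bin: CDF is not strictly monotonic")


masked = [0, 8, 8, 16]                       # the middle symbol is masked out
check_cdf(masked, precision=4, strict=False)  # accepted without the strict check
```

With the strict check off, a masked bin is accepted, and it becomes the caller's responsibility never to encode the masked symbol.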

jonycgn commented 2 years ago

Closing due to inactivity. Please reopen if this is still an issue.