seung-lab / fpzip

Cython bindings for fpzip, a floating point image compression algorithm.
BSD 3-Clause "New" or "Revised" License

Incorrect treatment of array dimensions #11

Open lindstro opened 5 years ago

lindstro commented 5 years ago

First, a big thanks for making these fpzip wrappers available! Skimming the code and seeing user comments about the need to transpose arrays, I wanted to get to the root cause of this issue.

My concerns are with this piece of code:

fpz_ptr[0].nx = data.shape[0]
fpz_ptr[0].ny = data.shape[1]
fpz_ptr[0].nz = data.shape[2]
fpz_ptr[0].nf = data.shape[3]

fpzip expects an array layout with x varying faster than y, y varying faster than z, and z varying faster than f. In other words, nf 3D arrays of dimensions nx ny nz should have this C layout:

float array[nf][nz][ny][nx];

AFAIK, shape[3] is the fastest-varying dimension in a numpy ndarray (with the default C ordering), while shape[0] is the slowest-varying one. If the array is not reshaped, the data will be interpreted incorrectly, leading to poor compression.
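
To make the layout issue concrete, here is a small check of my own (not from the wrapper): in a C-ordered ndarray the last axis has the smallest stride, i.e. it is the fastest-varying one and thus corresponds to fpzip's x axis.

import numpy as np

# shape read as (nf, nz, ny, nx) under fpzip's expected C layout
a = np.zeros((7, 5, 3, 2), dtype=np.float32, order='C')
print(a.strides)  # (120, 24, 8, 4): the last axis has the smallest stride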

Furthermore, fpzip assumes that the data is correlated along x, y, and z, but not along f (nf denotes the number of independent fields/variables/components). Aside from this, fpzip handles single (nf = 1) or multiple (nf > 1) arrays that may be 1D (nx > 1, ny = nz = 1), 2D (nx > 1, ny > 1, nz = 1), or 3D (nx > 1, ny > 1, nz > 1).

Finally, if the ndarray data to be compressed is a stack of F 2D images of dimensions M * N (with M varying fastest), then some remapping of dimensions would be needed so that nx = M, ny = N, nz = 1, nf = F. This would instruct fpzip to make F separate calls to compress 2D arrays. As is, this code would be invoked

if len(data.shape) == 3:
  data = data[:,:,:, np.newaxis]

which results in correct behavior for truly 3D data (modulo transposition) but incorrect behavior in case of a stack of unrelated 2D arrays.

The caller would need to know how to reshape the ndarray to make it 4D with the expected layout.
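
For instance, assuming the shape[0] -> nx mapping above, a Fortran-laid-out stack of F images of size M x N (M varying fastest) would need to be presented with shape (M, N, 1, F). A rough sketch, with made-up sizes:

import numpy as np

M, N, F = 640, 480, 10  # example sizes only
stack = np.asfortranarray(np.random.rand(M, N, F).astype(np.float32))
remapped = stack[:, :, np.newaxis, :]  # shape (M, N, 1, F); a view, no copy
# remapped now maps to nx = M, ny = N, nz = 1, nf = F under the current code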

Absent a more elaborate API that allows specifying which dimensions are correlated and what the underlying layout of the array is, it would seem best to map the last ndarray dimension to nx, the second to last to ny, etc. A more robust solution would be to examine the ndarray strides to determine how to map the dimensions. This still does not address the issue of uncorrelated dimensions.
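
As a rough sketch of that stride-based mapping (my own, not part of the wrapper): sort the axes from smallest to largest stride so the fastest-varying axis becomes nx, the next ny, and so on, padding with 1s for arrays of fewer than four dimensions. Note that this only recovers the layout; it says nothing about which dimensions are correlated.

import numpy as np

def fpzip_dims_from_strides(a):
    axes = np.argsort(a.strides)  # fastest-varying axis first
    dims = [a.shape[ax] for ax in axes] + [1] * (4 - a.ndim)
    return tuple(dims[:4])  # (nx, ny, nz, nf)

a = np.zeros((5, 720, 1440), dtype=np.float32)  # C order: last axis fastest
print(fpzip_dims_from_strides(a))  # (1440, 720, 5, 1)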

william-silversmith commented 5 years ago

Thanks for investigating this issue! I've been working on several libraries, and the most dire issue I've faced, more than the difficulty of any algorithm, has been the variation in strides and in C versus Fortran ordering of arrays. Thank you for the tip about ndarray.strides; I didn't know about it.

Some of the most recent work I've done on this issue is located here, where I rely on the user to specify C or Fortran order.

  cdef int sx = data.shape[2]
  cdef int sy = data.shape[1]
  cdef int sz = data.shape[0]
  cdef float ax = anisotropy[2]
  cdef float ay = anisotropy[1]
  cdef float az = anisotropy[0]

  if order == 'F':
    sx, sy, sz = sz, sy, sx
    ax = anisotropy[0]
    ay = anisotropy[1]
    az = anisotropy[2]

This is in part because, beyond XYZC vs. CZYX array ordering, numpy's indexing is [ row, col, depth ], which could be considered YXZ. Additionally, a programmer can simply imagine, within the hermetic seal of their brain, that an array is in a different orientation and wish to process it that way (or actually address the array as ZYX when the underlying memory appears to be XYZ).

Amusingly, as you noted, the "C" orientation that numpy outputs is in what is usually termed "Fortran" order, while "F" orientation outputs bytes that are ordinarily considered to be "C" order.

x = np.zeros((2,2,2), dtype=np.uint8, order='C')

x[0,0,0] = 1
x[1,0,0] = 2
x[0,1,0] = 3
x[1,1,0] = 4
x[0,0,1] = 5
x[1,0,1] = 6
x[0,1,1] = 7
x[1,1,1] = 8

print(x.tobytes('C'))
print(x.tobytes('F'))
b'\x01\x05\x03\x07\x02\x06\x04\x08'
b'\x01\x02\x03\x04\x05\x06\x07\x08'

I think in the case of uncorrelated (or weakly correlated in the case of the anisotropic data my lab uses) 2D arrays, we could add a flag image_stack=True to the compress function that would invert the third and fourth dimension iff the given third dimension is > 1 and the fourth dimension is 1.
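
A minimal sketch of what that (hypothetical) image_stack flag could do, assuming the 4D (X, Y, Z, 1) convention used elsewhere in this thread:

import numpy as np

def maybe_swap_stack(data, image_stack=False):
    # swap the third and fourth dimensions only when the third is > 1 and the fourth is 1
    if image_stack and data.ndim == 4 and data.shape[2] > 1 and data.shape[3] == 1:
        data = data.swapaxes(2, 3)  # (X, Y, Z, 1) -> (X, Y, 1, Z); returns a view
    return data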

shangmu commented 5 years ago

@william-silversmith Where did you get the info "numpy's indexing is [ row, col, depth ]"?

>>> print(x)
[[[1 5]
  [3 7]]

 [[2 6]
  [4 8]]]

It seems numpy is simply treating [0,1,2] as [0][1][2], and indexing is [ depth, row, col ]:

>>> np.matmul(np.zeros((2,2)),np.zeros((3,1,2)))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: shapes (2,2) and (3,1,2) not aligned: 2 (dim 1) != 1 (dim 1)
>>> np.matmul(np.zeros((2,2)),np.zeros((3,2,1)))
array([[[ 0.],
        [ 0.]],

       [[ 0.],
        [ 0.]],

       [[ 0.],
        [ 0.]]])

>>> np.matmul(np.zeros((2,2)),[1, 2])
array([ 0.,  0.])
>>> np.matmul(np.zeros((2,2)),[[1, 2]])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: shapes (2,2) and (1,2) not aligned: 2 (dim 1) != 1 (dim 0)

william-silversmith commented 5 years ago

I think I was confused and your insight helps a lot. Numpy in 2D was doing row, col (YX) and I didn't understand how it would logically continue the series. Extending it to ZYX makes a lot more sense than going to YXZ, which explains why my linked EDT code works the way it does. However, it remains confusing because this is said to be numpy's "C" order.

william-silversmith commented 5 years ago

@lindstro I think I have something working. Without changing fpzip's internals to traverse the array in F order, your suggestion did substantially reduce the size of the output, but it did not reduce it as much as a transpose.

Here are my results from #12:

import fpzip
import numpy as np
import gcsfs
import pandas as pd
import xarray as xr

gcs = gcsfs.GCSFileSystem(project='pangeo-181919', token='anon')
ds_llc_sst = xr.open_zarr(gcsfs.GCSMap('pangeo-data/llc4320_surface/SST',
                                gcs=gcs), auto_chunk=False)
data = ds_llc_sst.SST[:5, 1, 800:800+720, 1350:1350+1440].values

def test(data, order):
  x = fpzip.compress(data, order=order)
  print(order, len(x), "{:.2f}%".format(len(x) / data.nbytes * 100))

print("Contiguous")
data = np.ascontiguousarray(data)
test(data, 'C')
test(data, 'F')

print("Fortran")
data = np.asfortranarray(data)
test(data, 'C')
test(data, 'F')

Output:

Contiguous
C 10984654 52.97%
F 13081082 63.08%
Fortran
C 11825537 57.03%
F 10067732 48.55%

For contrast, the current version on master would output row 2, 63.08% (the worst one) if used without transposition.

lindstro commented 5 years ago

@william-silversmith It's a bit difficult for me to judge what is being done here. What does "test" do? In particular, how does it feed array dimensions to fpzip? What do the various data dimensions correspond to? Should I interpret this as 5 separate 2D scalar fields of dimensions 1440 x 720 (with 1440 varying fastest)? This should be presented to fpzip as nx = 1440, ny = 720, nz = 1, nf = 5. If the scalar fields are correlated, then nx = 1440, ny = 720, nz = 5, nf = 1 could be beneficial.

I'm not sure that np.ascontiguousarray does anything here, although I'm not a Python programmer. In order for me to understand better, can you print out nx, ny, nz, and nf as presented to fpzip for each of the above calls to test?

william-silversmith commented 5 years ago

In #12 I've added an order parameter to control how fpzip reads nx, ny, nz, and nf. For order='C', it reads them as:

fpz_ptr[0].nx = data.shape[3]
fpz_ptr[0].ny = data.shape[2]
fpz_ptr[0].nz = data.shape[1]
fpz_ptr[0].nf = data.shape[0]

If the array is in Fortran order ('F'), it reads the dimensions in the original way. I added a special case for C order when the array is 3D to ensure that nf is set to 1, which improves the compression ratio beyond the results above. With that special case added, here are the new results:

Contiguous
C 9992639 48.19%
F 13081082 63.08%
Fortran
C 12197109 58.82%
F 10067732 48.55%

You are correct that the np.ascontiguousarray should have no effect in this case. I was just being very explicit. np.asfortranarray should be the same as performing a transpose on the underlying memory buffer without changing the numpy axis order in the python interface. I updated the text of #12 to be more explicit about what I'm doing.
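
To illustrate that last point with a quick example of my own: np.asfortranarray keeps the shape and index order but rewrites the memory so that the first axis varies fastest.

import numpy as np

c = np.arange(6, dtype=np.float32).reshape(2, 3)  # C-ordered buffer
f = np.asfortranarray(c)                          # same values, Fortran layout
print(c.shape, c.strides)    # (2, 3) (12, 4)
print(f.shape, f.strides)    # (2, 3) (4, 8)
print(np.array_equal(c, f))  # True -- indexing unchanged, only the memory moved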

lindstro commented 5 years ago

Thanks for working on this. I'm still unclear on what nx, ny, nz, and nf are set to in each of the four cases that you report results for. This would be useful to know in order to understand how each setting interprets the array. I'm guessing that the first (48.19%) and last (48.55%) settings both interpret the array correctly, but that the array is traversed along one rather than the other dimension first, which will usually have a small but measurable impact on the compression ratio.

fpzip supports 2D (and, less effectively, 1D) data as well. I'm not entirely sure if a single 2D numpy array is handled correctly when specifying C or Fortran order. This would be good to verify by printing (nx, ny, nz, nf).

william-silversmith commented 5 years ago

Here's the same test with the shape and nx, ny, nz, nf printed out before each test executes. In all cases, data.shape was (5, 720, 1440, 1).

Contiguous
    nx:  1440 ny:  720 nz:  5 nf:  1
    C 9992639 48.19%

    nx:  5 ny:  720 nz:  1440 nf:  1
    F 13081082 63.08%

Fortran
    nx:  1440 ny:  720 nz:  5 nf:  1
    C 12197109 58.82%

    nx:  5 ny:  720 nz:  1440 nf:  1
    F 10067732 48.55%

I think the F order version should handle 1D and 2D correctly, as they will be padded with 1s on the right-hand side. However, the C order version will set nx=1 in the 2D case, and nx=1, ny=1 in the 1D case. Let me fix that.

william-silversmith commented 5 years ago

I've updated #12 to handle dimensions in C order for all dimensions up to 4.

william-silversmith commented 5 years ago

@lindstro I'm feeling fairly confident in the latest PR. Is there anything in there you'd like me to change, or any additional data you'd like me to share? If not, I'd be happy to merge it and put out a new version.

lindstro commented 5 years ago

@william-silversmith Let me make sure I understand the above examples. The 'contiguous' and 'Fortran' specifications are applied to the numpy array, which potentially results in a change of layout (data elements are permuted). The 'C' and 'F' specifications are provided to your library, correct? If that's true, then the results make sense to me. As a final sanity check, it might make sense to also print the ndarray strides for each example.

In this example, the ndarray is always 4D. It would be useful to include similar examples for 2D and 3D arrays to verify correctness, as well as examples using multiple fields (nf > 1).

william-silversmith commented 5 years ago

Yes, that's correct. 'C' and 'F' are manually provided to the library via the order parameter.

I conducted a test using some Fortran ordered float32 4D data from my lab sized 128x128x64x3. I made sure that a compression and decompression cycle resulted in identical output for 4D, 3D, 2D, and 1D versions of the array. I also recorded the compression ratios. I'm checking to see if I can share the data.

Input Array Strides

Note: Input strides are in bytes

ndim  input strides              [ nx,  ny, nz, nf]
4     (4, 512, 65536, 4194304)   [128, 128, 64,  3]
3     (4, 512, 65536)            [128, 128, 64,  1]
2     (4, 512)                   [128, 128,  1,  1]
1     (4,)                       [128,   1,  1,  1]

Fortran Order Result

ndim  input bytes  output bytes   ratio
4        12582912       7404747   0.588
3         4194304       2424333   0.578
2           65536         38282   0.584
1             512           368   0.718

Contiguous Order Result

Note: identity test correctly fails under this condition

ndim  input bytes  output bytes   ratio
4        12582912       8613203   0.684
3         4194304       2952513   0.703
2           65536         38282   0.584
1             512           368   0.718

Code

import fpzip
import numpy as np

def identity(image, order):
  return fpzip.decompress(fpzip.compress(image, order=order), order=order)

def print_ratio(img, order='F'):
  res = len(fpzip.compress(img, order=order))
  print(img.nbytes, '\t', res, '\t', res / img.nbytes)

img = np.load('affinity.npy') # 128x128x64x3
print_ratio(img)
assert np.all(img == identity(img, order='F'))

# test 3d, 2d, 1d
for _ in range(3):
  img = img[..., 0]
  result = identity(img, order='F')
  result = np.squeeze(result)
  print_ratio(img)
  assert np.all(img == result)

lindstro commented 5 years ago

Thanks, but I'm not sure this addresses the issue I was trying to get at. That's my fault, as I wasn't being very precise.

First, I suspect that the affinity data you're using is symmetric, such that the 2D slice being compressed is identical in C and Fortran order (it equals its transpose). Is that correct? Otherwise, it seems unlikely that the data would compress to the exact same size regardless of ordering. Can you use a data set that is not symmetric and whose dimensions all differ so that the effects of different orderings can better be assessed? This can be a synthetic data set, e.g., the tensor product of sinusoids. The three fields should not be correlated, however.
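
One way to build such a synthetic array (my own sketch; sizes and frequencies are arbitrary): a tensor product of sinusoids that is smooth in x, y, and z, with all dimensions distinct and a different set of frequencies per field.

import numpy as np

nx, ny, nz, nf = 97, 61, 33, 3  # all dimensions differ
x = np.linspace(0, 2 * np.pi, nx)
y = np.linspace(0, 2 * np.pi, ny)
z = np.linspace(0, 2 * np.pi, nz)

fields = []
for k in range(nf):
    # tensor product of sinusoids; each field gets its own frequencies
    fields.append(np.sin((k + 1) * x)[:, None, None]
                  * np.sin((k + 2) * y)[None, :, None]
                  * np.sin((k + 3) * z)[None, None, :])
data = np.stack(fields, axis=-1).astype(np.float32)  # shape (nx, ny, nz, nf)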

Second, what I was hoping to see for each row are the dimensions of the numpy array being compressed (ndarray.shape), what its strides are in the given (C or Fortran) ordering (ndarray.strides), what nx, ny, nz, and nf are set to as input to fpzip, and the resulting compression ratio.

Finally, to verify correctness, it would be good to test that numpy arrays in any order are handled properly. Ideally the user should not have to specify the order. Rather, the library should map ndarray.shape to the proper nx, ny, nz, and nf values based on ndarray.strides. This would be greatly facilitated if fpzip allowed for the specification of strides, as zfp does, and may be something I will add to the next release. Currently, there's no reasonable way to compress a numpy array with shape=(1920, 1080, 3) (e.g., an RGB image) in C order, as in this example the uncorrelated color component varies fastest. You'd want to compress this as nx=1920, ny=1080, nz=1, nf=3, but with only default strides the data would have to be physically transposed first to shape=(3, 1080, 1920). This is not something that can be rectified by specifying C or Fortran order, as that does not affect the underlying physical layout in memory.
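
To spell out that RGB example with a hedged sketch (not library code): the color axis has the smallest stride in C order, so moving it out of the fastest position requires a physical copy.

import numpy as np

rgb = np.zeros((1920, 1080, 3), dtype=np.float32)      # C order: the color axis varies fastest
planar = np.ascontiguousarray(rgb.transpose(2, 1, 0))  # shape (3, 1080, 1920); forces a real copy
# each color plane is now a contiguous 2D field, so the image can be presented
# as nx=1920, ny=1080, nz=1, nf=3 as described above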

Another issue here is that in the previous example, the three color components are not correlated. But there's no way to inform fpzip whether they are or not. If they were correlated, then you could correctly compress this image with nx=3, ny=1080, nz=1920, nf=1. We run into this issue with zfp also, and we are planning on adding a mechanism for specifying which dimensions are correlated and having zfp extract and compress slices of the array (with no physical reordering) that are correlated in all dimensions. fpzip would benefit from a similar capability. Until there's support for this, the user has to know that the ndarray must be correlated in all but the outermost, 4th dimension (i.e., the leftmost dimension if the array uses C ordering).

william-silversmith commented 5 years ago

First, I suspect that the affinity data you're using is symmetric...

You're correct that I was using a symmetric block (sorry, I had forgotten how many problems that had previously posed for me). I reran the experiment after chopping the block into asymmetric sizes (img = img[:115,:107,:,:]) and it appeared to work. The default test suite contains asymmetric sizes as well and tests for identical recovery, though it doesn't check compression ratios.

Can you use a data set that is not symmetric and whose dimensions all differ so that the effects of different orderings can better be assessed?

I think I need to do a little reading to understand how to make a decent synthetic data source. This seems like a useful test to include.

Ideally the user should not have to specify the order.

I think numpy often provides order "A" (automatic) as the default. However, the decompression routine needs to know which way the original array was compressed, and automatic selection makes the situation mysterious. If a user compressed their Fortran-ordered array with the 'A' option, they would achieve a higher compression factor, but when they decompress the array the output would be represented incorrectly. Without a change to the fpzip encoding format or some kind of file metadata, it's not possible to appropriately remediate the issue.

This would be greatly facilitated if fpzip allowed for the specification of strides, as zfp does, and may be something I will add to the next release.

Without the ability to change how fpzip strides internally, it seems that there are a limited number of cases that the wrapper can handle:

1) Array in C order accessed in C order (arr[x,y,z])
2) Array in F order accessed in F order (arr[x,y,z])
3) Array in C order accessed in F order (arr[z,y,x])
4) Array in F order accessed in C order (arr[z,y,x])
5) 3D F order array with uncorrelated single slices: (X, Y, Z, 1) vs (X, Y, 1, Z) vs (X, 1, Y, Z) vs (1, X, Y, Z). It seems that only the first two options are very significant.
6) 3D C order array with uncorrelated single slices (you document the issues with this scenario above).

The first four cases are handled well by allowing the manual specification of C or F order. Case (5) could be handled by providing a "single_slices=True" parameter that would interchange nz and nf in 3D Fortran arrays (but not C arrays). Unfortunately, there would be no record of this manipulation in the encoding, so if the input to compression is (nx, ny, nz, 1), then the output of decompression is (nx, ny, 1, nz), which means np.all(input == output) returns False. Nonetheless, forcing manual specification of that flag hopefully means that the user understands what they're doing. Alternatively, the user can do array.reshape( (array.shape[0], array.shape[1], 1, array.shape[2]) ) explicitly, which is more intuitive for understanding the decompression output but less intuitive for understanding the input. It's also less discoverable than the flag, and the flag doesn't prevent that strategy, so I lean towards the flag.

william-silversmith commented 5 years ago

Sorry, this was rather undemocratic of me, but I merged the patch under discussion and released fpzip 1.1.3. My lab was running up against a deadline and we discovered that the patch was needed to make our compression process work correctly. This patch is rather limited in scope (applying order) and doesn't preclude some of the ideas we are discussing such as setting strides or inverting the third and fourth index. If this patch causes problems, please let me know and I'll do my best to accommodate them.