mhe / pynrrd

Simple pure-python module for reading and writing nrrd files.
https://pynrrd.readthedocs.io/
MIT License
116 stars 51 forks source link

Add parameter for changing index order #87

Closed simeks closed 5 years ago

simeks commented 5 years ago

This is a proposal for the problem discussed in #75. This should probably be seen as a work-in-progress and a base for further discussion. I'm not sure whether this is the best approach so feedback would be appreciated.

The first addition is the index_order argument for nrrd.read which allows the user to specify the desired index order. I set the default to 'F' so the behavior should stay the same as with previous versions.

I also changed nrrd.write. This function did previously not work on C-contiguous arrays and this should now be fixed. Here I added an index_order flag, but my hope is that it should only be required in a few cases. Similar to most numpy functions there are three options; F, C, and A. The A (default) option automatically detects the order of the array. So for example, when trying to write an array that was read using nrrd.read(index_order='F'), the function should correctly derive the shape and write the array in Fortran order.

At first sight I see only one problem with the A option and that is that it breaks down if you do something like this:

arr = nrrd.read('in', 'F')
arr = np.ascontiguousarray(arr)
np.write('out', arr)

Since np.ascontiguousarray will keep the shape but change the underlying memory order, changing it to C. This is maybe not something we should worry about though.

Also, to by knowledge there is no way to do parameterized tests using the standard unittest module so I did some jury-rigging to test the index_order argument. This could probably use some clean-up.

addisonElliott commented 5 years ago

I just have some general comments for now. I will need to sit down and vigorously test this later.

However, I think the general implementation looks good and I think the general concepts seem right.

Comments

Also, let me make sure I'm understanding this correctly. The index_order parameter, as it states, only affects how the indices are represented. Python typically uses C-order but we are basically just transposing the data to make it Fortran-order.

Thus, for nrrd.write, the index order is just specifying whether we transpose the data again before saving?

Sounds like index_order=C should be the recommended approach. Probably should include some notes so people can be aware.

simeks commented 5 years ago
  • First, we need to fix the tests which from what I saw, it's a minor issue.

Yeah, I don't really understand what went wrong there since it only seems to fail on Python 2.

  • One concern of mine is that using index_order=F will be less efficient to loop through the slowest-varying axis (e.g. the z axis) than if using index_order=C. Although our current method is a bit hacky by reading the image as Fortran-order even though it is C-order to transpose the axes, doesn't it maintain efficiency when looping about the z axis?

Practically this should produce the same result as the current method. The transpose is not actually changing the layout of the data in memory, it is simply reversing the strides and flagging the array as Fortran contiguous. The same thing is happening when you're doing np.reshape(data, tuple(header['sizes']), order='F'). The x-axis (Given x, y, z in Fortran or z, y, x in C) will still be the fastest in memory.

  • I like how you modified the tests to run in C-order and Fortran-order. I'm surprised you didn't have to transpose any of the expected_data in the nrrd.read tests? Did I miss something?

I did change expected_data by setting the order to self.index_order when first reading it from the file. Still a bit unsure whether there are any implications by doing this in the test though. It seemed to work when I did some tests on real data.

  • Since the index_order parameter is likely to be a confusing topic, I think we should have more specialized tests in addition to the ones you've added. Not sure if this is necessary, will need to think on it whether the tests now fully test the capabilities of the index_order parameter.

Yes, definitely. One thing that hit me is that the A option in the writer still is untested. Writing a test for that could be a bit confusing though.

  • Does line 290 raw_data = data.tostring(order=index_order) do what we expect it to?

Hm, I would hope so. tostring writes the element in the order specified by index_order. Given an array of shape (W,H,D), F would produce the results with the W axis being the fastest, independent of the underlying memory layout. C would do the opposite. A on the other hand will write in C order unless the array is flagged as Fortran contiguous.

  • For nrrd.write, do we want to use np.isfortran? I thought that determines the memory order. Don't we just care about the index order?

Yes, this is a bit weird and not really intuitive for me neither. The problem is that numpy is always in C index order so there's no real way to distinguish the index orders. However, this follows the way numpy seems to deal with order using the A option. The alternative would be to force the user to specify either F or C which could cause a lot of confusion. That said, there is definitely room for discussion here.

Also, let me make sure I'm understanding this correctly. The index_order parameter, as it states, only affects how the indices are represented. Python typically uses C-order but we are basically just transposing the data to make it Fortran-order.

Thus, for nrrd.write, the index order is just specifying whether we transpose the data again before saving?

Exactly, and the A option was an attempt to make the write function automatically detect the index_order.

Sounds like index_order=C should be the recommended approach. Probably should include some notes so people can be aware.

In my eyes, definitely. Especially since numpy uses C order. I understand the importance of backward compatibility though.

addisonElliott commented 5 years ago
  • First, we need to fix the tests which from what I saw, it's a minor issue.

Yeah, I don't really understand what went wrong there since it only seems to fail on Python 2.

Try to run the tests locally on Python 2 if you're able. I can rerun the Travis CI if it was just a freak occurrence that it failed.

  • One concern of mine is that using index_order=F will be less efficient to loop through the slowest-varying axis (e.g. the z axis) than if using index_order=C. Although our current method is a bit hacky by reading the image as Fortran-order even though it is C-order to transpose the axes, doesn't it maintain efficiency when looping about the z axis?

Practically this should produce the same result as the current method. The transpose is not actually changing the layout of the data in memory, it is simply reversing the strides and flagging the array as Fortran contiguous. The same thing is happening when you're doing np.reshape(data, tuple(header['sizes']), order='F'). The x-axis (Given x, y, z in Fortran or z, y, x in C) will still be the fastest in memory.

Yeah, you're probably right. Just to be safe, I might play around with this and ensure the two options give the exact same strides.

  • I like how you modified the tests to run in C-order and Fortran-order. I'm surprised you didn't have to transpose any of the expected_data in the nrrd.read tests? Did I miss something?

I did change expected_data by setting the order to self.index_order when first reading it from the file. Still a bit unsure whether there are any implications by doing this in the test though. It seemed to work when I did some tests on real data.

Hmm, this is where things seem a bit odd. The order parameter for the Numpy constructor is referring to the memory order, right? Maybe it would be better to explicitly transpose the data. This would be less likely to confuse people if they're reading the tests. And it will make a clear distinction that we are only altering the index order, not the memory order.

  • Does line 290 raw_data = data.tostring(order=index_order) do what we expect it to?

Hm, I would hope so. tostring writes the element in the order specified by index_order. Given an array of shape (W,H,D), F would produce the results with the W axis being the fastest, independent of the underlying memory layout. C would do the opposite. A on the other hand will write in C order unless the array is flagged as Fortran contiguous.

I'll need to do a bit more research on this, from what I've found there are some cases where the order parameter refers to the memory order and other cases where it refers to the index order. In numpy.tostring, there are three options: C, F, orNone. WhenNoneis given,tostringwill check theF_CONTIGUOUSflag, basically doing the same thing asnumpy.isfortran`.

The dangerous part here is that what if the user has a C-ordered array that they are treating as a Fortran array. A lot of users aren't even paying attention to what memory-order their data is in. For example, let's say I'm used to MATLAB so I want to created a 5 x 10 x 15 array (x, y, z). Here's how I would do it: x = np.zeros((5, 10, 15)). Well, isn't this C-ordered actually? So this would trick np.isfortran and potentially mess up things here I think...

Also, let me make sure I'm understanding this correctly. The index_order parameter, as it states, only affects how the indices are represented. Python typically uses C-order but we are basically just transposing the data to make it Fortran-order. Thus, for nrrd.write, the index order is just specifying whether we transpose the data again before saving?

Exactly, and the A option was an attempt to make the write function automatically detect the index_order.

If we want to truly separate the index order concept from the memory order, then I'm not sure a Numpy array will contain any information on the index order because that is arbitrary. So, I don't think an automatic order parameter is possible. Rather, my vote would be to make the index_order default to 'F' for the read and write functions. Then, at the release of 1.0 maybe we will switch that to 'C'.

This is what users would have to do for C-order which is kinda a pain, but not that bad really.

data, header = nrrd.read(...., index_order='C')
...
nrrd.write('xxx', data, header, index_order='C')
codecov-io commented 5 years ago

Codecov Report

Merging #87 into master will increase coverage by <.01%. The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #87      +/-   ##
==========================================
+ Coverage   99.45%   99.46%   +<.01%     
==========================================
  Files           6        6              
  Lines         368      374       +6     
  Branches      116      119       +3     
==========================================
+ Hits          366      372       +6     
  Misses          1        1              
  Partials        1        1
Impacted Files Coverage Δ
nrrd/reader.py 100% <100%> (ø) :arrow_up:
nrrd/writer.py 98.33% <100%> (+0.02%) :arrow_up:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 77a0af6...4f301d3. Read the comment docs.

simeks commented 5 years ago

Try to run the tests locally on Python 2 if you're able. I can rerun the Travis CI if it was just a freak occurrence that it failed.

Odd, I thought I solved the issue this time since the tests seem to pass when I run them on Python 2. I have to look into this a bit further. First, it seemed like same warnings won't trigger twice on Python 2 unless you tell them to.

Edit: nevermind, the test actually passed. It was just Codecov that complained.

Hmm, this is where things seem a bit odd. The order parameter for the Numpy constructor is referring to the memory order, right? Maybe it would be better to explicitly transpose the data. This would be less likely to confuse people if they're reading the tests. And it will make a clear distinction that we are only altering the index order, not the memory order.

As far as I know, the order in this case only refers to the order you parse the linear data read from RAW_DATA_FILE_PATH. I.e., should x or z be the fastest axis. You're right about clarity though so it would probably be a good idea to do something like this:

if index_order == 'F':
    data = data.T

The dangerous part here is that what if the user has a C-ordered array that they are treating as a Fortran array. A lot of users aren't even paying attention to what memory-order their data is in. For example, let's say I'm used to MATLAB so I want to created a 5 x 10 x 15 array (x, y, z). Here's how I would do it: x = np.zeros((5, 10, 15)). Well, isn't this C-ordered actually? So this would trick np.isfortran and potentially mess up things here I think...

If we want to truly separate the index order concept from the memory order, then I'm not sure a Numpy array will contain any information on the index order because that is arbitrary. So, I don't think an automatic order parameter is possible. Rather, my vote would be to make the index_order default to 'F' for the read and write functions. Then, at the release of 1.0 maybe we will switch that to 'C'.

This is what users would have to do for C-order which is kinda a pain, but not that bad really.

data, header = nrrd.read(...., index_order='C')
...
nrrd.write('xxx', data, header, index_order='C')

Yeah, as mentioned in my first post, I realized that you could quite easily break things by doing something as simple as np.ascontiguousarray. My thought with A was to allow people to use it without thinking but that may be a bad idea. I know from experience that users (including me) are lazy :smile:.

So, I agree, forcing F or C makes sense but this has to be clearly described in the documentation. Hopefully the users who find out about index_order='C' actually found it there and therefore know about the requirements for using nrrd.write. This shouldn't matter as much for the F users since this will continue to be the default behaviour.

addisonElliott commented 5 years ago

Well, I think we're in agreement for what to do now. I'll leave this PR up to you, but let me know if you have questions or need help. Also, may be useful to put [WIP] in the PR title until you're done. Just mention me when you are ready for a review. I will grab this branch and try to do a thorough testing of it before merging it in.

By the way, I probably need to add a threshold to the codecov diff. Having a greater coverage does not mean that your code is tested the best.

@ihnorton @tashrifbillah @AAAArcus Any comments or concerns? This PR will not break any of your code, but if you want to utilize the Python-default C-order index ordering for the data only, a new argument named index_order can be passed to essentially transpose the data in nrrd.read and nrrd.write.

A future version, potentially v1.0.0 whenever that will be, will set the default to use C-order which will break code but ultimately is easy to fix.

simeks commented 5 years ago

@addisonElliott I think this should be all, I hope I'm not forgetting anything. I made sure to clarify this in the user manual as well.

simeks commented 5 years ago

The Github UI is a mess, but I think I covered all the comments.

simeks commented 5 years ago

:+1:

I want to take another look at the unit testing to make sure we're thorough. I think we are, but I just want to make absolutely sure the tests are running as expected.

Yes, of course. There are other approaches as well, like https://pypi.org/project/parameterized/, but I didn't want to add additional dependencies.