nipy / nibabel

Python package to access a cacophony of neuro-imaging file formats
http://nipy.org/nibabel/
Other
649 stars 258 forks source link

RFC: get_data deprecation #834

Closed effigies closed 4 years ago

effigies commented 4 years ago

I've noticed that nilearn and nistats are working around the deprecation of img.get_data() with a public get_data() function.

It would be nice to hear what the specific use cases are that make this preferable to moving to get_fdata() or np.asarray(img.dataobj). Is it just that the latter is a pain to type? The loss of caching?

I think there are at least three possibilities:

1) Continue as-is, and nilearn will have a get_data() helper function 2) Un-deprecate img.get_data() 3) An alternative method that covers the non-get_fdata() parts of get_data()

What makes sense to people?

And just process-wise, it would be nice to know how we can get feedback before pushing people to work around changes to the API. I'm mostly aware of this because I'm watching the nistats repository, but the planned get_data() replacement and deprecation has been on the mailing list and in the documentation for a couple years.

cc @jeromedockes @GaelVaroquaux @bthirion @kchawla-pi @nipy/team-nibabel

GaelVaroquaux commented 4 years ago

I've noticed that nilearn and nistats are working around the deprecation of img.get_data() with a public get_data() function.

Indeed, nibabel's API is a moving target, and it's frequent modifications break users' script and documentation available across internet.

It would be nice to hear what the specific use cases are that make this preferable to moving to get_fdata()

Number one problem is moving target. We are unfortunately progressively abstracting out nibabel's API because changes of behavior are difficult to follow. In particular, they end up corrupting example scripts with switches to select functions depending on the nibabel version. This goes against our work to keep those scripts as simple as possible.

It would be good to really try very hard to not deprecate things. The Internet has a "memory" of multiple years: outdated answers on stack-overflow like sites, old tutorials, slides, code for articles. Each deprecation sends ripples that take several years to get over and our code is full of shims due to supporting multiple versions of nibabel.

Beyond the problem of moving target, one thing that I move worry (but maybe this has been discussed) is the output dtype. We are more and more moving towards default to float32. Indeed, with the increasing size of data, memory usage is more and more a problem. In addition, we have found that there is seldom enough SNR in neuroimaging data to require float64.

matthew-brett commented 4 years ago

Gael - I know if get_data, get_fdata in terms of API change. Were you thinking of something else?

Indeed, we do try very hard not to deprecate. You probably remember the discussions from 2015 to mid 2017 where we hashed this through. At the time the idea was to:

jeromedockes commented 4 years ago

Beyond the problem of moving target, one thing that I move worry (but maybe this has been discussed) is the output dtype. We are more and more moving towards default to float32. Indeed, with the increasing size of data, memory usage is more and more a problem. In addition, we have found that there is seldom enough SNR in neuroimaging data to require float64.

I think that even more compelling is the case of atlases or masks, where we want the data to retain its integer type.

Another change is getting a copy rather than a reference to the image's cache, although we should always create a new image rather than modify this data array in-place.

matthew-brett commented 4 years ago

Yes, there will surely be times when you want to control the output dtype for memory purposes, or because you really want integers. We can go back through it, if it's helpful, but the discussion up til 2017 was about whether output data type should depend on the presence of scaling in the header, which is often arbitrary; most image libraries don't intend for scaling to be interpreted that way. We decided it should not, hence the change.

I am not sure what you're referring to with the cache / copy change. get_fdata continues to return a cached version, like get_data.

jeromedockes commented 4 years ago

I think there are at least three possibilities:

1) Continue as-is, and nilearn will have a get_data() helper function 2) Un-deprecate img.get_data() 3) An alternative method that covers the non-get_fdata() parts of get_data()

What makes sense to people?

I think that solution 2) would have been best, but now that we have added get_data we may hold on to it, because for example it works on filenames or lists of filenames or glob patterns.

And just process-wise, it would be nice to know how we can get feedback before pushing people to work around changes to the API. I'm mostly aware of this because I'm watching the nistats repository, but the planned get_data() replacement and deprecation has been on the mailing list and in the documentation for a couple years.

I cannot speak for others, but for my part I simply missed this discussion. I was less involved in nilearn at the time, and I must admit I do not often read the nibabel documentation as I almost exclusively interact with nibabel through nilearn. thus I only became aware of the deprecation when this PR was open in nilearn: https://github.com/nilearn/nilearn/pull/2155 Then I did not realize that un-deprecating get_data was an option, though in retrospect reviving this discussion on the nibabel repo should have been the first thing to do.

np.asarray(img.dataobj) would have worked (we just would have needed to complete a bit the interface of some nilearn mocks that mimick an Nifti1Image), but either way I think we would have wrapped this in a function.

In any case, the problem was not having to work around the interface, but having to change many lines of code in the nilearn modules and examples (and users will have to do the same). No matter how long the deprecation stays in the documentation, if it is a breaking change at some point users will pay the cost of editing (in this case hudreds of lines of) code. Again, I am sorry for not participating in the nibabel discussion when it happened, nor trying to revive it when the deprecation cycle started before fixing it in nilearn. Solution "(2) un-deprecate get_data" would have worked perfectly for nilearn.

jeromedockes commented 4 years ago

I am not sure what you're referring to with the cache / copy change. get_fdata continues to return a cached version, like get_data.

sorry, scratch that part then

matthew-brett commented 4 years ago

Unfortunately, as I'm sure you know, the low-level details often mean that the option that seems easiest for the higher-level libraries is a bad option for the long term - and I think get_data is one such instance. For the reasons, I can point you to the discussion, if you're interested.

I personally think that undeprecating get_data would be a terrible idea, especially at this stage. It's results were extremely confusing to beginners, and some non-beginners, and it is only possible to understand why it generates what it does, when you know some nasty low-level details of the Nifti format, that vary more or less randomly across producers of Nifti images.

I think this is the only instance we have of a 'moving-target' API, at least, of any significance, but I'm happy to be corrected.

jeromedockes commented 4 years ago

I am not sure what you're referring to with the cache / copy change. get_fdata continues to return a cached version, like get_data.

sorry, scratch that part then

although that point remains for the np.asarray(img.dataobj, dtype=...)

matthew-brett commented 4 years ago

Having said that, I could imagine adding get_data back with a more scary name, like get_data_hdr_scaling or something.

jeromedockes commented 4 years ago

Unfortunately, as I'm sure you know, the low-level details often mean that the option that seems easiest for the higher-level libraries is a bad option for the long term - and I think get_data is one such instance. For the reasons, I can point you to the discussion, if you're interested.

I saw this explanation: https://github.com/nipy/nibabel/wiki/BIAP8 , if this not what you mean I am interested. I understand the motivation for preferring get_fdata, and I guess we should add a dtype argument to nilearn.image.get_data for the same reasons. Still, we will retain the possibility to avoid casting to float for atlases, masks and ROIs.

I personally think that undeprecating get_data would be a terrible idea, especially at this stage.

Nilearn and Nistats do not rely on get_data anymore; the deprecation has been handled for these two libraries.

jeromedockes commented 4 years ago

Beyond the problem of moving target, one thing that I move worry (but maybe this has been discussed) is the output dtype. We are more and more moving towards default to float32. Indeed, with the increasing size of data, memory usage is more and more a problem. In addition, we have found that there is seldom enough SNR in neuroimaging data to require float64.

note that we can use img.get_fdata(dtype='float32')

matthew-brett commented 4 years ago

The BIAP is the main source, but there's a lot of discussion on the linked mailing list thread, and a little more at https://mail.python.org/pipermail/neuroimaging/2017-April/001383.html

GaelVaroquaux commented 4 years ago

I think this is the only instance we have of a 'moving-target' API, at least, of any significance, but I'm happy to be corrected.

I think that you are correct in the sense that the moving target that has been bothering me over the years is how to retrieve the array data of a NiftiImage. However, for our uses this is the main API of nibabel that we use.

My memory is very blurry, as this problem has not been my main focus over the year. However, if I remember right, we started by img.data being the recommended way of accessing data, then moved to img.get_data and now img.get_fdata. I also remember _data, and dataobj, but I do not remember how they relate to the other APIs. The point being that over the years, we have had difficulties giving a consistent and simple story to the users on how to retrieve the data.

matthew-brett commented 4 years ago

Yes, that would be rather moving-targetey - but I don't think that's right - it was always .get_data() - see relevant line from first release in 2010. There was a private attribute _data, that was obviously not recommended, and later we added the dataobj attribute as a lower-level public API, for them as wanted efficient access of various sorts - see https://nipy.org/nibabel/images_and_memory.html - but we always recommended .get_data, up until the change to get_fdata.

effigies commented 4 years ago

Hi Gael, Jerome, thanks for responding. I really appreciate it.

I appreciate that such a major deprecation is painful, and we do (and will) work hard to avoid this as much as possible. I'm not convinced that there's a compelling case to undeprecate get_data(), and will proceed for now with the deprecation as planned. Given that you've already separated out get_data(), reverting that seems gratuitous, and keeping it in nilearn will allow you to improve on the old nibabel API when it comes to issues like nilearn/nilearn#2209.

To try to communicate better going forward, I wonder if you have any suggestions as to how we can make the documentation clearer? If nilearn developers are having issues finding what you need, I'm sure there are many users who are having greater difficulties.

And if there are parameters we can tweak in our deprecation process that would make life easier when we do have deprecations, please let me know.

As one of the dominant layers between users and directly using the library, I'd like to invite you to open issues here or on the neuroimaging@python.org mailing list if you have suggestions for general API.

GaelVaroquaux commented 4 years ago

Yes, that would be rather moving-targetey - but I don't think that's right - it was always .get_data()

You must be right. I am sorry that I was confused. We were probably using the wrong API in the beginning.

pauldmccarthy commented 4 years ago

I don't have any real opinions on the deprecation/removal, but I am interested in what you all think should be "best practice" for loading an image from disk, modifying it, and saving it back to disk.

I've always relied on the internal cache managed by DataobjImage objects, and returned by get_data. I've happily written code in the knowledge that both myself and nibabel have access to a single numpy array - this made modifying/saving easy - I could just modify the array directly, and call img.to_filename.

But now that this internal cache is being deprecated/removed, I can no longer do things in this way.

My workflow is probably now going to be (as already implied by @jeromedockes):

  1. Load an image file with nibabel
  2. Use np.asanyarray(img.dataobj) to get the data, and maintain my own reference to it
  3. Modify this data as needed
  4. Create a new nibabel image using the old header and the modified data, and save it to disk
  5. Discard the reference to the original nibabel image

So I'm just curious - is this what you would recommend?

(as a side note, my use case is to always want the data unscaled and in its native type, so the fact that get_fdata and its associated cache is not being removed is irrelevant).

jeromedockes commented 4 years ago
  1. Load an image file with nibabel
  2. Use np.asanyarray(img.dataobj) to get the data, and maintain my own reference to it
  3. Modify this data as needed
  4. Create a new nibabel image using the old header and the modified data, and save it to disk
  5. Discard the reference to the original nibabel image

So I'm just curious - is this what you would recommend?

I think that this is what I would do

effigies commented 4 years ago

If you always want unscaled data, then you should use img.dataobj.get_unscaled(), not np.array(img.dataobj) or img.get_data(), which will apply scale factors if present in the image.

You might want to wrap this with something like:

def get_unscaled(dataobj):
    if nb.arrayproxy.isproxy(dataobj):
        return dataobj.get_unscaled()
    return dataobj

But otherwise, your sequence is what I do. I'm actually somewhat surprised that modifying the cached result of get_data() and then calling to_filename on the same image object worked for you, since writing Analyze-like images (including NIfTI) reads the dataobj, not a cache:

https://github.com/nipy/nibabel/blob/d1518aa71a8a80f5e7049a3509dfb49cf6b78005/nibabel/analyze.py#L1001-L1012

pauldmccarthy commented 4 years ago

Cool, thanks for the clafiication. It looks like my previous approach (modifying the cached array from get_data) was fine with nibabel < 3, as to_file_map was using get_data:

https://github.com/nipy/nibabel/commit/c8d93e6067ec793c294d46fb1d4aee805b188a59

But it looks like I've been wrong about the scaling, although I don't often come across images with slope/inter parameters that aren't equal to 1/0...