nipreps / eddymotion

Open-source eddy-current and head-motion correction for dMRI.
https://nipreps.org/eddymotion
Apache License 2.0
13 stars 16 forks source link

HDF5-backed data structure #145

Open oesteban opened 6 months ago

oesteban commented 6 months ago

Eddymotion uses large datasets and continuously needs to update them in memory, keep a large list of transforms, etc.

For that reason, we created the DWI object (which we are now generalizing so the same approach can be applied to PET or to BOLD). The idea is to have HDF5 to help minimize the memory fingerprint without damaging I/O too much.

I guess we want to get some sort of "glorified" ArrayProxy (using nibabel's jargon) to the details around a DWI dataset (data themselves, NIfTI headers readily available to generate NIfTIs for third-party tools, gradients' table, brain mask, head-motion realignment matrices, fieldmap, etc.)

Giving a common interface to all those aspects of the DWI dataset would definitely make development faster.

I would like to pick @effigies' and @arokem's brains and see how this would be implemented.

At the moment, what we have does not seem to achieve the flexibility and seamless operation that we want.

EDIT:

In #98, we are removing data splitting from the data structure object (which I think makes a lot of sense) and this problem has become apparent:

https://github.com/nipreps/eddymotion/blob/c073fdfaecc1243321cd74c72ef0c7a0d577f8da/src/eddymotion/data/dmri.py#L77-L129

In particular, my hesitations emerge from having to do this bit:

https://github.com/nipreps/eddymotion/blob/c073fdfaecc1243321cd74c72ef0c7a0d577f8da/src/eddymotion/data/dmri.py#L101-L105

every time from outside methods if we want to take advantage of the HDF5 backing.

EDIT 2:

In the long term, and if this vision works well, I can see this kind of "internal format" be propagated across nipreps and even fitlins.

WDYT?

oesteban commented 6 months ago

@teresamg, you've wrestled quite a bit with the DWI object -- what is your experience?

oesteban commented 6 months ago

@jhlegarreta & @esavary, do you have any experience with or opinion on this problem?

esavary commented 6 months ago

@jhlegarreta & @esavary, do you have any experience with or opinion on this problem?

I don't have much experience and no strong opinion, but is there a reason to not encapsulate the HDF5 backing inside the DWI object?

oesteban commented 6 months ago

is there a reason to not encapsulate the HDF5 backing inside the DWI object?

Not really. This is why I mentioned in #98 that I was not sure anymore of how this should be implemented.

We can merge #98 and roll that bit back afterward when we have a plan for this issue. I'd like to know Chris' and Ariel's opinions on #98 before we merge it though.

jhlegarreta commented 6 months ago

@jhlegarreta & @esavary, do you have any experience with or opinion on this problem?

I see the memory fingerprint concern that you mention, and I do see the HDF5 data manipulation bit that you point to as being a burden for users.

I have used HDF5 files in the past inside some wrapper class to actually provide the data.

In the long term, and if this vision works well, I can see this kind of "internal format" be propagated across nipreps and even fitlins.

My fear there is to propose yet another format that has is difficult to maintain, difficult to use across all nipreps, has a difficult fit in e.g. BIDS (maybe I'm mixing things with this last), etc. But maybe what you say about Nibabel doing something similar is already a fact (do not know enough the inner workings of Nibabel's ArrayProxy).

Also I do not know enough about memaps as another possibility.

As for the PRs I do not know the code well enough so as to weigh whether they should be merged or held now: whatever is easiest to make progress, so probably merging and leaving the HDF5 issue to be fixed for whenever it really becomes an issue. If that time has come :upside_down_face:, then I'd need to have a better grasp of the whole picture to weigh in. Apologies for that.

oesteban commented 5 months ago

@effigies - we would like to pick your brain on this one.

The design idea behind this is described in the "nipreps book": https://www.nipreps.org/nipreps-book/tutorial/intro.html#step-1-identify-an-i-o-inputs-outputs-specification

We basically wanted to have a data structure that could hold all the relevant aspects of a DWI dataset in the context of the problem. That meant it should have easy ways of storing and accessing:

The rationale to use HDF5 was to allow Python to clean up hefty objects in memory while having an easy interface to the object. In my imagination, this should work like an overlay built-in in the data structure so that the user does not need to bother about HDF5 at all (it is just used in the background).

Maybe there were better solutions like the ArrayProxy @jhlegarreta mentions above, or maybe we should be thinking about Zarr instead. I'm happy to go in any direction that makes eddymotion work better.

Let us know what you think.

arokem commented 5 months ago

Just adding that I've had some recent interesting positive experiences with Apache Arrow (see this for some examples/details). I am not sure that it's a great match for this use-case, but thought I'd mention this here as well, to add to the options we could consider.