juaml / junifer

Forschungszentrum Jülich Neuroimaging Feature Extractor
https://juaml.github.io/junifer
GNU Affero General Public License v3.0

[ENH]: Improve `BasePreprocessor` #310

Closed · synchon closed this 5 months ago

synchon commented 6 months ago

This PR refactors the `preprocess` method of preprocessors so that it no longer returns the "data type" and only returns the preprocessed input data. Having to return the "data type" forced a concrete preprocessor to operate on a single "data type" (like `BOLDWarper`, introduced in #267) and prevented the `on` parameter from being exposed to the user (as requested in #301). If a concrete preprocessor adds a new "data type" like `BOLD_mask` to the "junifer data object", as `fMRIPrepConfoundRemover` (introduced in #111) does, it will be handled as usual with no changes.

A preprocessor should not create new "data types" (which was allowed earlier, hence the restriction) but only create and add "helper data types" like `BOLD_mask`, which happens via the `extra_input` argument of the `preprocess` method.
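As a rough sketch of that contract, assuming a toy preprocessor (the class and helper names here are illustrative, not the actual junifer API):

```python
class MaskAddingPreprocessor:
    """Toy preprocessor: operates on one "data type" and adds a
    "helper data type" via extra_input (illustrative only)."""

    def preprocess(self, input: dict, extra_input: dict) -> dict:
        # Stand-in for real preprocessing: scale the values.
        input["data"] = [v * 2 for v in input["data"]]
        # Add a "helper data type" by mutating extra_input in place;
        # since extra_input is not copied, the addition is kept downstream.
        extra_input["BOLD_mask"] = {"data": [1] * len(input["data"])}
        # Only the preprocessed input is returned -- no "data type" string.
        return input


# Usage: the helper type ends up in `extra`, not in the return value.
bold = {"data": [1.0, 2.0, 3.0]}
extra = {}
out = MaskAddingPreprocessor().preprocess(bold, extra)
```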

github-actions[bot] commented 6 months ago

PR Preview Action v1.4.7: Preview removed because the pull request was closed. 2024-03-13 08:50 UTC

fraimondo commented 6 months ago

I think I have an issue with this PR.

1) How was the `BOLD_mask` kept even before this PR? I checked the `fMRIPrepConfoundRemover` and all we do is add it to `extra_input`. But how is it kept? Is this actually working?

2) Not allowing the preprocessor to "create new data types" is a bit strong. In fact, there's no difference between "new data types" and "helper data types".

synchon commented 6 months ago
  1. How was the `BOLD_mask` kept even before this PR? I checked the `fMRIPrepConfoundRemover` and all we do is add it to `extra_input`. But how is it kept? Is this actually working?

Since the input is not copied for `extra_input`, it's kept; but if we copied it, like we do for markers, it wouldn't be kept.

  2. Not allowing the preprocessor to "create new data types" is a bit strong. In fact, there's no difference between "new data types" and "helper data types".

The reason is that `BOLD` or `T1w` would be a "data type" for me but not `BOLD_mask`. But we are on the same page in our understanding.

synchon commented 6 months ago

We could have `extra_input` be returned from the `preprocess` method to make it explicit, though. What do you think?

fraimondo commented 6 months ago

Check this out:

This is how we "add" the BOLD_mask:

https://github.com/juaml/junifer/blob/a7bd0f6449fe9c2298501ce2d6a27349ab850190/junifer/preprocess/confounds/fmriprep_confound_remover.py#L575-L580

But this is how we treat the input in the _fit_transform function:

https://github.com/juaml/junifer/blob/a7bd0f6449fe9c2298501ce2d6a27349ab850190/junifer/preprocess/base.py#L173-L192

So basically we have `out`, which is `input`, which is also `extra_input`. We then pop the type, modify the `extra_input` and re-add the type to `out`.

This is madness. My madness, but madness still.

I don't know how to tackle this to make it proper.

Problems we currently have:

1) We are splitting the input, which is also the `out` variable, in two for no reason.
2) We are giving the `preprocess` function all the data (`input`), even if it does not need it.
3) We are "adding" new "data types" by modifying the `extra_input` variable.
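The aliasing described above can be reproduced with plain dicts (a toy sketch of the behavior, not the actual junifer code):

```python
# out, input and extra_input all reference the SAME dict, so mutating
# extra_input "magically" adds new keys to out as well.
out = {"BOLD": {"data": "bold-data"}}
input = out                       # no copy: same object as out
data = input.pop("BOLD")          # pop the type being processed
extra_input = input               # what's left doubles as extra_input
extra_input["BOLD_mask"] = {"data": "mask"}  # "add" a new "data type"
out["BOLD"] = data                # re-add the type to out
# out now contains both BOLD and BOLD_mask, purely via aliasing.
```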

A more declarative way would be to:

1) Only pass the input and the relevant `extra_input` to the `preprocess` function (not all the data)
2) Use the returned values to update the `out`. Maybe by getting a dictionary of key/value pairs?

synchon commented 6 months ago

A more declarative way would be to:

  1. Only pass the input and the relevant extra_input to the preprocess function (not all the data)
  2. Use the returned values to update the out. Maybe by getting a dictionary of key/value pairs?

I agree with the steps and it's similar to what I proposed in the earlier comment.

fraimondo commented 6 months ago

Perfect!

synchon commented 6 months ago


  • At the beginning of `_fit_transform`, we could copy the input to `out` like we do for `DataReader` so that we don't lose anything.
  • Then from `preprocess` we return the first value as the input dict (as we do now).
  • The second value from `preprocess` could be a "helper data type(s)" dict or `None`, and we check for it in `_fit_transform`. If it is `None`, we don't do anything; else we update `out`.
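A minimal sketch of that flow, assuming `preprocess` returns a `(data, helpers-or-None)` tuple (names and signatures are illustrative, not the final junifer API):

```python
from copy import deepcopy


class MaskProducingPreprocessor:
    """Toy preprocessor returning (processed data, helper types or None)."""

    def preprocess(self, data, extra_input):
        processed = {"data": "cleaned-" + data["data"]}
        helpers = {"BOLD_mask": {"data": "mask"}}  # or None if nothing to add
        return processed, helpers


def fit_transform(input, preprocessor, on="BOLD"):
    # Copy the input to out up front (as DataReader does) so nothing is lost.
    out = deepcopy(input)
    # Pass only the relevant parts, not the whole data object.
    extra_input = {k: v for k, v in input.items() if k != on}
    data, helpers = preprocessor.preprocess(input[on], extra_input)
    out[on] = data               # first return value: the processed data
    if helpers is not None:      # second return value: helper types or None
        out.update(helpers)
    return out


out = fit_transform({"BOLD": {"data": "bold"}, "T1w": {"data": "t1w"}},
                    MaskProducingPreprocessor())
```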

@fraimondo The latest commits should implement this.

codecov[bot] commented 6 months ago

Codecov Report

Attention: Patch coverage is 96.96970% with 1 line in your changes missing coverage. Please review.

Project coverage is 88.95%. Comparing base (6ff3fd3) to head (5668e63).

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main     #310      +/-   ##
==========================================
+ Coverage   88.90%   88.95%   +0.05%
==========================================
  Files         103      103
  Lines        4615     4610       -5
  Branches      868      868
==========================================
- Hits         4103     4101       -2
+ Misses        368      366       -2
+ Partials      144      143       -1
```

| Flag | Coverage Δ | |
|---|---|---|
| junifer | `88.95% <96.96%> (+0.05%)` | :arrow_up: |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=juaml#carryforward-flags-in-the-pull-request-comment) to find out more.

| Files | Coverage Δ | |
|---|---|---|
| junifer/preprocess/base.py | `87.50% <100.00%> (+6.25%)` | :arrow_up: |
| .../preprocess/confounds/fmriprep_confound_remover.py | `98.74% <100.00%> (-0.04%)` | :arrow_down: |
| junifer/preprocess/bold_warper.py | `38.88% <66.66%> (ø)` | |
synchon commented 5 months ago

Tests are not passing yet, but code is OK from my side.

It was a network issue when downloading assets, nothing on our side. Let's wait and see.