Correcting for site as well as the effect of covariates on NIfTI images

parekhpravesh commented 3 years ago

Hello,

I want to use neuroHarmonize on NIfTI images to simultaneously correct for site and the effect of covariates (TIV, age, and sex). I have followed the examples on the intro page to read my data in and create a mask. To learn the model parameters, the example mentions:

my_model, nifti_array_adj = nh.harmonizationLearn(nifti_array, covars)

However, since I also want to correct for additional covariates, I used

my_model, nifti_array_adj, s_data = nh.harmonizationLearn(nifti_array, covars, return_s_data=True)

After that, I saved the model and applied it to the same set of files from which I had learned the model parameters. However, the resulting NIfTI images do not look corrected for the effect of covariates; rather, there seems to be only a small change in voxel intensities (which I assume is the site effect).

Therefore, my question is: shouldn't we be writing out s_data because that is corrected for both site as well as the effect of covariates? I did try this and the resulting NIfTI images look like residuals images typically obtained after linear regression (which seems reasonable).

Thank you for your time and help!

rpomponio commented 3 years ago

Hello!

I'm appreciative of your questions and willingness to use the neuroHarmonize package! It seems like an interesting use case that I did not originally consider, and I have a few comments:

1. What are your covariates? And do you expect them to have strong effect(s) on the individual voxel intensities? For example, I might expect age to have an effect on voxel intensity, but only in cohorts where the age range is quite wide (e.g. 20+ years).
2. If you simply want to regress out all covariates including site, you could run a linear regression (as mentioned) and obtain residualized images "free" of site and covariate effects. I'd actually prefer to do this if I did not need to specially harmonize the data.
3. Lastly, you could use neuroHarmonize as part of a two-step procedure. First, remove the site effects but preserve the covariate effects (this is the intended use case of my package). Second, regress out the remaining covariate effects. Be sure to examine results from both steps! I have had success with this in my previous lab.
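
Editor's sketch of the second step of the two-step procedure described above (regressing the remaining covariate effects out of already-harmonized data). It is a minimal NumPy illustration, not code from neuroHarmonize: the helper `regress_out` is hypothetical, and the `harmonized` array is synthetic data standing in for the harmonized subjects-by-voxels array that step one would produce.

```python
import numpy as np

def regress_out(data, covariates):
    """Return OLS residuals of data (subjects x voxels) on covariates (subjects x k).

    Hypothetical helper for illustration; not part of neuroHarmonize.
    """
    data = np.asarray(data, dtype=float)
    covariates = np.asarray(covariates, dtype=float)
    # Design matrix: an intercept column followed by the covariates.
    design = np.column_stack([np.ones(len(covariates)), covariates])
    # One least-squares fit for all voxels; beta has shape (k + 1, voxels).
    beta, *_ = np.linalg.lstsq(design, data, rcond=None)
    return data - design @ beta

# Synthetic stand-in for the harmonized array (assumed output of step one).
rng = np.random.default_rng(0)
n_subjects, n_voxels = 60, 5
age = rng.uniform(20, 70, n_subjects)
sex = rng.integers(0, 2, n_subjects)
tiv = rng.normal(1500, 120, n_subjects)
covariates = np.column_stack([age, sex, tiv])
harmonized = (0.5 - 0.002 * age[:, None] + 0.0001 * tiv[:, None]
              + rng.normal(0, 0.02, (n_subjects, n_voxels)))

# Step two: residualize each voxel with respect to age, sex, and TIV.
residuals = regress_out(harmonized, covariates)
```

Because the design includes an intercept, the per-voxel mean of `residuals` should be zero up to floating-point error, which is one quick sanity check on this step.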

Please keep me updated with your ideas and progress. I'm very interested in your experience with the software and if necessary, I would consider modifying the package for a new use case in the future.

-Ray

parekhpravesh commented 3 years ago

Hello,

Thank you for your reply!

My covariates are total intracranial volume (TIV), age, and sex (female - yes/no)
The images I am working with are GM VBM images where we expect a significant percent of variance to be explained by TIV, and to a certain extent by age and sex (although some of the variance will be shared between TIV and sex).

To give you some more context, we have imaging data from three different sites/scanners. The goal is to perform a VBM analysis between two groups of patients (chronic and recent patients). There is, however, a strong correlation between age and duration of illness (which is the variable of interest). Therefore, our plan is to estimate the effect of age from healthy subjects and control for the effect of age in the patient sample (by applying this model). Since site/scanner, TIV, and sex are also going to account for some of the variance in the GM, we thought of using neuroHarmonize to learn how all these variables affect GM in a single model and then apply it to patient data. The "residual" images after applying the model would then be free of the confounding effect of these variables and could then be used for further analyses.

If I understand your second point correctly, you are suggesting that it will be better to model all covariates, including site, together as a linear regression model and obtain residuals? Perhaps I misunderstand your point...but wouldn't using harmonization better account for site effects?
Yes indeed. That's certainly one way forward. I see that you have used a similar approach in the NeuroImage paper where post harmonization, the effect of sex and ICV were regressed!

I will try a few of these permutations and reply with the results. Appreciate your time and help!

Regards Pravesh

parekhpravesh commented 3 years ago

Hello, here are some observations from various experiments:

1. Harmonization works well and I am able to use the harmonized images for a second-level linear regression to remove the non-site confounding effects; i.e., the third option is an easy way forward.
2. If I write out the s_data variable as NIfTI images, rather than writing out nifti_array_adj as is the case in the function applyModelNIFTIs, the images look comparable to a typical residual image from linear regression; however, these are not the same as the residuals obtained on performing linear regression on harmonized images. To compare these further, I looked at the mean of the residuals for a few voxels; for a given voxel, the mean of the s_data images across subjects is ~4.0125e-04 while for the same voxel, the mean of the residuals (after linear regression on harmonized data) is ~2.7117e-11. A similar trend exists across other voxels, where the mean of the s_data is close to zero but larger than the mean of the residuals image. Perhaps these relatively subtle differences are introduced as a result of standardization (while I just mean-centered my covariates)?
3. It appears that the threshold used for making the mask leads to some differences in the estimate of harmonization; the harmonized images from a threshold of zero and 0.1 have slightly different intensities for all voxels. I would assume that the estimate of model parameters would be performed independently for each voxel and therefore the mask threshold should not have any effect on voxels which were inside the mask in both cases; perhaps I am missing something here?
4. I will soon be working with a relatively large dataset in a machine learning framework; I am interested in applying harmonization (~15 scanners/sites) in a cross-validated manner; however, I am using MATLAB for my pipeline. It seems inefficient to pass a list of NIfTI images to neuroHarmonize in each cross-validation iteration, wait for the saved harmonized NIfTI images to be written, then read them back in MATLAB for further processing. Would you have any suggestions on avoiding reading/writing images multiple times? An alternative would be to write out flattened NIfTI from MATLAB and pass that straight to neuroHarmonize; however, the reading of CSV files becomes a bottleneck for high-resolution images (with several thousand voxels).

Look forward to your thoughts. Thank you

Regards Pravesh

rpomponio commented 3 years ago

Hi Pravesh,

Glad to see you are progressing with the package! I believe this may help to clarify some of your points of inquiry:

1. The "third option" sounds promising to me and I would continue along that path as the main analytical pipeline.
2. Yes, the means will be near zero because residuals are distributed around zero. The difference between the two means you cited is negligible to me, as they are both extremely small numbers, just artifacts of a standardization procedure.
3. No, the harmonization is not independent across voxels, and that is one advantage of running this method, which allows for information pooling. As you observed, different masking thresholds will yield different results, although the hope is that you are conservative with masking, i.e. leave in all voxels which may demonstrate some biological meaning.
4. I'm a little confused about the last point, but I would caution against mixing the harmonization and standardization steps with the ML steps. Why not handle the harmonization in one step for all images, then proceed with your ML framework?
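
Editor's sketch of why pooling information across voxels makes the result depend on the mask. This is a deliberately simplified normal-normal shrinkage, assumed here only for illustration; it is not the estimator neuroHarmonize or ComBat actually uses, and the function name, the offset values, and `sampling_var` are all hypothetical.

```python
import numpy as np

def shrunken_site_offsets(raw_offsets, sampling_var):
    """Shrink per-voxel site offsets toward their across-voxel mean.

    Simplified illustration of pooling; NOT the exact ComBat estimator.
    """
    raw_offsets = np.asarray(raw_offsets, dtype=float)
    # The prior is estimated from every voxel that is inside the mask.
    prior_mean = raw_offsets.mean()
    prior_var = raw_offsets.var(ddof=1)
    # Weight given to each voxel's own estimate versus the pooled mean.
    weight = prior_var / (prior_var + sampling_var)
    return weight * raw_offsets + (1.0 - weight) * prior_mean

# Hypothetical raw site offsets for five voxels.
raw = np.array([0.10, 0.12, 0.08, 0.50, 0.45])

# Liberal mask: all five voxels contribute to the pooled prior.
all_voxels = shrunken_site_offsets(raw, sampling_var=0.01)

# Stricter mask: only the first three voxels survive the threshold.
stricter_mask = shrunken_site_offsets(raw[:3], sampling_var=0.01)

# The three shared voxels receive different adjusted offsets in the two runs.
shared_differs = not np.allclose(all_voxels[:3], stricter_mask)
```

In this toy setting the first three voxels are present under both masks, yet their shrunken offsets change because the pooled prior is computed from a different set of voxels.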

parekhpravesh commented 3 years ago

Hello,

Yes, indeed. I am now using the third option as the main pipeline and everything seems to be working fine.

With regard to masking, I have been using a liberal threshold of zero and applying a stricter threshold later, for second-level analyses. The reason is that the mask I want to use is different from the mask created by neuroHarmonize. Hopefully the differences are subtle enough not to matter too much.

Regarding the last point, the rationale is to keep strict independence between the training and test sets during cross-validation. Instead of using the entire data for harmonization, it would be better to estimate the harmonization model parameters from the training set and then apply them to the test set (otherwise there is information leakage between the training and test sets). Of course, one could argue that this information leakage is not directly relevant to the target variable, but I think in general it is best to maintain strict separation between these two sets. The downside is that it is computationally much more expensive, as for every cross-validation fold a new harmonization model has to be learnt and then applied to the test set.
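
Editor's sketch of the fold-wise pattern described above: learn the site correction on the training rows only, then apply it to the held-out rows. To stay self-contained it uses a plain per-site mean offset as a stand-in, which is much simpler than ComBat. The two helper functions and the synthetic data are hypothetical; with neuroHarmonize, the learn and apply steps would presumably be `nh.harmonizationLearn` on the training rows and `nh.harmonizationApply` on the test rows.

```python
import numpy as np

def learn_site_offsets(data, sites):
    """Stand-in for model learning: per-site mean deviation from the grand mean."""
    grand_mean = data.mean(axis=0)
    return {s: data[sites == s].mean(axis=0) - grand_mean for s in np.unique(sites)}

def apply_site_offsets(data, sites, offsets):
    """Stand-in for model application: subtract each subject's learned site offset."""
    adjusted = data.astype(float).copy()
    for s, offset in offsets.items():
        adjusted[sites == s] -= offset
    return adjusted

# Synthetic subjects-by-voxels data with an additive site effect.
rng = np.random.default_rng(0)
n_subjects, n_voxels, n_folds = 90, 4, 3
sites = np.tile(np.arange(3), n_subjects // 3)
data = rng.normal(0, 1, (n_subjects, n_voxels)) + sites[:, None] * 0.5

# Assign each subject to exactly one fold.
fold_ids = rng.permutation(n_subjects) % n_folds
harmonized = np.empty_like(data)
for fold in range(n_folds):
    test = fold_ids == fold
    train = ~test
    # Parameters come from the training rows only, so nothing leaks from test.
    offsets = learn_site_offsets(data[train], sites[train])
    harmonized[test] = apply_site_offsets(data[test], sites[test], offsets)
```

This keeps everything in memory as arrays, so no images are written or re-read inside the loop. The sketch assumes every site appears in each training fold; a test subject from an unseen site would be left unadjusted here.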

rpomponio commented 3 years ago

This was a great discussion and I welcome re-opening the issue if there are additional questions!

rpomponio / neuroHarmonize

Correcting for site as well as the effect of covariates on NIfTI images #10