Including Variable of Interest in Covariate Matrix?

rpomponio / neuroHarmonize

Harmonization tools for multi-site neuroimaging analysis. Implemented as a python package. Harmonization of MRI, sMRI, dMRI, fMRI variables with support for NIFTI images. Complements the work in Neuroimage by Pomponio et al. (2019).

https://pypi.org/project/neuroHarmonize/

MIT License


Including Variable of Interest in Covariate Matrix? #16

Closed Saqibm128 closed 2 years ago

Saqibm128 commented 2 years ago

Hello,

I am using neuroHarmonize on a dataset that I want to use for a prediction task. My dataset consists of morphometric features that I have already extracted from ~15 scanner sites. I included age, sex, and intracranial volume as covariates, and I was going to include my target variable of diseased vs healthy as well in the covariate matrix.

I am training the neuroHarmonize model only on the training set and then applying the model to a validation set using the validation set's covariates.
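The learn-on-training, apply-to-validation workflow described here can be sketched with a deliberately simplified stand-in for harmonization. This is not ComBat and not the neuroHarmonize API: it only removes a per-site mean shift from a single feature, and the names `learn_site_offsets` and `apply_site_offsets` are hypothetical. What it illustrates is that every parameter is estimated from the training rows alone, and that a validation site with no training subjects cannot be adjusted.

```python
from statistics import mean

def learn_site_offsets(train_rows):
    """Estimate each site's shift from the mean of site means (training rows only)."""
    by_site = {}
    for site, value in train_rows:
        by_site.setdefault(site, []).append(value)
    site_means = {site: mean(values) for site, values in by_site.items()}
    grand_mean = mean(list(site_means.values()))
    return {site: m - grand_mean for site, m in site_means.items()}

def apply_site_offsets(rows, offsets):
    """Subtract the learned shifts; refuse sites that were absent from training."""
    unseen = {site for site, _ in rows} - set(offsets)
    if unseen:
        raise ValueError(f"no training subjects for site(s): {sorted(unseen)}")
    return [(site, value - offsets[site]) for site, value in rows]

# Toy data (illustrative only): (site, feature value)
train = [("A", 10.0), ("A", 12.0), ("B", 14.0), ("B", 16.0)]
validation = [("A", 11.0), ("B", 15.0)]

offsets = learn_site_offsets(train)            # fitted on training data only
validation_adj = apply_site_offsets(validation, offsets)
```

Nothing from the validation rows enters `offsets`, which is the property that keeps the validation set untouched by the fitting step.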

Is it a mistake to include the target variable in the covariate matrix in this context? Thank you

rpomponio commented 2 years ago

Hello, thank you for submitting this issue as it raises a good question.

> I was going to include my target variable of diseased vs healthy as well in the covariate matrix

That is certainly one option. Alternatively, you could train the model only on healthy subjects and apply it to the rest of the dataset. Do you have enough healthy subjects to estimate the scanner effects on them alone?

Including diseased subjects in the harmonization process might raise concerns, particularly if disease effects are complicated and non-linear. In that case, the process of harmonization could confound the true disease effects you are trying to detect as an outcome. Ultimately, you will face a judgment call between including the diseased subjects or not, but it is best to be familiar with the downsides of both approaches.
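A toy calculation can make the trade-off concrete. The sketch below is not ComBat and does not use neuroHarmonize: it removes only a per-site mean shift from one synthetic feature, and the data, the assumed ground truth, and the function names are all hypothetical. It shows an extreme case in which the disease effect is not modeled at all, comparing site offsets learned from healthy subjects only against offsets learned from every subject when disease prevalence differs between sites.

```python
from statistics import mean

# Synthetic single-feature data: (site, diseased, value).
# Assumed ground truth: value = 100 + site_shift - 10 * diseased,
# with site_shift = 0 at site A and +4 at site B.
# Site A is mostly healthy, site B is mostly diseased.
rows = [
    ("A", 0, 100.0), ("A", 0, 100.0), ("A", 0, 100.0), ("A", 1, 90.0),
    ("B", 0, 104.0), ("B", 1, 94.0), ("B", 1, 94.0), ("B", 1, 94.0),
]

def learn_site_offsets(reference_rows):
    """Per-site shift relative to the mean of site means, from the given rows."""
    by_site = {}
    for site, _, value in reference_rows:
        by_site.setdefault(site, []).append(value)
    site_means = {s: mean(v) for s, v in by_site.items()}
    grand_mean = mean(list(site_means.values()))
    return {s: m - grand_mean for s, m in site_means.items()}

def disease_effect_after(offsets):
    """Diseased-minus-healthy mean difference after removing the site offsets."""
    adjusted = [(d, value - offsets[site]) for site, d, value in rows]
    healthy = [v for d, v in adjusted if d == 0]
    diseased = [v for d, v in adjusted if d == 1]
    return mean(diseased) - mean(healthy)

healthy_only = learn_site_offsets([r for r in rows if r[1] == 0])
all_subjects = learn_site_offsets(rows)

effect_healthy_ref = disease_effect_after(healthy_only)  # -10.0
effect_all_ref = disease_effect_after(all_subjects)      # -7.5
```

On this toy data the healthy-reference offsets recover the assumed disease effect of -10, while the offsets learned from all subjects shrink it to -7.5, because part of the disease difference is absorbed into the site terms. Including diagnosis as a covariate is meant to prevent exactly this, but it only helps to the extent that the model captures the form of the disease effect.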

Saqibm128 commented 2 years ago

I am trying out both approaches so far, though I might have jumped the gun with neuroHarmonize... I might still be preprocessing some of my data, and may need to rerun analyses.

Thank you!

rpomponio commented 2 years ago

Thank you for the update, @Saqibm128. I am going to close this issue, but feel free to re-open it at any time.

rafaloz commented 1 year ago

Hello!

First of all, thanks for your paper and this implementation! I have a question regarding the paper https://doi.org/10.1016/j.neuroimage.2019.116450:

Although brain age prediction is not the main objective of the paper, don't you have the same issue? From what I understand, you include age as a covariate and then predict age. Wouldn't that be data leakage? Or do you first learn the transformation on the training data and then apply it to the test data?

pauldhami commented 9 months ago

Greetings,

I'm wondering whether there is a more recent consensus on this topic. I have a dataset in which neuroimaging data were collected from patients at multiple sites. My goal is to combine all the data to conduct some regression analyses, using the neuroimaging data plus other covariates as my predictor variables and illness severity, based on a clinical scale, as my outcome variable (so not in a machine learning/cross-validation context). I want to use neuroHarmonize, but I am unsure whether the clinical scale data should be included as a covariate. Most of the papers I've come across have included clinical status (e.g., healthy controls vs. patients) as a biological covariate, so intuitively I feel that severity of illness would be okay too, but admittedly I am not sure.

Any thoughts would be greatly appreciated.

rpomponio commented 8 months ago

Great question, @pauldhami. There is only so much I can say as a non-expert in your field. I would ask whether you expect the clinical scale to impact anatomical features in a linear fashion. If so, it would be reasonable to include the clinical data as covariates and harmonize with all subjects.

Otherwise, I would try to select a "reference" group of subjects who express sub-clinical status. Harmonize with these subjects only (and do not include clinical status as a covariate), then apply the model to the entire cohort.