Error in harmonizationApply due to site differences between training and testing data

Yuvi-416 commented 2 months ago

Hey there,

Thank you for your exciting work.

I recently used this package for my work, and now I am a little bit confused about how to use it or whether the process I used is correct. I am doing a regression analysis, and the target prediction is Age. So, I collected healthy data from nine different sites, and then we had patients' data from two different sites.

The first question I want to ask is, do I need to apply harmonization to the testing data as well? I have already applied harmonization to the training data using the function below:

    combat_model, features_train_combat = harmonizationLearn(features_train, covariates_train)  # smooth_terms=['Age']

When I pass the combat_model returned by the function above to the testing data, I get an error:

    features_test_combat = harmonizationApply(features_test, covariates_test, combat_model)

    ERROR: IndexError: index 4 is out of bounds for axis 1 with size 4

I think I am getting this error because of the site difference between the training and testing data: the training sample has nine sites, and the testing data has two sites. So my next question is, do I need to apply harmonization separately to the training and testing datasets?

I could not find any information about how to proceed or whether we should apply harmonization to all (training and testing) at once or apply it separately.

Can you suggest what I should do?

Thank you!

rpomponio commented 2 months ago

Thanks for submitting this issue. It's crucial to know whether your two testing sites are included within the nine training sites, or if the sites are entirely disjoint between training & testing.

Yuvi-416 commented 2 months ago

Hi there, thanks for the reply. No, the testing sites are not included in the training sample. I first used the harmonizationLearn function to harmonize the training data, whose batch labels are 1-9 because the data come from 9 databases, one number per database. I then passed the combat_model learned from the training data to harmonizationApply for the testing data, whose batch labels are 1-3.

After that, I get the error mentioned above.

I found the following passage in the documentation, which pretty much explains why I am getting the error: “Next, prepare the holdout data on which you will apply the model. This data must look exactly like the training data for harmonizationLearn, including the same number and order of covariates. If the holdout data contains a different number of sites, an error will be thrown.”
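A check along the following lines, run before calling harmonizationApply, would surface a site mismatch explicitly instead of through an IndexError. This is a minimal sketch using pandas only: the helper name unseen_sites and the example site labels are hypothetical and not part of neuroHarmonize, and it assumes the covariates are DataFrames with a SITE column as described in this thread. Note that both files here number their sites from 1, so the labels would first need to be made unique (for example by prefixing them with the dataset name) for such a comparison to be meaningful.

```python
import pandas as pd


def unseen_sites(covariates_train, covariates_test, site_col="SITE"):
    """Return sorted site labels that occur in the test covariates but not in training."""
    train_sites = set(covariates_train[site_col].unique())
    test_sites = set(covariates_test[site_col].unique())
    return sorted(test_sites - train_sites)


# Toy covariates with made-up, already-unique site labels (assumption for illustration).
covariates_train = pd.DataFrame(
    {"SITE": ["train_1", "train_2", "train_3"], "Age": [25.0, 40.0, 61.0]}
)
covariates_test = pd.DataFrame(
    {"SITE": ["test_1", "train_2"], "Age": [33.0, 52.0]}
)

missing = unseen_sites(covariates_train, covariates_test)
if missing:
    print(f"Sites not seen during harmonizationLearn: {missing}")
```

If the returned list is non-empty, the learned model has no parameters for those sites, which is the situation discussed in this issue.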

However, I would like to know what a feasible way to perform data harmonisation would be in my case. I have two CSV files, one for training and one for testing. Each file has a SITE column with numeric labels (1-9 for training and 1-3 for testing) and an Age column, which I use as a covariate.

I wanted to know whether I need to concatenate both files and harmonise them together, or harmonise each file separately.

rpomponio commented 2 months ago

This method is not designed to harmonize sites that are not part of the training data.

That said, you should be able to fix the error by including all sites in the training and testing sets. Typically, users will designate a subset of their data (i.e., healthy controls) to train the harmonization model. Then, they will apply the model to all data (i.e., patients and healthy controls).
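The workflow described above (learn on healthy controls, apply to everyone) could be prepared roughly as follows. This is a sketch under stated assumptions, not the package's own code: the helper select_controls, the variable names, and the toy data are hypothetical, features are assumed to be a NumPy array with one row per subject, and the control flag is kept outside the covariates table on the assumption that every non-SITE column would be treated as a covariate. The neuroHarmonize calls are shown only as comments, using the call pattern that appears in this thread.

```python
import numpy as np
import pandas as pd


def select_controls(features, covariates, is_control, site_col="SITE"):
    """Return (features, covariates) restricted to healthy controls.

    Raises ValueError if any site in the full data set has no controls,
    because the harmonization model could not be learned for that site.
    """
    all_sites = set(covariates[site_col])
    control_sites = set(covariates.loc[is_control, site_col])
    missing = sorted(all_sites - control_sites)
    if missing:
        raise ValueError(f"sites without healthy controls: {missing}")
    mask = is_control.to_numpy()
    return features[mask], covariates.loc[is_control].reset_index(drop=True)


# Toy data: 6 subjects, 2 features, 3 sites (all values are made up).
features_all = np.arange(12, dtype=float).reshape(6, 2)
covariates_all = pd.DataFrame(
    {"SITE": ["a", "a", "b", "b", "c", "c"],
     "Age": [30.0, 45.0, 52.0, 38.0, 61.0, 27.0]}
)
is_control = pd.Series([True, False, True, True, False, True])

features_hc, covariates_hc = select_controls(features_all, covariates_all, is_control)

# Hypothetical follow-up with neuroHarmonize (not executed here):
# from neuroHarmonize import harmonizationLearn, harmonizationApply
# combat_model, _ = harmonizationLearn(features_hc, covariates_hc)
# features_all_combat = harmonizationApply(features_all, covariates_all, combat_model)
```

The explicit check makes the limitation visible early: if a site contributes only patients, the helper raises before any model is fitted.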

If some of your sites only contain patients, then unfortunately this method will not be suitable. This is a known limitation of statistical harmonization methods and I am not aware of a method that appropriately addresses this situation.

I am going to close this issue, but please feel free to re-open it if you have further questions.

rpomponio / neuroHarmonize, issue #50