More than two ancestries

diegoortunes commented 4 months ago

Hi, I'm testing GAUDI with more than two ancestries. To check this before using real data, I manually edited the msp files from the toy data, changing some of the 0 and 1 codes to 2 or 3. I am not sure the outputs were as expected, given that the results were the same for both three and four ancestries. I want to confirm whether this part of the script prevents the expected results in this experiment:

#Add the HAP1 HDS dosage values from AFR haplotypes to the AFR output column.
la_var.loc[row_names[0], list(rf_vals_hap1 == 0)] = np.add(la_var.loc[row_names[0],list(rf_vals_hap1 == 0)],
                            list(it.compress(hds_hap1, list(rf_vals_hap1 == 0))))

#Add the HAP2 HDS dosage values from AFR haplotypes to the AFR output column.
la_var.loc[row_names[0], list(rf_vals_hap2 == 0)] = np.add(la_var.loc[row_names[0], list(rf_vals_hap2 == 0)],list(it.compress(hds_hap2, list(rf_vals_hap2 == 0))))  

#Add the HAP1 HDS dosage values from EUR haplotypes to the EUR output column.
la_var.loc[row_names[1], list(rf_vals_hap1 == 1)] = np.add(la_var.loc[row_names[1], list(rf_vals_hap1 == 1)],list(it.compress(hds_hap1, list(rf_vals_hap1 == 1))))

#Add the HAP2 HDS dosage values from EUR haplotypes to the EUR output column.
la_var.loc[row_names[1], list(rf_vals_hap2 == 1)] = np.add(la_var.loc[row_names[1], list(rf_vals_hap2 == 1)],list(it.compress(hds_hap2, list(rf_vals_hap2 == 1))))

Thank you in advance.

quansun98 commented 4 months ago

Thank you for your question. The direct answer to whether this part of the script prevents the expected results in this experiment is yes. The condition rf_vals_hap1 == 0 only matches local ancestry code 0 and ignores every other code. The same holds for rf_vals_hap1 == 1, which only matches code 1. Haplotypes with any other ancestry code (such as 2 or 3) are never added to either output row.
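To illustrate why the three- and four-ancestry outputs looked the same, here is a minimal sketch on made-up values (not the GAUDI toy data). The function `split_dosage` and the row labels `anc0`, `anc1`, ... are hypothetical names for illustration, not part of the GAUDI scripts. It applies the same masking logic as the quoted code, but loops over however many ancestry codes are requested. Note that this only changes the dosage-splitting step; it does not make the downstream model support more than two ancestries.

```python
import numpy as np
import pandas as pd

# Toy inputs (hypothetical values, one entry per sample).
samples = ["s1", "s2", "s3", "s4"]
rf_vals_hap1 = np.array([0, 1, 2, 3])        # local ancestry code, haplotype 1
rf_vals_hap2 = np.array([0, 0, 1, 2])        # local ancestry code, haplotype 2
hds_hap1 = np.array([1.0, 0.5, 1.0, 0.25])   # haplotype 1 dosage
hds_hap2 = np.array([0.0, 1.0, 0.5, 1.0])    # haplotype 2 dosage


def split_dosage(n_ancestries):
    """Accumulate haplotype dosages into one row per ancestry code (illustrative helper)."""
    out = np.zeros((n_ancestries, len(samples)))
    for k in range(n_ancestries):
        for rf_vals, hds in ((rf_vals_hap1, hds_hap1), (rf_vals_hap2, hds_hap2)):
            mask = rf_vals == k          # True only where the haplotype has code k
            out[k, mask] += hds[mask]    # codes other than k are left untouched
    rows = [f"anc{k}" for k in range(n_ancestries)]
    return pd.DataFrame(out, index=rows, columns=samples)


two = split_dosage(2)    # equivalent to the == 0 and == 1 masks only
four = split_dosage(4)   # also captures codes 2 and 3

total_dosage = hds_hap1.sum() + hds_hap2.sum()
dropped_with_two = total_dosage - two.values.sum()
```

With only the `== 0` and `== 1` masks, the rows for codes 0 and 1 are identical no matter how many extra codes appear in the msp file, and the dosage carried by haplotypes coded 2 or 3 (`dropped_with_two` above) is silently left out.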

Getting back to your ultimate goal of testing GAUDI on more than two ancestries: it is not simply a matter of changing or adding ancestry codes. Theoretically, the penalty matrix will be different, and there are multiple ways to specify it. Practically, it increases the number of parameters to infer, which will likely make the model more computationally challenging and probably less stable as well. We are currently working on modifying the underlying model so that it allows more than two ancestries while being more computationally efficient. We will share an update once it's ready.

Thanks, Quan

HelenYSLin commented 3 months ago

I have a related question: the local ancestry in my MSP files is coded as CEU=0, YRI=1 (I manually removed this first line to get GAUDI to run). However, it seems that GAUDI assumes 0=AFR and 1=EUR. Will this inconsistency impact PRS performance?

quansun98 commented 3 months ago

It will not affect model training or PRS performance, as long as your training and testing individuals have consistent local ancestry codings (i.e., CEU = 0 and YRI = 1 for both training and testing sets). But it will affect the interpretation: the "AFR" ancestry-specific weights will actually be "CEU" in your case.
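One way to check that consistency before training is to compare the code line at the top of each MSP file. The sketch below assumes the first line looks like "#Subpopulation order/codes: CEU=0	YRI=1" (tab-separated NAME=code pairs after the colon); `parse_codes` and the two header strings are hypothetical examples, not part of GAUDI.

```python
def parse_codes(header_line):
    """Map population name to integer code from an MSP code line (illustrative helper)."""
    _, _, body = header_line.partition(":")
    pairs = (item.split("=") for item in body.split())
    return {name: int(code) for name, code in pairs}


# Made-up header lines for a training and a testing MSP file.
train_header = "#Subpopulation order/codes: CEU=0\tYRI=1"
test_header = "#Subpopulation order/codes: YRI=0\tCEU=1"

train_codes = parse_codes(train_header)
test_codes = parse_codes(test_header)

# False here: the two files assign opposite codes to CEU and YRI.
codings_match = train_codes == test_codes
```

If `codings_match` is false, the ancestry-specific weights learned on the training set would be applied to the wrong ancestry in the testing set.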

You can also add skiprows=1 when reading local ancestry files, i.e., rf = pd.read_csv(args.local_ancestry, skiprows=1, sep = "\t") in the local ancestry conversion code to avoid manually removing the first line. Sorry about the inconvenience!
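As a quick illustration of the skiprows=1 suggestion, here is a small sketch that reads a made-up MSP-like table from memory instead of from args.local_ancestry. The column names and values are assumptions for the example; only the pd.read_csv call with skiprows=1 and sep="\t" comes from the comment above.

```python
import io

import pandas as pd

# A made-up two-segment file: a code line, a column header line, then data rows.
toy_msp = (
    "#Subpopulation order/codes: CEU=0\tYRI=1\n"
    "#chm\tspos\tepos\tsgpos\tegpos\tn snps\tind1.0\tind1.1\n"
    "22\t100\t200\t0.10\t0.20\t5\t0\t1\n"
    "22\t201\t300\t0.20\t0.30\t7\t1\t1\n"
)

# skiprows=1 drops the code line, so the second line becomes the column header.
rf = pd.read_csv(io.StringIO(toy_msp), skiprows=1, sep="\t")

# The per-haplotype local ancestry calls are the columns after the first six.
hap_columns = list(rf.columns[6:])
```

Without skiprows=1, pandas would treat the code line as the header, which is why removing it by hand was needed before.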

quansun98 / GAUDI

More than two ancestries #3