statgen / hds-util

Mozilla Public License 2.0

invalid genotypes #1

Open brettva opened 1 year ago

brettva commented 1 year ago

Thank you for developing this tool, it will be quite handy for us.

In my merges I have been getting invalid genotypes (e.g. 0/-44), in addition to a mixture of phased and unphased sites.
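A quick way to count how many records are affected is to classify each GT string. The snippet below is only a minimal sketch: `classify_gt` and its `n_alt` parameter are hypothetical names for illustration, not part of hds-util, and it assumes the GT values have already been pulled out of the VCF.

```python
import re

# One or more alleles (non-negative integer or "."), separated by "|" or "/".
GT_PATTERN = re.compile(r"^(\d+|\.)([|/](\d+|\.))*$")

def classify_gt(gt: str, n_alt: int = 1) -> str:
    """Classify a GT string as 'invalid', 'haploid', 'phased' or 'unphased'.

    n_alt is the number of ALT alleles at the site (assumed 1 for biallelic sites).
    """
    if not GT_PATTERN.match(gt):
        return "invalid"  # e.g. a negative allele index such as 0/-44
    alleles = re.split(r"[|/]", gt)
    if any(a != "." and int(a) > n_alt for a in alleles):
        return "invalid"  # allele index beyond the alleles listed at the site
    if len(alleles) == 1:
        return "haploid"
    # Treat a genotype as phased only if every separator is "|".
    return "phased" if "/" not in gt else "unphased"
```

Tallying the return values per site would show whether invalid and unphased genotypes cluster at particular records.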

I imputed some publicly available HGDP samples on MIS to demonstrate this issue here

Do you have any advice on how to proceed? Hopefully I am not doing something silly. Thanks

jonathonl commented 1 year ago

Thanks for providing this example. I'm not seeing the same output as you. What operating system are you running on (and what operating system did you compile hds-util on)? I'd like to reproduce your environment.

brettva commented 1 year ago

@jonathonl Thank you so much for getting back to me so fast. I compiled and ran it independently on both the csg and armis clusters and see the issue in both environments.

I am not sure what details would be most helpful for you, but here are a few:

csg:

lsb_release -a:

LSB Version:    core-11.1.0ubuntu2-noarch:security-11.1.0ubuntu2-noarch
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.6 LTS
Release:        20.04
Codename:       focal

gcc --version: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0

armis:

lsb_release -a:

LSB Version:    :core-4.1-amd64:core-4.1-noarch
Distributor ID: RedHatEnterprise
Description:    Red Hat Enterprise Linux release 8.6 (Ootpa)
Release:        8.6
Codename:       Ootpa

gcc --version: gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-10)

Not sure if it matters, but I believe in both cases the version of sav that was available at the time of compiling was:

sav v2.1.0

If you need any other info please let me know

jonathonl commented 1 year ago

This should now be fixed with https://github.com/statgen/hds-util/commit/763bb2d62b53654b67d7e678d99ad309abe43d0f. Please rebuild with latest from master branch.

brettva commented 1 year ago

@jonathonl I really appreciate the time you put into fixing that, especially so fast. It looks better on my end now.

Another quick question, and sorry if I am missing it somewhere. We certainly want MAF and Rsq recomputed in our merged data, but what is the point of recomputing DS, GT, GP from HDS?

Is it just so that these numbers can be recapitulated from the HDS that appears in the VCF? Regardless, is it always recommended to update DS, GT, GP with -f DS, GT, GP when merging? IIUC from this issue, at least Rsq is originally based on more precise HDS than what is seen in the VCF.

jonathonl commented 1 year ago

It's recomputed for the sake of simpler code. There is a plan for future versions of the imputation server to only export HDS in the output files in order to reduce compute and storage costs. Most people don't need all four FORMAT fields, so hds-util allows you to generate only the fields needed by a user for downstream analysis.
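For readers wondering why HDS alone is sufficient, here is a minimal sketch of how the other three fields can be derived from the two haplotype dosages of a diploid sample at a biallelic site. The function name `fields_from_hds` is hypothetical, and the formulas are the usual textbook relationships, assumed here rather than taken from the hds-util source.

```python
from typing import Tuple

def fields_from_hds(h1: float, h2: float) -> Tuple[str, float, Tuple[float, float, float]]:
    """Derive (GT, DS, GP) from two haplotype dosages, each in [0, 1]."""
    # GT: best-guess allele on each haplotype, kept phased.
    a1 = 1 if h1 >= 0.5 else 0
    a2 = 1 if h2 >= 0.5 else 0
    gt = f"{a1}|{a2}"
    # DS: expected ALT allele count is the sum of the haplotype dosages.
    ds = h1 + h2
    # GP: probabilities of 0, 1, 2 ALT alleles, treating the haplotypes as independent.
    gp = (
        (1.0 - h1) * (1.0 - h2),
        h1 * (1.0 - h2) + h2 * (1.0 - h1),
        h1 * h2,
    )
    return gt, ds, gp
```

Because each field is a deterministic function of HDS, a file that stores only HDS loses nothing that cannot be regenerated on demand.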

In the latest version of Minimac4, the Rsq is computed after the precision loss. But Imputation Server is still using the older version so that issue would still apply. In any case, the median difference is quite small and I suspect it would have negligible effects on Rsq filtering strategies.
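To illustrate how small the effect of precision loss typically is, the sketch below computes allele frequency and an estimated Rsq from haplotype dosages before and after rounding. It assumes the commonly cited form Rsq = Var(HDS) / (p(1 - p)) and rounding to three decimal places; both are assumptions for illustration, not confirmed details of Minimac4 or hds-util, and the function names and dosage values are made up.

```python
from typing import Sequence

def allele_freq(hds: Sequence[float]) -> float:
    """ALT allele frequency: the mean haplotype dosage across all haplotypes."""
    return sum(hds) / len(hds)

def estimated_rsq(hds: Sequence[float]) -> float:
    """Variance of the haplotype dosages divided by p(1 - p)."""
    p = allele_freq(hds)
    if p <= 0.0 or p >= 1.0:
        return 0.0  # monomorphic site: the ratio is undefined, report 0
    variance = sum((h - p) ** 2 for h in hds) / len(hds)
    return variance / (p * (1.0 - p))

# Hypothetical full-precision dosages and their 3-decimal counterparts.
full = [0.01234, 0.98765, 0.49996, 0.00049]
rounded = [round(h, 3) for h in full]
diff = abs(estimated_rsq(full) - estimated_rsq(rounded))
```

With these toy values `diff` is on the order of 0.001, which is consistent with the point that Rsq filtering thresholds are unlikely to be affected.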

brettva commented 1 year ago

@jonathonl That makes a lot of sense thanks