samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/

bcftools convert #1645

Closed by eastbourne 2 years ago

eastbourne commented 2 years ago

I am running the following command to convert a 23andme file to a vcf file.

bcftools convert -c ID,CHROM,POS,AA -s SampleName -f 23andme-ref.fa --tsv2vcf 23andme.txt -Oz -o out.vcf.gz

I have ensured that the 23andme file is tab-separated. I am able to get some output from bcftools, but I am unsure whether the program is working correctly. The head of my original file is:

rs12124819  1   776546  --
rs12127425  1   794332  GG
rs79373928  1   801536  TT
rs7538305   1   824398  AA
rs28444699  1   830181  AA
rs116452738 1   834830  GG
rs72631887  1   835092  TT

and I am getting the output as:

CHROM   POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  SampleName
1   776546  rs12124819  A   .   .   .   .   GT  ./.
1   794332  rs12127425  G   .   .   .   .   GT  0/0
1   801536  rs79373928  T   .   .   .   .   GT  0/0
1   824398  rs7538305   A   .   .   .   .   GT  0/0
1   830181  rs28444699  A   .   .   .   .   GT  0/0

Are the REF and ALT columns correct? They do not show the same information as the original file. Am I missing something else here? Thank you in advance for any help.

pd3 commented 2 years ago

The output is correct. The homozygous reference genotypes are correctly written as 0/0 and there are no alternate alleles. What do you think is the problem here?
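To make the point above concrete, here is a minimal sketch of how a 23andMe-style call relates to the REF, ALT and GT columns. It is a simplified model for single-nucleotide calls only, not the bcftools implementation; the function name `call_to_vcf` is made up for illustration. The three REF/call pairs are taken from the rows shown earlier in this thread.

```python
def call_to_vcf(ref: str, call: str) -> tuple[str, str]:
    """Map a 23andMe-style call (e.g. "GG", "AG", "--") to (ALT, GT) given REF.

    Simplified illustration for SNP calls; not bcftools' actual code.
    """
    # "--" (or an empty call) means no call: nothing to put in ALT, genotype missing.
    if not call or "-" in call:
        return ".", "./."

    alts: list[str] = []     # non-reference alleles seen in this call, in order
    indices: list[int] = []  # allele indices: 0 = REF, 1.. = position in ALT
    for allele in call:
        if allele == ref:
            indices.append(0)
        else:
            if allele not in alts:
                alts.append(allele)
            indices.append(alts.index(allele) + 1)

    alt = ",".join(alts) if alts else "."  # "." when the sample carries only REF
    gt = "/".join(str(i) for i in sorted(indices))
    return alt, gt


# REF allele and 23andMe call for the first three rows in this thread.
for ref, call in [("A", "--"), ("G", "GG"), ("T", "TT")]:
    print(ref, call, *call_to_vcf(ref, call))
```

When the sample carries only the reference allele, the call alone does not reveal which alternate allele exists at that site, so ALT stays "." and the genotype is 0/0.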

eastbourne commented 2 years ago

Thank you for your reply. I think I was expecting both alleles to be listed in the REF and ALT columns, as in the following output:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  SampleName
1   13380   rs571093408 C   G   .   PASS    RefPanelAF=7.69941e-05;AN=2;AC=0;INFO=1.24993e-09   GT:ADS:DS:GP    0|0:0.05,0.05:0.1:0.9025,0.095,0.0025
1   16071   rs541172944 G   A   .   PASS    RefPanelAF=0.000123191;AN=2;AC=0;INFO=1.24993e-09   GT:ADS:DS:GP    0|0:0.05,0.05:0.1:0.9025,0.095,0.0025
1   16141   rs529651976 C   T   .   PASS    RefPanelAF=0.000138589;AN=2;AC=0;INFO=1.24993e-09   GT:ADS:DS:GP    0|0:0.05,0.05:0.1:0.9025,0.095,0.0025
1   16280   .   T   C   .   PASS    RefPanelAF=0.00066215;AN=2;AC=0;INFO=1.24993e-09    GT:ADS:DS:GP    0|0:0.05,0.05:0.1:0.9025,0.095,0.0025
1   49298   rs200943160 T   C   .   PASS    RefPanelAF=0.640145;AN=2;AC=2;INFO=9.82494e-10  GT:ADS:DS:GP    1|1:0.65,0.65:1.3:0.1225,0.455,0.4225

My goal is to have VCF files to run on the Michigan Imputation Server. Is there a way to get files in the format I expect? Thank you

pd3 commented 2 years ago

The file is a correctly formatted VCF. Are you saying it is not accepted by the imputation server?

As for filling in the ALT allele, that's not possible at the moment, although there was an attempt to add this functionality in the past: https://github.com/samtools/bcftools/commit/0792ae8b91e1efb7f0c904e90a4771944a9cc7c8.

eastbourne commented 2 years ago

I ran the job again and inspected the results. The file is accepted by the Imputation Server, but the server is discarding many sites. I am not sure whether that is because of the file format or the server parameters.

Excluded sites in total: 451,287
Remaining sites in total: 147,340
See snps-excluded.txt for details
Typed only sites: 5,091
See typed-only.txt for details

pd3 commented 2 years ago

Why don't you look at which sites were excluded and check whether they all have ALT=.?
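One way to run this check is to tally the excluded sites by whether their ALT field is ".". The sketch below assumes that the first whitespace-separated field of each line in snps-excluded.txt has the form CHROM:POS:REF:ALT, as in the excerpt posted in this thread; the function name `tally_excluded` is made up for illustration.

```python
from collections import Counter


def tally_excluded(lines):
    """Count excluded sites by whether ALT is missing (".") or present.

    Assumes each data line starts with CHROM:POS:REF:ALT, followed by the
    filter description; lines starting with "#" are treated as headers.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        site = line.split()[0]               # e.g. "1:69869:T:."
        chrom, pos, ref, alt = site.split(":")
        counts["ALT missing" if alt == "." else "ALT present"] += 1
    return counts


# A few lines in the format of the server's snps-excluded.txt report.
sample = [
    "#Position   FilterType   Info",
    "1:69869:T:. Invalid Alleles",
    "1:565508:G:.    Invalid Alleles",
    "1:727841:G:.    Invalid Alleles",
]
print(tally_excluded(sample))
```

For the full report, the same function can be given an open file handle, e.g. `tally_excluded(open("snps-excluded.txt"))`.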

eastbourne commented 2 years ago

Indeed, those are ALT = .

#Position   FilterType   Info
1:69869:T:. Invalid Alleles
1:565508:G:.    Invalid Alleles
1:727841:G:.    Invalid Alleles
1:754105:C:.    Invalid Alleles
1:759036:G:.    Invalid Alleles
1:776546:A:.    Invalid Alleles
1:794332:G:.    Invalid Alleles
1:801536:T:.    Invalid Alleles
1:824398:A:.    Invalid Alleles
1:830181:A:.    Invalid Alleles
1:834830:G:.    Invalid Alleles
1:835092:T:.    Invalid Alleles

pd3 commented 2 years ago

It is now possible to transfer ALT from one VCF into another (https://github.com/samtools/bcftools/commit/f6047f8eb4bf9a74cda6adafff5bf5c32f723483), e.g. with

bcftools annotate -a annots.vcf.gz -c +ALT file.vcf.gz

I hope this will help to resolve the issue.
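The idea of transferring ALT from another file can be sketched as a lookup. This is only a conceptual illustration, not how bcftools annotate is implemented: the matching key (CHROM, POS, REF), the function name `fill_missing_alt`, and the ALT value in the example are all assumptions made up for this sketch.

```python
def fill_missing_alt(records, known_alt):
    """Return copies of the records with ALT filled in from a lookup where it is ".".

    records:   list of dicts with keys CHROM, POS, REF, ALT
    known_alt: dict mapping (CHROM, POS, REF) to an ALT string
               (assumed matching rule, for illustration only)
    """
    filled = []
    for rec in records:
        rec = dict(rec)  # copy so the input records are left unchanged
        if rec["ALT"] == ".":
            key = (rec["CHROM"], rec["POS"], rec["REF"])
            rec["ALT"] = known_alt.get(key, ".")  # stays "." if the site is unknown
        filled.append(rec)
    return filled


# Two sites from this thread; the ALT allele "A" below is a made-up placeholder,
# not the real alternate allele for this position.
records = [
    {"CHROM": "1", "POS": 794332, "REF": "G", "ALT": "."},
    {"CHROM": "1", "POS": 801536, "REF": "T", "ALT": "."},
]
known_alt = {("1", 794332, "G"): "A"}

for rec in fill_missing_alt(records, known_alt):
    print(rec["CHROM"], rec["POS"], rec["REF"], rec["ALT"])
```

A 0/0 genotype remains valid after ALT is filled in, because allele index 0 still refers to REF.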

pd3 commented 2 years ago

I believe this issue can be marked as resolved now

stephanc0 commented 7 months ago

What server are you using? I tried using the Michigan Imputation Server to impute my 23andMe genome, processed identically to yours, and got a message that there was a minimum of 20 genomes required for imputation. Thanks!