samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/
Other
661 stars 240 forks source link

Issue using bcftools fixref #919

Open idalarsson opened 5 years ago

idalarsson commented 5 years ago

Hello,

I have a very annoying problem and hope that you can help me. I have a vcf file containing appr 800 000 snps (with rs id) from 95 subjects. The vcf-file was generated from .bed, .bim and .fam-files using plink(1.9).

I want to use the Sanger Imputation service, but when trying this I got an error message saying "The input file sanity check failed, "bcftools norm -ce" exited with the following message: Reference allele mismatch at 1:3 .. REF_SEQ:'N' vs VCF:'T'". I therefore started looking into the recommendations about using bcftools to check my file.

First, I changed the chromosome-names using bcftools annotate. From what I can see this worked fine, at least when looking at the metainfo in the vcf-file.

Thereafter I ran the bcftools +fixref command to obtain stats about my file. I ran this against the recommended reference file from Sanger, i.e. GRCh37 reference fasta. I get the following stats:

SC      TOP-compatible  0
SC      BOT-compatible  0
# ST, substitution types
ST      A>C     26911   4.2%
ST      A>G     105692  16.5%
ST      A>T     18447   2.9%
ST      C>A     30270   4.7%
ST      C>G     26979   4.2%
ST      C>T     101914  15.9%
ST      G>A     118793  18.6%
ST      G>C     30370   4.8%
ST      G>T     20779   3.3%
ST      T>A     20789   3.3%
ST      T>C     114874  18.0%
ST      T>G     23405   3.7%
# NS, Number of sites:
NS      total           798885
NS      ref match       159700  25.0%
NS      ref mismatch    479523  75.0%
NS      skipped         159662
NS      non-ACGT        159662
NS      non-SNP         0
NS      non-biallelic   0

Apparently, 75% mismatched with the reference (?). Therefore I tried using the recommended allele swapping command: bcftools +fixref vcffromR.chrfix.vcf -Ob -o fixref.vcf -- -d -f human_g1k_v37.fasta -i All_20180418.new.vcf.gz

After a couple of minutes the run ended, only showing the message "Reference base mismatch at 1:101471825 .. A vs C". A file called fixref.vcf has been generated but this one is empty.

Do you have any idea what's the problem with this? I have tried solving this error in all the ways I can come up with but nothing works. Thank you in advance!

finswimmer commented 5 years ago

Which reference genome was used to create your initial vcf file?

With this high number of ref mismatches I guess hg38/GRCh38 or NCBI36 was used and not hg19/GRCh37.

fin swimmer

idalarsson commented 5 years ago

The samples were analyzed using the Affymetrix CytoScan HD array and to obtain genotypes I processed the output using the package affy2sv (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-015-0608-y). In there I define a reference model which was obtained from the Affymetrix GeneChip Array Library Files (CytoScanHD_Array.na32.3.v1.REF_MODEL). This is the only time I define a reference model, when I converted the plink-files to vcf I only used the command: ./plink --bfile plinkfromR --recode vcf-iid --out vcffromR

I'm quite new to all this so I hope that information was that helpful in any way?

finswimmer commented 5 years ago

I'm not familiar with this. But could you show the header of the vcf file and the first 10 variants?

idalarsson commented 5 years ago

Yes, it looks like this:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  U3093   U3102   U3123   U3129   U3137   U3140   U3146   U3164   U3167   U3169   U3172   U3173   U3179   U3180   U3182   U3187   U3202   U3205   U3206   U3212   U3213   U3214   U3217   U3220   U3221   U3226   U3230   U3233   U3234   U3235   U3242   U3243   U3245   U3253   U3257   U3263   U3265   U3272   U3274   U3275   U3279   U3289   U3291   U3299   hAstro  Ref     U3002   U3004   U3005   U3008   U3009   U3013   U3016   U3017   U3019   U3020   U3021   U3024   U3027   U3028   U3029   U3031   U3033   U3034   U3035   U3037   U3039   U3042   U3046   U3047   U3048   U3051   U3051.1 U3053   U3054   U3056   U3060   U3062   U3065   U3068   U3071   U3073   U3078   U3082   U3084   U3085   U3086   U3088   U3101   U3110   U3117   U3118   U3121   U3123.1 U3129.1
1       5       rs6678176       A       G       .       .       PR      GT      0/0     0/1     0/1     0/1     0/1     0/1     1/1     0/1     0/1     1/1     0/0     0/0     0/0     1/1     1/1     1/1     1/1     1/1     0/1     0/0     0/1     0/1     0/0     0/0     0/0     0/1     0/1     1/1     0/1     0/1     0/0     0/1     1/1     0/1     0/0     0/0     0/1     0/0     0/1     0/1     0/0     1/1     0/0     0/1     0/1     0/0     0/0     0/1     0/1     1/1     0/1     0/1     0/1     1/1     0/1     1/1     0/1     0/0     0/0     0/1     0/0     1/1     0/0     1/1     0/0     1/1     0/0     0/1     ./.     1/1     1/1     1/1     1/1     0/0     1/1     0/0     0/1     1/1     0/1     0/1     1/1     0/1     0/0     0/1     1/1     0/1     0/1     0/0     0/1     0/1     0/0     0/0     0/1     0/1     0/1
1       7       rs78286437      C       T       .       .       PR      GT      0/0     0/0     0/1     0/0     0/0     0/1     0/0     0/0     0/0     1/1     0/0     0/0     0/0     0/1     0/0     0/1     0/0     0/1     0/0     0/1     0/1     0/1     0/0     0/0     0/0     0/0     0/1     0/0     0/0     0/0     0/0     0/1     0/0     0/1     0/0     0/0     0/0     0/0     0/0     0/0     0/1     ./.     ./.     0/0     0/1     0/1     0/1     0/1     0/0     0/1     0/1     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/1     0/1     0/1     0/0     0/0     1/1     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/1     0/0     0/0     0/0     0/0     0/1     0/0     0/0     0/0     0/1     0/0     0/0     0/0     0/1     0/0     0/0     0/0     0/1     0/1     0/0     0/1     0/0
1       17      rs72961022      C       T       .       .       PR      GT      0/0     0/0     0/0     0/0     0/0     ./.     0/0     0/0     0/0     0/0     0/0     0/0     0/1     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     1/1     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/1     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/1     0/1     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0
1       46      rs12082355      C       T       .       .       PR      GT      0/0     0/0     0/1     0/0     0/0     ./.     0/0     0/1     0/0     0/0     0/1     ./.     0/0     0/1     0/0     ./.     0/0     0/1     0/1     0/0     1/1     1/1     0/0     1/1     1/1     1/1     0/1     0/0     0/0     0/0     0/0     0/0     1/1     1/1     0/0     0/1     0/1     0/1     0/1     0/1     ./.     0/0     0/1     1/1     0/1     0/1     ./.     0/1     0/1     0/1     0/1     0/1     0/1     0/1     0/1     0/0     1/1     0/0     0/1     0/0     0/1     ./.     0/0     0/0     0/1     0/1     1/1     0/0     0/0     0/0     0/1     0/0     0/0     0/0     0/0     0/1     0/0     0/0     0/0     0/0     0/0     1/1     0/0     0/1     0/0     0/0     0/1     0/0     0/1     0/0     0/1     0/1     0/0     ./.     0/0
1       53      rs11166268      C       A       .       .       PR      GT      1/1     0/0     0/0     0/0     0/0     0/0     0/1     0/0     0/1     0/0     0/0     0/0     0/0     0/1     0/0     0/0     0/0     0/1     0/1     0/1     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/1     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/1     0/1     0/0     0/0     0/0     0/1     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/1     0/0     0/0     0/0     0/0     1/1     0/0     0/0     0/0     ./.     0/1     0/0     0/1     0/0     0/0     0/0     0/0     0/0     0/0     0/1     0/1     ./.     0/0     0/1     0/0     ./.     0/0     0/1     0/0     0/1     0/0     0/0     0/1     0/0     0/0     0/0     0/0     0/0     0/0
1       73      rs12059609      A       G       .       .       PR      GT      1/1     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/1     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/1     1/1     ./.     0/0     0/1     0/1     0/1     0/0     0/0     0/1     0/1     0/0     0/0     0/1     0/0     0/0     0/0     0/0     0/1     0/0     0/0     0/1     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/1     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/1     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/1     0/0     0/0     0/1     0/0     0/0     1/1     0/0     0/1     0/0     0/0     0/0     0/0     0/0     0/0     0/1     0/0     ./.     0/1     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0
1       92      rs6657921       C       G       .       .       PR      GT      ./.     0/1     0/1     0/1     0/1     0/1     0/0     0/1     1/1     0/1     1/1     0/1     0/1     0/0     0/1     0/1     0/1     1/1     0/1     0/1     0/1     0/1     0/0     0/0     0/0     1/1     0/0     1/1     0/0     0/0     0/0     0/1     0/0     0/0     0/1     1/1     0/1     0/0     1/1     1/1     0/1     0/1     0/1     0/1     0/0     0/1     0/1     0/1     0/1     1/1     1/1     0/1     0/1     0/1     0/1     1/1     0/0     0/0     0/1     0/1     0/1     0/0     0/0     0/1     0/0     0/0     0/0     0/0     0/0     0/0     0/1     0/0     0/0     0/0     0/1     1/1     1/1     0/1     0/1     1/1     0/1     0/1     0/1     0/1     0/1     0/0     0/0     1/1     1/1     0/0     1/1     1/1     0/1     0/1     0/1
1       115     rs7514596       G       C       .       .       PR      GT      0/0     0/0     0/0     0/0     0/1     0/0     0/1     0/1     0/0     0/1     0/1     0/1     0/1     0/1     0/1     0/1     0/1     0/0     0/1     0/1     0/0     0/0     0/0     0/0     0/0     0/0     0/0     ./.     1/1     1/1     0/0     0/0     1/1     ./.     0/0     0/1     0/0     0/1     1/1     1/1     0/1     0/1     0/0     0/0     ./.     0/1     0/0     0/1     0/0     1/1     0/0     0/1     0/1     0/1     0/1     0/0     0/0     0/0     0/1     0/0     0/1     0/0     0/0     0/1     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/1     0/1     1/1     0/1     0/0     0/0     0/1     0/0     0/0     0/0     0/1     0/1     0/1     0/0     1/1     0/0     0/1     0/0     0/0     0/0     0/0     0/0
1       117     rs7530721       G       A       .       .       PR      GT      0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/1     0/1     0/0     0/0     0/0     0/0     0/1     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/1     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0
1       130     rs10875236      G       C       .       .       PR      GT      ./.     0/1     0/1     0/0     0/1     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/1     0/0     0/0     0/0     0/0     0/1     0/1     ./.     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/1     0/0     0/0     0/0     0/0     0/0     0/1     0/0     0/1     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/1     0/1     0/0     0/0     0/1     0/1     ./.     0/0     0/0     0/0     0/0     0/0     0/0     0/1     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/1     0/0     0/0     0/0     0/1     0/0     0/0     0/1     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/1     0/0     0/1     0/0     0/0     0/0     0/0     0/1     0/0
1       131     rs11166274      T       C       .       .       PR      GT      1/1     ./.     0/0     0/0     0/1     1/1     ./.     1/1     1/1     ./.     0/0     0/0     0/1     0/1     ./.     0/1     1/1     1/1     0/1     0/1     1/1     1/1     0/0     0/0     0/0     0/0     0/1     0/1     0/1     0/1     0/0     1/1     0/1     0/1     0/1     ./.     0/1     1/1     0/1     0/1     0/0     0/1     1/1     0/1     1/1     1/1     0/1     1/1     0/0     1/1     0/1     0/0     0/0     1/1     0/0     0/0     1/1     0/0     0/1     0/1     ./.     1/1     0/1     1/1     0/0     1/1     0/1     0/1     0/0     0/1     0/1     ./.     ./.     0/1     0/0     0/1     0/1     ./.     1/1     0/1     0/0     0/1     0/1     0/1     1/1     0/0     0/1     0/1     0/1     0/0     1/1     1/1     0/0     0/0     0/0
1       135     rs11166275      C       A       .       .       PR      GT      ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.
1       178     rs12046770      T       C       .       .       PR      GT      0/0     0/1     0/0     0/0     0/1     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/1     0/0     0/0     0/0     0/0     0/0     0/0     0/1     0/0     0/1     0/1     0/1     0/0     0/0     0/0     0/1     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/1     0/0     0/0     0/0     0/1     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/1     0/1     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/1     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/1     0/0     0/1     0/0     0/0     0/1     0/0     0/0     0/0     0/0     0/0
1       184     rs1556947       C       T       .       .       PR      GT      0/0     0/1     1/1     0/0     0/1     0/0     0/1     0/1     0/0     0/0     0/1     0/0     0/0     0/1     0/1     0/0     0/0     0/1     0/0     0/1     0/1     0/1     1/1     1/1     1/1     0/1     0/0     0/0     0/1     0/1     0/1     0/1     0/0     0/1     0/0     0/1     0/1     0/0     0/0     0/0     0/0     0/1     0/0     0/0     0/1     0/0     0/1     0/1     0/0     0/1     0/0     0/1     0/1     0/0     0/1     1/1     0/0     0/1     0/1     0/0     1/1     0/1     0/0     1/1     0/1     0/0     0/1     0/1     0/0     0/1     0/0     0/1     0/1     0/0     1/1     0/0     0/1     0/1     0/0     0/0     0/0     1/1     0/0     0/1     1/1     0/1     0/1     0/0     0/0     1/1     0/0     0/0     0/0     1/1     0/0
1       186     rs10493912      T       C       .       .       PR      GT      ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.
1       192     rs9660000       G       A       .       .       PR      GT      1/1     0/0     0/0     0/0     0/0     0/1     0/1     0/0     0/0     0/1     0/1     0/0     0/0     0/0     0/1     0/1     0/0     0/0     0/0     0/1     0/0     0/0     0/0     1/1     1/1     0/0     0/1     0/1     0/1     0/1     0/0     1/1     0/1     0/0     0/0     0/0     0/0     0/1     0/1     0/1     0/0     0/0     0/1     0/0     0/0     0/1     0/0     1/1     0/0     0/0     0/1     0/0     0/0     0/1     0/0     0/0     1/1     0/0     0/0     0/0     0/1     0/0     0/1     0/0     0/0     0/1     0/1     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/1     0/0     0/0     0/0     0/0     0/1     0/1     0/0     0/0     0/0     0/1     0/1     0/0     0/0     0/0     0/0     0/0     0/0
finswimmer commented 5 years ago

Hello again,

these positions looks definitely wrong. Even it is possible to get the correct position ,based on the rs ids, I would recommend to find out what happens on the way to that vcf.

As this is not a bcftools issue I would suggest you close your question here and start a new one on biostars. There will be people that are familiar the software you use.

@pd3: I hope it is ok to redirect people to other platforms, if their problem is not a bcftools issue.

fin swimmer

idalarsson commented 5 years ago

I will do so! Thank you for your time :)

pd3 commented 5 years ago

Hi, I am late to this thread. @finswimmer , yes, it's absolutely OK to redirect people to other solutions. However, in this case your VCF contains the rsIDs and you should be able to fix the file easily using the following command

bcftools +fixref file.bcf -Ob -o out.bcf -- -d -f ref.fa -i dbSNP.vcf.gz

where the file dbSNP.vcf.gz has been downloaded from https://www.ncbi.nlm.nih.gov/variation/docs/human_variation_vcf

idalarsson commented 5 years ago

@pd3 Thank you for your suggestion! As you can see in my first post, I tried using the command you mention, but with the All_20180418.vcf.gz instead of dbSNP.vcf.gz, since this was the one suggested in the instructions I followed (here: http://samtools.github.io/bcftools/howtos/plugin.fixref.html). So my questions are 1) what will be the difference if I use the dbSNP.vcf.gz and 2) I cannot find the file when I follow your link, could you point me to the exakt location of the file?

Thank you!

finswimmer commented 5 years ago

Hello again,

before choosing a dbSNP vcf file you must decide which reference genome you like to use. GRCh37 (hg19) or GRCh38 (hg38)? GRCh37 is outdated since Dec 2013. So I strongly recommend using GRCh38. You can find the file here

@pd3 is absolutely right that you can use bcftools this way to fix your vcf file. But if you use the tool, that generate the initial vcf file, in that way it produces wrong coordinates, what else might be wrong? So please, go back to start and check what's going on there othwerwise there is a high risk that your final results will be wrong and/or you encounter much more problems on the way to there.

fin swimmer

idalarsson commented 5 years ago

@finswimmer Again, thank you for your suggestions, as I said before I am still quite new to all this so I still learn a lot. Did you leave a link in your post? I cannot see it

finswimmer commented 5 years ago

Oh, github doesn't recognize the ftp protocol in the url (bug?) and so it doesn't include the link. I changed it to http as this works as well.

Celestic1 commented 5 years ago

I am getting the same error as well, were you able to find a fix? My VCF file has a only 113 mismatches and 1962 lines, but when I run fixref, all 1962 become unresolved and then gets deleted because of the -d flag.

idalarsson commented 5 years ago

@Celestic1 The problem was partly fixed when I first converted my VCF file to BCF using the view command in bcftools and then ran the fixref, at least then I didn't get an empty outfile. Maybe that could work for you? However, the outfile didn't seem 100 % correct so I've started to suspect that there is some other error with my VCF file not connected to the fixref. Hope that helped!

Celestic1 commented 5 years ago

@idalarsson Okay thank you, I will try it out and let you know how it goes.

Edit: Update, converting my files to BCF did not work, I will have to ask around since I can't find my answer anywhere online :( and my work is bugging me to get it done asap lol.

DaynaSmith commented 2 years ago

Four years later and I have been faced with the same problem :( Was there a solution to this issue @idalarsson ?