odelaneau / shapeit5

Segmented HAPlotype Estimation and Imputation Tool
https://odelaneau.github.io/shapeit5/
MIT License
56 stars 9 forks source link

Phase_common: AC field is needed in file #45

Open AkshajD opened 1 year ago

AkshajD commented 1 year ago

I am running phase_common on a BCF file (as well as variants to exclude in another BCF) acquired from an Axiom array.

While running, it gives an error stating that AC field is needed in file for each of the chromosomes. A sample of this is included below.

[SHAPEIT5] phase_common (jointly phase multiple common markers)
  * Author        : Olivier DELANEAU, University of Lausanne
  * Contact       : olivier.delaneau@gmail.com
  * Version       : 5.1.1 / commit = 990ed0d / release = 2023-05-08
  * Run date      : 03/07/2023 - 15:32:31

Files:
  * Input         : [PATH/output/CANPREDDICT.FINAL_GENOTYPING_QC_2021_03_25.AxiomGT1.MoChA.chrX.bcf]
  * Reference     : [PATH/GRCh37/ALL.chrX.phase3_integrated.20130502.genotypes.bcf]
  * Genetic Map   : [PATH/output/genetic_map.chrX.txt]
  * Output        : [PATH/output/CANPREDDICT.FINAL_GENOTYPING_QC_2021_03_25.AxiomGT1.MoChA.chrX.pgt.bcf]
  * Output format : [bcf]

Parameters:
  * Seed    : 15052011
  * Threads : 4 threads
  * MCMC    : 15 iterations [5b + 1p + 1b + 1p + 1b + 1p + 5m]
  * PBWT    : [window = 4cM / depth = auto / modulo = auto / mac = 5 / missing = 0.1]
  * HMM     : [window = 4cM / Ne = 15000 / Recombination rates given by genetic map]

Reading genotype data:

ESC[31mERROR: ESC[0mAC field is needed in file

Since I am still fairly new to genotyping pipelines, I am unfamiliar with what the AC field is and how I can make sure it is present in my file (as well as which file (aka the input BCF or one of the references/genetic maps) needs it).

Would appreciate any advice, thank you!

RJHFMSTR commented 12 months ago

Hi,

AC field is required because it is the Allele Count field, which is used to compute allele frequency and keep only common variants above the selected threshold.

If your input data does not have this AC field, you can compute it using the bcftools fill-tags plugin (https://samtools.github.io/bcftools/howtos/plugin.fill-tags.html)

Best, Robin

freeseek commented 11 months ago

Notice that +fill-tags can be up to 10x slower than +fill-AN-AC, so the latter is recommended. You can also use bcftools view -c 0 to fill the AN and AC fields though this recomputes AN and AC only if they are not already present. Why is it that you have to have AN and AC though? Couldn't SHAPEIT5 use the bcf_calc_ac() function from htslib/htslib/vcfutils.h to compute AN and AC? This is a function that would allow to use existing AN and AC values if present and compute them from genotypes otherwise:

/**
 *  bcf_calc_ac() - calculate the number of REF and ALT alleles
 *  @header:  for access to BCF_DT_ID dictionary
 *  @line:    VCF line obtained from vcf_parse1
 *  @ac:      array of length line->n_allele
 *  @which:   determine if INFO/AN,AC and indv fields be used
 *
 *  Returns 1 if the call succeeded, or 0 if the value could not
 *  be determined.
 *
 *  The value of @which determines if existing INFO/AC,AN can be
 *  used (BCF_UN_INFO) and and if indv fields can be split (BCF_UN_FMT).
 */
HTSLIB_EXPORT
int bcf_calc_ac(const bcf_hdr_t *header, bcf1_t *line, int *ac, int which);

And this is how the function is used in bcftools/plugins/fill-AN-AC.c:

bcf1_t *process(bcf1_t *rec)
{
    hts_expand(int,rec->n_allele,marr,arr);
    int ret = bcf_calc_ac(in_hdr,rec,arr,BCF_UN_FMT);
    if ( ret>0 )
    {
        int i, an = 0;
        for (i=0; i<rec->n_allele; i++) an += arr[i];
        bcf_update_info_int32(out_hdr, rec, "AN", &an, 1);
        bcf_update_info_int32(out_hdr, rec, "AC", arr+1, rec->n_allele-1);
    }
    return rec;
}
mubita3 commented 9 months ago

I am relatively new to bioinformatics. I would like to do haplotype phasing using phase_common (shapeit5) before carrying out genotype imputation using Beagle 5.4 but I am getting an ERROR: AC field is needed in file. The code I ran is shown below: `[SHAPEIT5] phase_common (jointly phase multiple common markers)

Files:

Parameters:

Reading genotype data: ERROR: AC field is needed in file`

I do not understand why this error happened because I filled tags to my input file using the code: bcftools +fill-tags pk_hg19_chr6.bcf -Ob -o pk_hg19_chr6_tags.bcf -- -t all

The header of the vcf file has all the necessary tags e.g.

fileformat=VCFv4.3

Contig=

FORMAT=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

contig=

filedate=20230825

reference=file:/home/mubita/bioinformatics/input_liftover/Homo_sapiens.GRCh37.dna.chromosome.6.fa

source=VCF-simplify

In addition, an excerpt of the vcf file before filling the tags looks like this:

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 101_Ov 102_Ov 103_Ov 104_Ov 107_Ov 110_Ov 113_Ov 114_Ov 115_Ov 117_Ov 118_Ov 121_Ov 122_Ov 124_Ov 127_Ov 128_Ov 129_Ov 130_Ov 132_Ov 133_Ov 134_Ov 136_Ov 138_Ov 139_Ov 141_Ov 144_Ov 147_Ov 148_Ov 150_Ov 152_Ov 153_Ov 155_Ov 156_Ov 157_Ov 159_Ov 162_Ov 163_Ov 165_Ov 166_Ov 167_Ov 168_Ov 169_Ov 170_Ov 171_Ov 172_Ov 173_Ov 177_Ov 178_Ov 179_Ov 181_Ov 185_Ov 187_NO 188_NO 189_NO 192_NO 194_NO 196_NO 198_NO 199_NO 201_NO 206_NO 207_NO 210_NO 211_NO 213_NO 214_NO 215_NO 216_NO 217_NO 219_NO 223_NO 224_NO 225_NO 226_NO 227_NO 231_NO 234_NO 235_NO 238_NO 239_NO 240_NO 241_NO 244_NO 245_NO 246_NO 248_NO 249_NO 250_NO 254_NO 255_NO

6 160557643 rs2282143 C T . PASS AA=C;AC=33;AF=0.1138;AN=290;NS=145 GT 0/1 0/1 0/0 0/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 1/1 0/1 0/0 0/0 0/1 0/0 0/1 0/0 0/0 0/0 0/0 0/0 0/1 0/0 0/1 0/0 0/0 0/0 0/0 0/1 0/0 0/1 0/0 0/0 1/1 0/0 0/1 0/0 0/0 0/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/1 0/0 0/1 0/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/1 ./. 0/1 0/0 0/0 0/0 0/0 0/0 0/1 0/0 0/0 0/0 0/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0

6 160560881 rs72552763 ATGA A . PASS AA=ATGA;AC=9;AF=0.03061;AN=294;NS=147 GT 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/1 0/1 0/1 0/1 0/1 0/0 0/0 0/0 0/0 0/0 0/1 0/0 0/0 0/1 0/1 0/0 0/0 0/0

Given this information, how can I get rid of this error so that shapeit (phase_common) runs successfully?

ndokmai commented 6 months ago

@mubita3 Not sure if your issue is resolved, but for those still looking for an answer, both your input and reference panel files need the AC field.