odelaneau / shapeit5

Segmented HAPlotype Estimation and Imputation Tool
https://odelaneau.github.io/shapeit5/
MIT License
56 stars 9 forks source link

"Wrong order in your genetic map file" in concatenated genetic files #59

Closed chunxuan-hs closed 10 months ago

chunxuan-hs commented 10 months ago

I am trying to phase a trios dataset (only three samples, offspring, father, mather). I dowloaded the tar.gz file of hg38 genetic maps in the resource folder, unzipped it and concatenated all files (execpt X) to a single file.

However, I glot the following error, ERROR: Wrong order in your genetic map file 248924793bp / 286.27923cM > 12994bp / 0.00000cM

It turns out the position locates at the end of chr1.

248924793 1 286.279234 12994 2 0 15491 2 0.000848 15672 2 0.000908

How to specify the genetic files correctly, and is a concatenated file including all chr allowed?

My codes:

#- Genetics maps of all chrome
path_map="genetic_map_hg38_shapeit5.gz"
phase_common --input $path_vcf \
    --pedigree $path_fam  \
    --reference $path_ref \
    --region 22 \
    --thread 8 \
    --map $path_map \
    --output test.vcf.gz

Many thanks!

srubinacci commented 10 months ago

Hi,

The map files are not meant to be concatenated or unzipped. Just pass the correct map file to the software according to the chromosome you are phasing.

Simone

chunxuan-hs commented 10 months ago

@srubinacci Thanks for the info!

Another related question, in my original VCFs there are many other data, like QUAL, DP, INFO, etc. However, in the phased VCF all those info were removed. Did I make a mistake with shapeit5? Is there a way to update my orginal VCF with phased haplotype info?

Thanks!

The commands:

for CHR in {1..22}; do
    MAP=chr${CHR}.b38.gmap.gz
    OUT=Element_DBS_chr${CHR}.bcf
    LOG=Element_DBS_chr${CHR}.log

    phase_common --input $path_vcf \
        --map $MAP \
        --region $CHR \
        --pedigree $path_fam  \
        --reference $path_ref \
        --thread 10 \
        --log $LOG \
        --output $OUT
done

An example. The input record,

1       61987   chr1_61987_A_G;rs76735897       A       G       33.0    .       AF=0.5;AQ=33;AN=6;AC=3;ANN=G|upstream_gene_variant|MODIFIER|OR4G11P|ENSG00000240361|transcript|ENST00000492842.2|transcribed_unprocessed_pseudogene||n.-962A>G|||||962|,G|upstream_gene_variant|MODIFIER|OR4F5|ENSG00000186092|transcript|ENST00000641515.2|protein_coding||c.-3492A>G|||||3432|,G|intron_variant|MODIFIER|OR4G11P|ENSG00000240361|transcript|ENST00000642116.1|processed_transcript|2/2|n.214-929A>G||||||;COMMON;FREQ=1000Genomes:0.6042,0.3958|GnomAD:0.5273,0.4727|KOREAN:0.8738,0.1262|Korea1K:0.8633,0.1367|NorthernSweden:0.4826,0.5174|Qatari:0.5146,0.4854|SGDP_PRJ:0.4007,0.5993|Siberian:0.35,0.65|TOMMO:0.8642,0.1358|dbGaP_PopFreq:0.4882,0.5118;GNO;RS=76735897;SSR=0;VC=SNV;dbSNPBuildID=131;SNP;VARTYPE=SNP;Cases=0,1,1;Controls=1,0,2;CC_TREND=1.000e+00;CC_GENO=2.231e-01;CC_ALL=8.000e-01;CC_DOM=6.667e-01;CC_REC=6.667e-01  GT:DP:AD:GQ:PL:RNC      0/1:23:22,1:7:6,0,24:.. 1/1:26:0,26:27:33,28,0:..       0/0:36:35,1:30:0,30,40:..

And after phasing, the record:

1       61987   chr1_61987_A_G;rs76735897       A       G       .       .       AC=3;AN=5102    GT      1|0     1|1     0|0
srubinacci commented 10 months ago

You did not do anything wrong. We remove these tags because we would need to store them in memory the whole time otherwise. If you need them back, you can use bcftools annotate to reannotate your file.

Simone

chunxuan-hs commented 10 months ago

Thanks for the quick reply, bcftools worked well!