twolinin / longphase

GNU General Public License v3.0
99 stars 9 forks source link

Question: calculating phaseblock N50 and switch error #25

Closed zhengzhenxian closed 1 year ago

zhengzhenxian commented 1 year ago

Hi all,

Any suggestion about calculating the phase block N50 and switch error using the phased VCF and haplotagged BAM(standard GIAB sample) produced by longphase? I try to use the script here, but no clue about the input format. Any tools or scripts suggested?

Thanks in advance!

Zhenxian

twolinin commented 1 year ago

Hi

You can calculate swith error by whatshap compare.

whatshap compare benchmark.vcf phased.vcf

You can calculate block N50 by following script.

grep -e "0|1" -e "1|0" $1.vcf |
grep -v "#" |
awk '{print $1 "\t" $2 "\t" $10}' |
sed 's/:/\t/g' |
awk '{print $1 "\t" $2 "\t" $3 "\t" $9 }' |  # $1 is chr, $2 is pos, $3 is the value of GT tag $9 is the value of PS tag 
sort -k1n -k4n -k2n |
awk '{
    if( $4 != upPS ){
        if(NR>1 && count > 1){
            print upChr "\t" strPos "\t" upPos - strPos
        }

        strPos = $2
        count=0
    }
    upPS = $4
    upPos = $2
    upChr = $1
    count++
}
END{
    print upChr "\t" strPos "\t" upPos - strPos
}' | sort -k3rn > $1.block

awk '
ARGIND==1{
    #sum of block length
    sum+=$3
}
ARGIND==2{
    count+=$3
    if(count>=sum/2){print "N50: " $3 ;exit;}
}'  $1.block $1.block

Thanks.

zhengzhenxian commented 1 year ago

Thanks so much, very helpful to me!