lh3 / dipcall

Reference-based variant calling pipeline for a pair of phased haplotype assemblies
MIT License

Interpretation of bed files #10

Open cjain7 opened 2 years ago

cjain7 commented 2 years ago
$ cat prefix.dip.bed | awk 'BEGIN{SUM=0}{ SUM+=$3-$2 }END{print SUM}'
2823519412
$ cat prefix.hap1.bed | awk 'BEGIN{SUM=0}{ SUM+=$3-$2 }END{print SUM}'
2690214366
$ cat prefix.hap2.bed | awk 'BEGIN{SUM=0}{ SUM+=$3-$2 }END{print SUM}'
2817991873

As per the documentation: The prefix.dip.bed file gives the confident regions. A base is included in the BED if 1) it is covered by one >=50kb alignment with mapQ>=5 from each parent and 2) it is not covered by other >=10kb alignments in each parent. Based on this, shouldn't the total interval length in prefix.dip.bed be no greater than that of prefix.hap1.bed or prefix.hap2.bed? In other words, shouldn't prefix.dip.bed be the intersection of the two haplotype-specific BED files? In the output above it is larger than both.

Could you clarify the relationship between these three BED files?

lh3 commented 2 years ago

This is probably caused by the sex chromosomes. chrX and chrY are handled differently.
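
One way to check this would be to break the totals down per chromosome and see whether the excess in prefix.dip.bed comes from chrX and chrY. The following is only a sketch: it assumes plain three-column BED input, and the function name bed_len_by_chrom is made up for illustration and is not part of dipcall.

```shell
# Sum BED interval lengths (end - start) per chromosome.
# Hypothetical helper, not a dipcall command.
bed_len_by_chrom() {
    awk '{ sum[$1] += $3 - $2 }
         END { for (c in sum) printf "%s\t%.0f\n", c, sum[c] }' "$1" | sort -k1,1
}

# Print the per-chromosome totals for each file from this thread, if present.
for f in prefix.hap1.bed prefix.hap2.bed prefix.dip.bed; do
    [ -f "$f" ] || continue
    echo "== $f"
    bed_len_by_chrom "$f"
done
```

If the autosomes follow the expected pattern (dip no larger than either haplotype) and only the chrX and chrY rows do not, that would be consistent with the explanation above.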