hsinnan75 / MapCaller

MapCaller – An efficient and versatile approach for short-read alignment and variant detection in high-throughput sequenced genomes
MIT License

Meaning of DUP in vcf output #51

Open tseemann opened 4 years ago

tseemann commented 4 years ago

How do I interpret this line?

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  unknown
Chr     104092  .       T       <*>     0       DUP     END=104221

It was introduced only in the ChangeLog as:

0.9.9.21: Reported CNV (N>=2) regions with the flag "DUP" (experimental).

Does it mean the region from 104092 (which starts with a T) to 104221 (130 bp long, counting both endpoints) is duplicated exactly somewhere else in the genome? If so, where is it duplicated? What if it is triplicated? How do we know what N you predicted for it? Does this have anything to do with multi-mapping reads?

hsinnan75 commented 4 years ago

Yes. The region from 104092 to 104221 is reported as duplicated because multi-mapping reads align to it; MapCaller identifies CNVs from multi-mapping reads. MapCaller does not predict copy numbers, since they are difficult to estimate.
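
For readers who want to pull these records out of a MapCaller VCF, here is a minimal Python sketch, not part of MapCaller itself. It assumes only what the record quoted above shows: `DUP` appears in the FILTER column and the region end is given as `END=` in INFO. The function name `parse_dup_regions` is made up for illustration. The length is computed as `END - POS + 1`, treating both coordinates as 1-based and inclusive, which gives 130 bp for the record in question.

```python
def parse_dup_regions(vcf_text):
    """Return (chrom, start, end, length) for every record whose FILTER is DUP.

    Assumption: DUP regions look like the record quoted in this thread,
    i.e. FILTER == "DUP" and INFO contains END=<position>.
    """
    regions = []
    for line in vcf_text.splitlines():
        # Skip blank lines and header lines.
        if not line.strip() or line.startswith("#"):
            continue
        # VCF is tab-delimited; splitting on any whitespace also handles
        # records that were pasted with spaces instead of tabs.
        fields = line.split()
        if len(fields) < 8 or fields[6] != "DUP":
            continue
        chrom, pos, info = fields[0], int(fields[1]), fields[7]
        # INFO is a semicolon-separated list of KEY=VALUE pairs or bare flags.
        info_map = dict(kv.split("=", 1) for kv in info.split(";") if "=" in kv)
        if "END" not in info_map:
            continue
        end = int(info_map["END"])
        regions.append((chrom, pos, end, end - pos + 1))
    return regions


sample = (
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tunknown\n"
    "Chr\t104092\t.\tT\t<*>\t0\tDUP\tEND=104221\n"
)
print(parse_dup_regions(sample))  # [('Chr', 104092, 104221, 130)]
```

Note that this only lists the flagged regions. It cannot say where the other copy is or how many copies exist, because the VCF does not carry that information.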

tseemann commented 4 years ago

So the reads aligning to this region in the reference genome ALSO align to another region in the reference genome.

Where is that other region?
Can you add tags to link all the duplicate regions? Maybe put DUP00001 in the ID column for all the same DUP regions?
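
To make the proposal concrete, here is a rough Python sketch of what the tagging could look like. It is purely illustrative: MapCaller does not currently output which regions are copies of each other, so the `groups` argument (lists of `(chrom, pos)` keys that belong together) is assumed to come from somewhere else, and the function name `tag_dup_groups` and the second record in the example are hypothetical.

```python
def tag_dup_groups(records, groups):
    """Write a shared ID (DUP00001, DUP00002, ...) into the ID column.

    records: tab-separated VCF data lines (no header lines).
    groups:  list of groups; each group is a list of (chrom, pos) keys
             for DUP records assumed to be copies of the same sequence.
    """
    # Map each (chrom, pos) key to the ID of the group it belongs to.
    ids = {}
    for n, group in enumerate(groups, start=1):
        for key in group:
            ids[key] = f"DUP{n:05d}"

    tagged = []
    for line in records:
        fields = line.split("\t")
        key = (fields[0], int(fields[1]))
        # Only touch records flagged DUP that belong to a known group.
        if fields[6] == "DUP" and key in ids:
            fields[2] = ids[key]
        tagged.append("\t".join(fields))
    return tagged


records = [
    "Chr\t104092\t.\tT\t<*>\t0\tDUP\tEND=104221",
    "Chr\t250000\t.\tT\t<*>\t0\tDUP\tEND=250129",  # hypothetical second copy
]
groups = [[("Chr", 104092), ("Chr", 250000)]]
for line in tag_dup_groups(records, groups):
    print(line)
```

With IDs like this, all copies of a region could be collected by filtering on the ID column.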

hsinnan75 commented 4 years ago

It is a good suggestion. I'll try to implement this feature.