google / deepvariant

DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data.
BSD 3-Clause "New" or "Revised" License
3.12k stars 701 forks source link

Indel calling for long read sequencing #836

Open may199128 opened 4 days ago

may199128 commented 4 days ago

Describe the issue:

We are currently using two different long-read sequencing systems: PacBio Direct and PacBio Capture. We are encountering two specific issues:

  1. PacBio Direct Data (top track): Despite having reads shown in both the IGV and VCF files, the genotype (GT) is marked as "./.," and the genotype quality (GQ) is very low. This indel locus has higher coverage (141x) than the average coverage across the genome (128x)

2.For the PacBio Capture data (bottom track) has an average genome coverage of 2897x. Even after setting the mapping quality to >30, the read count at this indel locus is still 3000x, which is significantly higher than the read count indicated in the VCF file. Why is there such a discrepancy between the read counts in IGV and VCF?

I have used the default parameters (--model_type PACBIO) for both dataset.

Thank you.

Long read_DeepVariant

kishwarshafin commented 3 days ago

Hi @may199128 ,

There can be several reasons for that:

  1. PacBio Direct Data (top track): Despite having reads shown in both the IGV and VCF files, the genotype (GT) is marked as "./.," and the genotype quality (GQ) is very low. This indel locus has higher coverage (141x) than the average coverage across the genome (128x)

The lower GQ can be caused by several issues. If this is a variant dense region, then the model can be less confident in correctly calling the variant. Given that it looks like a structural variant, have you considered using a structural variant caller like PBSV for these cases?

  1. For the PacBio Capture data (bottom track) has an average genome coverage of 2897x. Even after setting the mapping quality to >30, the read count at this indel locus is still 3000x, which is significantly higher than the read count indicated in the VCF file. Why is there such a discrepancy between the read counts in IGV and VCF?

The could be several reasons for seeing the difference: a. We perform a downsampling of reads on a given window to make sure the processing time is manageable. So, some reads can get discounted during that process. b. For a specific variant, reads with lower mapping quality will not be counted. c. For a specific variant, reads that contain that variant with lower base quality then set is discounted.

Please note that our pileup image height is 100, so we can only represent 100 reads per variant within an example. Having 3000x would mean the reads will be downsampled to a fraction that can be represented in the example.

Please let me know if you have any other questions.