Empty output "Insufficiently many confident reads for aggregating across runs"

rwdavies / QUILT

GNU General Public License v3.0


Empty output "Insufficiently many confident reads for aggregating across runs" #27

Open marromesc opened 4 months ago

marromesc commented 4 months ago

Hi,

This is my first time using QUILT. I am trying to impute genotypes at the positions of ~300 SNVs.

I preprocessed the BAM files by aligning to GRCh38 with BWA, marking and removing PCR duplicates with Picard, and filtering out reads with MAPQ < 37 with samtools.

It works well for most of the SNVs, but there's one specific location that gives me an empty output.

With buffer=1000 it throws the following message: "Insufficiently many confident reads for aggregating across runs"

With buffer=500000 it still gives me an empty output, but throws a different message: "There are 5 out of 18 regions that have been flipped by consensus"

1) What do these error messages mean? 2) What criteria should I use to pick an appropriate buffer size? 3) Is it good practice to remove duplicates and low-quality reads before running QUILT?

Thank you.

Best, Maria

Zilong-Li commented 4 months ago

Hi,

I think QUILT takes BAM files as they are and handles them itself. Normally you don't need to preprocess the BAM files, and in particular there's no need for quality-score filtering. The message appears because too few reads are left after your preprocessing. Increasing the buffer won't help with imputing regions that have no reads. A buffer size of 250000 is big enough for most regions.
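To make the point about the buffer concrete, here is a minimal Python sketch. It assumes the buffer simply extends the imputed interval on both sides, from regionStart - buffer to regionEnd + buffer. The function names, the read coordinates, and the clamping of the start to position 1 are all hypothetical and are not taken from QUILT's code. It only illustrates that a larger buffer pulls in more flanking reads without changing how many reads overlap the target site itself.

```python
def imputation_window(region_start, region_end, buffer):
    """Interval the imputation would run over, assuming the buffer is
    added on both sides (clamping to 1 is this sketch's own choice)."""
    return max(1, region_start - buffer), region_end + buffer


def reads_overlapping(reads, start, end):
    """Reads given as (read_start, read_end) that overlap [start, end]."""
    return [r for r in reads if r[1] >= start and r[0] <= end]


# Hypothetical reads: one covers the target site, the others are far away.
reads = [(999_950, 1_000_100), (1_200_000, 1_200_150), (1_450_000, 1_450_150)]
site = 1_000_000

for buffer in (1_000, 500_000):
    start, end = imputation_window(site, site, buffer)
    in_window = len(reads_overlapping(reads, start, end))
    at_site = len(reads_overlapping(reads, site, site))
    print(f"buffer={buffer}: {in_window} reads in window, {at_site} at site")
```

With these made-up coordinates, the larger buffer brings three reads into the window instead of one, but the number of reads at the site stays at one either way.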

marromesc commented 4 months ago

Hi,

I checked the BAM file in IGV and indeed it has only one read covering that region, but I guess that is to be expected from shallow sequencing data. Is there a minimum number of reads needed to impute a genotype?

Thanks for your answer, your clarifications about bam files preprocessing and buffer parameter are very useful.

Best, Maria

rwdavies commented 4 months ago

The messages from your first post relate to how QUILT tries to do phasing.

Basically, it tries multiple starts (normally 7) and gets read assignments from each of them. Then, on the final phasing round, it tries to build a best set from them and proceeds with that. That's what the "There are 5 out of 18 regions that have been flipped by consensus" message means: the consensus process is trying to come up with a best read phasing. Similarly, "Insufficiently many confident reads for aggregating across runs" means it can't do this process, as there are too few "confident" reads (reads that map confidently to one or the other haplotype). I wouldn't consider either of these to be error messages. The program should still run; they are just meant to be informative.
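As a rough illustration of the idea described above, here is a toy Python sketch. It is not QUILT's implementation. The names `confident_label` and `consensus_phasing`, the 0.9 confidence threshold, the minimum of 3 confident reads, and the choice to flip whole runs rather than regions are all assumptions made for this example. Each run gives, per read, a probability of belonging to haplotype 1. Because haplotype labels are arbitrary between runs, each run is flipped if that agrees better with the first run, and then a majority vote is taken over reads that are confident in every run.

```python
def confident_label(p_hap1, threshold=0.9):
    """Return 1 or 0 if the read is confidently assigned, else None."""
    if p_hap1 >= threshold:
        return 1
    if p_hap1 <= 1 - threshold:
        return 0
    return None


def consensus_phasing(runs, threshold=0.9, min_confident_reads=3):
    """runs: one list per start, holding P(read is on haplotype 1) per read.
    Returns (consensus, n_flipped), or None if too few confident reads."""
    labelled = [[confident_label(p, threshold) for p in run] for run in runs]
    # Keep only reads that are confidently assigned in every run.
    keep = [i for i in range(len(runs[0]))
            if all(run[i] is not None for run in labelled)]
    if len(keep) < min_confident_reads:
        return None  # analogous to "insufficiently many confident reads"
    reference = [labelled[0][i] for i in keep]
    aligned, n_flipped = [], 0
    for run in labelled:
        labels = [run[i] for i in keep]
        agree = sum(a == b for a, b in zip(reference, labels))
        if agree < len(labels) - agree:
            # Labels are swapped relative to the first run: flip them.
            labels = [1 - x for x in labels]
            n_flipped += 1
        aligned.append(labels)
    consensus = {}
    for pos, read_idx in enumerate(keep):
        votes = sum(labels[pos] for labels in aligned)
        consensus[read_idx] = 1 if 2 * votes > len(aligned) else 0
    return consensus, n_flipped


runs = [
    [0.95, 0.02, 0.99, 0.50],
    [0.03, 0.97, 0.01, 0.60],  # same phasing, haplotype labels swapped
    [0.96, 0.05, 0.98, 0.40],
]
print(consensus_phasing(runs))
print(consensus_phasing([[0.5, 0.95], [0.4, 0.95]]))
```

In the first call, the fourth read is never confident and is dropped, and the second run is flipped to match the others. In the second call, only one read is confident, so the sketch gives up and returns None.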

I wouldn't say there's a minimum number of reads or a minimum depth needed to impute a sample. With some mouse samples I've seen excellent results at less than 0.1X. It really depends on how related the samples are and how long the LD blocks are.
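For a sense of what 0.1X means at a single site, here is a small Python sketch. It assumes reads land uniformly at random, so the number of reads covering a site is roughly Poisson with mean equal to the depth. That model and the function names are assumptions for illustration only, and real coverage is usually less even than this.

```python
import math


def prob_reads_at_site(depth, k):
    """Poisson probability that exactly k reads cover a site at mean depth."""
    return math.exp(-depth) * depth ** k / math.factorial(k)


def prob_at_most(depth, k_max):
    """Probability that at most k_max reads cover a site."""
    return sum(prob_reads_at_site(depth, k) for k in range(k_max + 1))


depth = 0.1
print(f"P(0 reads at a site)  = {prob_reads_at_site(depth, 0):.3f}")
print(f"P(<=1 read at a site) = {prob_at_most(depth, 1):.3f}")
```

Under this model, about 90% of sites have no read at all at 0.1X, so seeing zero or one read at a given SNV is the normal case. The information comes from reads across the surrounding region and the reference haplotypes.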

Hope that helps, Robbie