lh3 / dipcall

Reference-based variant calling pipeline for a pair of phased haplotype assemblies
MIT License
94 stars 9 forks source link

Interpretation of empty prefix.dip.vcf.gz #1

Closed jelber2 closed 3 years ago

jelber2 commented 4 years ago

The final output of dipcall includes two files: prefix.dip.vcf.gz and prefix.dip.bed. A raw variant call is made in the VCF if a non-reference allele is observed in any alignments >=50kb and mapped with mapping quality >=5.

So if prefix.dip.vcf.gz is empty (has no variants), then given the example below,

/genetics/elbers/dipcall.kit/run-dipcall -t 75 prefix miniasm.corr.ratatosk.035.fasta \
GCA_003401745.1_ASM340174v1_genomic.fna_upper.diploid.haplotype0.fasta.gz \
GCA_003401745.1_ASM340174v1_genomic.fna_upper.diploid.haplotype1.fasta.gz > miniasm.corr.ratatosk.035.mak

miniasm.corr.ratatosk.035.fasta does not have non-reference alleles in the high confidences areas of prefix.dip.bed?

lh3 commented 4 years ago

What is the fraction of the reference genome that dip.bed covers?

jelber2 commented 4 years ago

About 70%

lh3 commented 4 years ago

70% is not very good, but not too bad. It still seems weird that there are no variant calls. It is almost impossible. You can use paftools.js call. For a single genome, it should be very close to dipcall. It goes through fewer steps, so less likely to have complex issues.

jelber2 commented 4 years ago

I simulated two haplotypes (see https://github.com/DecodeGenetics/Ratatosk/issues/12#issuecomment-691619368) then simulated 30x PacBio reads and 60x Illumina reads and corrected the PacBio reads with —k1 31 with Ratatosk then corrected those PacBio reads with again with —k1 41. I then used Miniasm but only assembled with overlaps with dv:f:0.035 (<0.035). That is what miniasm.corr.ratatosk.035.fasta ref is.

jelber2 commented 4 years ago

I could provide the three input files for dipcall tomorrow if you so desire.

Edit: links

Note: bgzipped (not gzipped) https://www.dropbox.com/s/pfu21zeeud4w60p/miniasm.corr.ratatosk.035.fasta.gz https://www.dropbox.com/s/pvqpwvusqzdt222/GCA_003401745.1_ASM340174v1_genomic.fna_upper.diploid.haplotype0.fasta.gz https://www.dropbox.com/s/6w6bycompustx6j/GCA_003401745.1_ASM340174v1_genomic.fna_upper.diploid.haplotype1.fasta.gz

jelber2 commented 3 years ago

I am guessing I am doing something wrong with dipcall as another assembly with over 90% of reference genome covered by the dip.bed file has an empty dip.vcf file? Any help would be appreciated.

lh3 commented 3 years ago

Try on real data first.

jelber2 commented 3 years ago

Yes, as these are simulated Paternal and Maternal haplotypes, perhaps that is not really appropriate, and real Paternal and Maternal haplotypes are needed.

lh3 commented 3 years ago

For typical real assemblies, dipcall has been reasonably stable. Multiple groups are using the pipeline and they haven't reported significant issues.

ScottMastro commented 3 years ago

I had the same issue with an empty dip.vcf.gz as well (header is written but no variants show up in the file). I dug a bit into the vcfpair command within the dipcall-aux.js script that creates the dip.vcf.gz file.

On line 71: var re_ctg = is_male? /^(chr)?([0-9]+|X|Y)$/ : /^(chr)?([0-9]+|X)$/;

and on line 99: if (!re_ctg.test(t[0])) continue;

It seems like this line is here to account for the sex chromosome but it has the unintended(?) effect of excluding chromosomes with non-standard names. For instance, I was analyzing a region of chr5 and had renamed the chromosome to chr5_start_end, which then caused all variants to be skipped in the generation of the dip.vcf.gz file. Removing line 99 corrects this issue for me. I have a feeling the original issue opened here is also related to this.

Just wanted to share in the hopes that helps someone else

lh3 commented 3 years ago

Thank you! I will consider to add an option to disable this filtering. It is there to prevent outputting those _random contigs.

jelber2 commented 3 years ago

Oh cool! Deleting line 99 of dipcall-aux.js makes dipcall work for me!

Note that one must index the reference with samtools faidx before, and the reference can't be compressed with bgzip because dipcall.kit/htsbox pileup runs but doesn't produce an error.

Also, I think dipcall has the potential to be integrated with QUAST for diploid genomes as the current implementation of QUAST doesn't seem to handle the possibility for the compared mismatches and indels that are actually from the other haplotype.

lh3 commented 3 years ago

QUAST calls substitutions and indels between copies in CNVs and around misassemblies. It is not a good way to evaluate base accuracy in general. Dipcall and paftools.js call (for collapsed assemblies) are more careful. They usually give lower substitution/indel rate.

jelber2 commented 3 years ago

Okay, thanks. It might be interesting if QUAST developers would replace that workflow with dipcall and/or paftools.js call.