Closed jelber2 closed 4 years ago
What is the fraction of the reference genome that dip.bed covers?
About 70%
70% is not very good, but not too bad. It still seems weird that there are no variant calls. It is almost impossible. You can use paftools.js call
. For a single genome, it should be very close to dipcall. It goes through fewer steps, so less likely to have complex issues.
I simulated two haplotypes (see https://github.com/DecodeGenetics/Ratatosk/issues/12#issuecomment-691619368) then simulated 30x PacBio reads and 60x Illumina reads and corrected the PacBio reads with —k1 31 with Ratatosk then corrected those PacBio reads with again with —k1 41. I then used Miniasm but only assembled with overlaps with dv:f:0.035 (<0.035). That is what miniasm.corr.ratatosk.035.fasta ref is.
I could provide the three input files for dipcall tomorrow if you so desire.
Note: bgzipped (not gzipped) https://www.dropbox.com/s/pfu21zeeud4w60p/miniasm.corr.ratatosk.035.fasta.gz https://www.dropbox.com/s/pvqpwvusqzdt222/GCA_003401745.1_ASM340174v1_genomic.fna_upper.diploid.haplotype0.fasta.gz https://www.dropbox.com/s/6w6bycompustx6j/GCA_003401745.1_ASM340174v1_genomic.fna_upper.diploid.haplotype1.fasta.gz
I am guessing I am doing something wrong with dipcall as another assembly with over 90% of reference genome covered by the dip.bed file has an empty dip.vcf file? Any help would be appreciated.
Try on real data first.
Yes, as these are simulated Paternal and Maternal haplotypes, perhaps that is not really appropriate, and real Paternal and Maternal haplotypes are needed.
For typical real assemblies, dipcall has been reasonably stable. Multiple groups are using the pipeline and they haven't reported significant issues.
I had the same issue with an empty dip.vcf.gz as well (header is written but no variants show up in the file). I dug a bit into the vcfpair command within the dipcall-aux.js script that creates the dip.vcf.gz file.
On line 71:
var re_ctg = is_male? /^(chr)?([0-9]+|X|Y)$/ : /^(chr)?([0-9]+|X)$/;
and on line 99:
if (!re_ctg.test(t[0])) continue;
It seems like this line is here to account for the sex chromosome but it has the unintended(?) effect of excluding chromosomes with non-standard names. For instance, I was analyzing a region of chr5 and had renamed the chromosome to chr5_start_end, which then caused all variants to be skipped in the generation of the dip.vcf.gz file. Removing line 99 corrects this issue for me. I have a feeling the original issue opened here is also related to this.
Just wanted to share in the hopes that helps someone else
Thank you! I will consider to add an option to disable this filtering. It is there to prevent outputting those _random
contigs.
Oh cool! Deleting line 99 of dipcall-aux.js
makes dipcall
work for me!
Note that one must index the reference with samtools faidx
before, and the reference can't be compressed with bgzip because dipcall.kit/htsbox pileup
runs but doesn't produce an error.
Also, I think dipcall
has the potential to be integrated with QUAST
for diploid genomes as the current implementation of QUAST
doesn't seem to handle the possibility for the compared mismatches and indels that are actually from the other haplotype.
QUAST calls substitutions and indels between copies in CNVs and around misassemblies. It is not a good way to evaluate base accuracy in general. Dipcall and paftools.js call (for collapsed assemblies) are more careful. They usually give lower substitution/indel rate.
Okay, thanks. It might be interesting if QUAST
developers would replace that workflow with dipcall
and/or paftools.js call
.
The final output of dipcall includes two files: prefix.dip.vcf.gz and prefix.dip.bed. A raw variant call is made in the VCF if a non-reference allele is observed in any alignments >=50kb and mapped with mapping quality >=5.
So if
prefix.dip.vcf.gz
is empty (has no variants), then given the example below,miniasm.corr.ratatosk.035.fasta
does not have non-reference alleles in the high confidences areas ofprefix.dip.bed
?