Closed: ASLeonard closed this issue 3 years ago.
@ASLeonard ,
The issue we have had with trio-binned assembly is coverage. If you take each haplotype and polish it independently, the coverage becomes very difficult to handle. However, in this new pipeline, you should be able to use a phased bam (possibly diploid) and run PEPPER HP-DeepVariant (the last two modules) and get two VCFs that are haplotype specific. That way you are skipping the PEPPER SNP-Margin phasing and replacing it with trio binning. Let me know if you would be interested in that and I can give you commands to run.
Yeah, that would be great. I'm using the singularity image and have previously played around with running the individual modules so they fit nicely into our cluster queues, so I should be able to run the commands you suggest.
Thanks!
@ASLeonard ,
Sorry, I should have asked this before. Are you polishing a diploid genome?
Yes diploid ~2.7g mammal. The basic approach is triobinning ~120x ONT into 60x coverage per haplotype using parental short reads, then assembling each with shasta. That is why I'd prefer haploid polishing, as the two haplotypes are already "separated".
Ok, would it be possible for you to tag your haplotype-1 reads with HP:1 and your haplotype-2 reads with HP:2 in the bam? That way DeepVariant can utilize the full coverage and see the difference between the two haplotypes. You have to generate a tab-delimited file in this format:

```
read_id1	none
read_id2	H1
read_id3	H2
```

It has to be "read_id\t[none|H1|H2]", tab delimited.
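If trio binning left you with one read-ID list per haplotype, a table in that format can be built with awk. This is only a sketch: the file names hap1_ids.txt and hap2_ids.txt are hypothetical, and toy input is created here so the example is self-contained:

```shell
# Toy read-ID lists standing in for real trio-binning output
# (file names and read names are made up for illustration).
printf 'readA\nreadB\n' > hap1_ids.txt
printf 'readC\n'        > hap2_ids.txt

# Emit "read_id<TAB>H1" / "read_id<TAB>H2" rows into one haplotag table.
awk -v OFS='\t' '{ print $1, "H1" }' hap1_ids.txt  > haplotags.tsv
awk -v OFS='\t' '{ print $1, "H2" }' hap2_ids.txt >> haplotags.tsv

cat haplotags.tsv
```

Any reads left unbinned would get a `none` label the same way, if you choose to keep them.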
Then you can use the tagBam executable available from the docker/singularity image of PEPPER-DeepVariant:

```
docker run --ipc=host \
  -v /data:/data \
  kishwars/pepper_deepvariant:r0.4 \
  tagBam \
  IN_BAM_FILE \
  TAG_BED_FILE \
  OUT_BAM_FILE
```
So this approach would be using the entire read set mapped against the diploid genome, followed by haplotype tagging based on the binned read IDs? And then using this bam as the input for pepper_hp call_variants followed by run_deepvariant?
That is correct. In assembly polishing mode, PEPPER-DeepVariant is haplotype sensitive but uses all reads. That means PEPPER will pick candidates that strictly belong to haplotype-1 or haplotype-2. This is how it works:
1) PEPPER HP generates two candidate sets, HP1.vcf and HP2.vcf. HP1.vcf contains all of the candidates that PEPPER found from haplotype-1 (HP:1 + HP:0) reads and HP2.vcf contains all of the candidates from haplotype-2 (HP:2 + HP:0) reads.
2) DeepVariant HP1 will run on HP1.vcf; this mode sorts the reads so that HP:1 reads are on top, but it uses all reads in the pileup.
3) DeepVariant HP2 will run on HP2.vcf; this mode sorts the reads to have HP:2 reads on top and uses all reads.
This way you get information from both haplotypes while polishing; that's how we were able to achieve such high accuracy on the human genomes.
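The grouping in step 1 can be mimicked on a toy haplotag table (read names made up): untagged/none reads feed both haplotypes' candidate sets, while tagged reads inform only their own haplotype:

```shell
# Toy haplotag table; "none" plays the role of HP:0 reads.
printf 'read1\tH1\nread2\tH2\nread3\tnone\n' > tags.txt

# HP:1 + HP:0 reads inform HP1.vcf; HP:2 + HP:0 reads inform HP2.vcf.
awk '$2 == "H1" || $2 == "none" { print $1 > "hap1_reads.txt" }
     $2 == "H2" || $2 == "none" { print $1 > "hap2_reads.txt" }' tags.txt

cat hap1_reads.txt
```

Note how read3 (the `none` read) ends up in both output lists, which is why excluding HP:0 reads shrinks both call sets.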
In your setup, you should:
1) Align all of the reads to Shasta_hap1.fasta
2) Tag the reads with the haplotags you got from trio-binning
3) Run PEPPER HP and get HP1.vcf
4) Run DeepVariant HP1 and get Shasta_pepper_deepvariant_hap1.fasta

Then repeat:
1) Align all of the reads to Shasta_hap2.fasta
2) Tag the reads with the haplotags you got from trio-binning
3) Run PEPPER HP and get HP2.vcf
4) Run DeepVariant HP2 and get Shasta_pepper_deepvariant_hap2.fasta

You can further restrict your polishing by excluding HP:0 reads, but I'd suggest not to, as trio-binning will be limited to easy-to-map regions because of the short reads. Does this plan fit your polishing experiment?
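The two rounds above can be sketched as a loop. This is a dry run only: the commands are echoed rather than executed, and the aligner choice (minimap2) plus all file names are my assumptions, not prescribed in this thread:

```shell
#!/bin/sh
# Dry-run sketch of the per-haplotype polishing rounds described above.
# minimap2 and every file name here are assumptions for illustration.
dry_run() {
  for HAP in 1 2; do
    ASM="Shasta_hap${HAP}.fasta"
    echo "1) minimap2 -ax map-ont ${ASM} all_reads.fastq | samtools sort -o hap${HAP}.bam"
    echo "2) tagBam hap${HAP}.bam haplotags.tsv hap${HAP}.haplotagged.bam"
    echo "3) PEPPER HP on hap${HAP}.haplotagged.bam, producing HP${HAP}.vcf"
    echo "4) DeepVariant HP${HAP} with --hp_tag_for_assembly_polishing \"${HAP}\""
  done
}
dry_run
```

The point of the loop is that only the reference fasta and the haplotype number change between rounds; the read set and haplotag table stay the same.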
Thanks, I've tried giving that a go. Everything was working okay up to step 4, running deepvariant. This step seems to be running incredibly slowly compared to expectations.
I'm not sure if it is related to the proposed variants coming from pepper, but I'm seeing something like 50-100x slowdown compared to when I've run deepvariant itself on other data. Each logging interval in the make_examples step is taking around 70 seconds, compared to something like 1 second I've seen in deepvariant.
Perhaps an issue is the 120x diploid coverage, rather than using the 60x triobinned coverage. It doesn't necessarily explain why make_examples is running slower, but I've already spent more time on deepvariant make_examples alone (which was killed before finishing by cluster time limit) than the preprint suggests for 75x coverage ONT on 96 vCPU.
I believe the --hp_tag_for_assembly_polishing "1" flag is unique to the pepper version, so I can't explicitly compare with deepvariant. If you have any suggestions, I'll be happy to try.
Also before your suggestions, I had already got to the run_deepvariant step running only the binned data on the haplotype. This also was killed by time limit after 24 hours on 24 cpus, so it doesn't appear like any configuration with this data is likely to complete within the reported runtimes.
@ASLeonard ,
Assembly polishing is different than variant calling, so we changed DeepVariant accordingly to solve the problem better. The --hp_tag_for_assembly_polishing "1" flag sorts the images based on their haplotype so you can use all reads while calling on one haplotype. For the other haplotype (HP:2), it'll be --hp_tag_for_assembly_polishing "2".
One of the things you can quickly try is to increase partition_size in the make_examples args to 10kb, making it look like --make_examples_extra_args=partition_size=10000,.... Right now it is set to 1kb, so the process reads the sequencing data many, many times. We have found this issue internally and will hopefully fix it in the next release. Please don't set it too high, though, or you will end up having chunks that take much longer to finish.
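To put numbers on why the partition size matters at this scale: for a ~2.7 Gb genome, 1 kb partitions mean roughly 2.7 million chunks to schedule and re-read, versus roughly 270 thousand at 10 kb. A quick check of that arithmetic:

```shell
# Approximate chunk counts for a 2.7 Gb genome at 1 kb vs 10 kb partitions.
awk 'BEGIN { g = 2.7e9; print g / 1000, g / 10000 }'
# -> 2700000 270000
```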
Also, if you could kindly provide me some information, that'd be very helpful:
1) Which assembler did you use to assemble the genome?
2) How many contigs did you get per haplotype? If collapsed then how many total contigs?
3) How does the haplotype separation look: how many reads are in HP1, HP2, HP0 groups?
4) Did you do the trio-binning, assembly, lift-over HP tags method? If yes, can you confirm the bams are now tagged? A simple samtools view should be able to verify this.
5) How many candidates are reported by PEPPER? You can find it with bcftools stats; if not, a simple zcat PEPPER_HP.vcf.gz | grep -v '#' | wc -l would do.
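Both checks in points 4 and 5 can be dry-run on made-up data. The SAM records and VCF below are toys; in practice you would stream samtools view in.bam and use the real PEPPER_HP.vcf.gz:

```shell
# 4) Tally HP tags from a (toy) SAM stream; records here are invented.
printf 'read1\t0\tctg1\t100\t60\t50M\t*\t0\t0\t*\t*\tHP:i:1\n' >  toy.sam
printf 'read2\t0\tctg1\t200\t60\t50M\t*\t0\t0\t*\t*\tHP:i:2\n' >> toy.sam
grep -o 'HP:i:[0-9]' toy.sam | sort | uniq -c

# 5) Count candidate records in a (toy) gzipped VCF, header lines excluded.
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\nctg1\t101\t.\tA\tT\nctg1\t202\t.\tAT\tA\n' \
  | gzip > toy.vcf.gz
zcat toy.vcf.gz | grep -v '#' | wc -l
```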
I gave the larger partition_size a try, the initial results look like about 40s per interval, so still over an order of magnitude slower than the worst deepvariant performance I've seen.
For the other information, this particular assembly is [...] -F 0x904 [...] HP:i:1 and HP:i:2 tags. Less than 0.02% of reads (accounting for 0.001% of the bases) could not be binned, so I didn't align these reads and thus there are no HP0 reads.

```
SN	0	number of records:			15316066
SN	0	number of no-ALTs:			0
SN	0	number of SNPs:				1704847
SN	0	number of MNPs:				0
SN	0	number of indels:			13670957
SN	0	number of others:			0
SN	0	number of multiallelic sites:		631262
SN	0	number of multiallelic SNP sites:	10757
```
@ASLeonard ,
Thank you! The number of INDELs looks very high; 13670957 would mean you are starting from an assembly that is about Q26/Q27. What is the basecaller version that you are using?
The merqury QV of the unpolished assembly is Q28, so pretty bang on your estimate! I believe the data is between guppy 3-3.5.
Ok, thanks. That is a bit lower quality than the genome assemblies we generally start with. The best assemblies come from guppy v3.6.0 or higher, as it resolves many known systematic issues with ONT and provides decent assembly accuracy.
I may have to tune the parameters for you to reduce the total number of variants reported by PEPPER; that is what causes the delay in your polishing. The make_examples step alone would take about 3 days with 28 CPUs. Do you have additional resources that you can use? Otherwise, I can give you a modified version with parameters tuned to lower the number of proposed candidates, but I'd worry about the downstream effect of such a change.
Just for reference: your reads are somewhere around the green distribution here, so they have many more errors than the reads we trained our models on. So PEPPER is picking up a lot of error candidates, which is causing things to be slower. We expect things to be around the purple distribution.
@ASLeonard as the issue was correctly identified with your polishing, I am going to close this issue for now. Please feel free to re-open if you face any other issues with the new data.
Hi, I previously used pepper v0.1 to polish trio-binned ONT assemblies, where each haplotype is assembled and polished effectively as haploid from binned reads. Is there any possible approach to polishing trio-binned assemblies, now that haploid polishing is explicitly discontinued in the P-M-DV pipeline? I guess some possibilities would be
None of these feel ideal, so perhaps the latest release of this pipeline isn't well suited to such haplotype-resolved assemblies. However, with the significant speed improvements and likely accuracy gains, this latest pipeline definitely seems like a superior approach.