kishwarshafin / pepper

PEPPER-Margin-DeepVariant
MIT License

Polishing trio-binned assemblies with PEPPER-Margin-DeepVariant #52

Closed ASLeonard closed 3 years ago

ASLeonard commented 3 years ago

Hi, I previously used pepper v0.1 to polish trio-binned ONT assemblies, where each haplotype is assembled and polished effectively as haploid from binned reads. Is there any possible approach to polishing trio-binned assemblies, now that haploid polishing is explicitly discontinued in the P-M-DV pipeline? I guess some possibilities would be:

None of these feel ideal, so perhaps the latest release of this pipeline isn't well suited to such haplotype-resolved assemblies. However, with the significant improvements in speed, and probably accuracy as well, this latest pipeline definitely seems like a superior approach.

kishwarshafin commented 3 years ago

@ASLeonard ,

The issue we have had with trio-binned assemblies is coverage. If you take each haplotype and polish it independently, the coverage becomes really difficult to work with. However, in this new pipeline, you should be able to use a phased bam (possibly diploid) and run PEPPER HP-DeepVariant (the last two modules) and get two VCFs that are haplotype specific. That way you skip the PEPPER SNP-Margin phasing and replace it with trio binning. Let me know if you would be interested in that and I can give you the commands to run.

ASLeonard commented 3 years ago

Yeah, that would be great. I'm using the singularity image and have previously played around with running the individual modules so they fit nicely into our cluster queues, so I should be able to run the commands you suggest.

Thanks!

kishwarshafin commented 3 years ago

@ASLeonard ,

Sorry, I should have asked this before. Are you polishing a diploid genome?

ASLeonard commented 3 years ago

Yes, a diploid ~2.7 Gb mammal. The basic approach is trio-binning ~120x ONT coverage into ~60x per haplotype using parental short reads, then assembling each with shasta. That is why I'd prefer haploid polishing, as the two haplotypes are already "separated".

kishwarshafin commented 3 years ago

Ok, would it be possible for you to tag your haplotype-1 reads with the HP:1 tag and haplotype-2 reads with the HP:2 tag in the bam? That way DeepVariant can utilize the full coverage and see the difference between the two haplotypes. You have to generate a tab-delimited file in this format:

read_id1   none
read_id2   H1
read_id3   H2

It has to be "read_id\t[none|H1|H2]", tab delimited.
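For example, a minimal sketch of building that file from the read-ID lists you already have from trio-binning (the file names here are assumptions):

# one line per read: read_id <tab> haplotype label
awk 'BEGIN{OFS="\t"} {print $1, "H1"}' hap1_read_ids.txt > read_haplotags.tsv
awk 'BEGIN{OFS="\t"} {print $1, "H2"}' hap2_read_ids.txt >> read_haplotags.tsv
# unbinned reads, if you align them, get the "none" label
awk 'BEGIN{OFS="\t"} {print $1, "none"}' unbinned_read_ids.txt >> read_haplotags.tsv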

Then you can use the tagBam executable available from the docker/singularity of PEPPER-DeepVariant.

docker run --ipc=host \
-v /data:/data \
kishwars/pepper_deepvariant:r0.4 \
tagBam \
IN_BAM_FILE \
TAG_BED_FILE \
OUT_BAM_FILE
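
Since you mentioned you are using the singularity image, a roughly equivalent invocation should be (an untested sketch; adjust the bind path and image source to your setup):

singularity exec --bind /data:/data \
docker://kishwars/pepper_deepvariant:r0.4 \
tagBam \
IN_BAM_FILE \
TAG_BED_FILE \
OUT_BAM_FILE

If you have already pulled the image to a local .sif file, point singularity at that file instead of the docker:// URI.
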
ASLeonard commented 3 years ago

So this approach would be mapping the entire read set against the diploid genome, followed by haplotype tagging based on the binned read IDs? And then using this bam as the input for pepper_hp call_variants followed by run_deepvariant?

kishwarshafin commented 3 years ago

That is correct. In assembly polishing mode, PEPPER-DeepVariant is haplotype sensitive but uses all reads. That means PEPPER will pick candidates that strictly belong to haplotype-1 or haplotype-2. This is how it works:

1) PEPPER HP generates two candidate sets, HP1.vcf and HP2.vcf. HP1.vcf contains all of the candidates that PEPPER found from (HP:1 + HP:0) reads, and HP2.vcf contains all of the candidates from haplotype-2 (HP:2 + HP:0) reads.
2) DeepVariant HP1 runs on HP1.vcf; this mode sorts the reads so that HP:1 reads are on top, but it uses all reads in the pileup.
3) DeepVariant HP2 runs on HP2.vcf; this mode sorts the reads so that HP:2 reads are on top, and it also uses all reads.

This way you get information from both haplotypes while polishing; that's how we were able to achieve such high accuracy on the human genomes.

In your setup, you should: 1) Align all of the reads to Shasta_hap1.fasta 2) Tag the reads with the haplotags you got from trio-binning 3) Run PEPPER HP and get the HP1.vcf 4) Run DeepVariant HP1 and get Shasta_pepper_deepvariant_hap1.fasta (a command sketch for haplotype 1 is at the end of this comment).

Then repeat for the other haplotype: 1) Align all of the reads to Shasta_hap2.fasta 2) Tag the reads with the haplotags you got from trio-binning 3) Run PEPPER HP and get the HP2.vcf 4) Run DeepVariant HP2 and get Shasta_pepper_deepvariant_hap2.fasta

You can further restrict your polishing by excluding HP:0 reads, but I'd suggest not to, as trio-binning will be limited to easy-to-map regions because of the short reads. Does this plan fit your polishing experiment?
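A rough sketch of steps 1-2 for haplotype 1 with standard tools (read and output file names are assumptions, and files need to sit under the mounted /data for the docker run; the PEPPER HP and DeepVariant HP1 steps then follow as discussed above):

# align all reads to the haplotype-1 assembly
minimap2 -ax map-ont /data/Shasta_hap1.fasta /data/all_reads.fastq.gz | samtools sort -@ 8 -o /data/all_reads_to_hap1.bam
samtools index /data/all_reads_to_hap1.bam

# tag reads with the trio-binning haplotags (the tab-delimited file from above)
docker run --ipc=host -v /data:/data kishwars/pepper_deepvariant:r0.4 \
tagBam /data/all_reads_to_hap1.bam /data/read_haplotags.tsv /data/all_reads_to_hap1.haplotagged.bam
samtools index /data/all_reads_to_hap1.haplotagged.bam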

ASLeonard commented 3 years ago

Thanks, I've tried giving that a go. Everything was working okay up to step 4, running deepvariant. That step seems to be running incredibly slowly compared to expectations.

I'm not sure if it is related to the proposed variants coming from pepper, but I'm seeing something like a 50-100x slowdown compared to when I've run deepvariant itself on other data. Each logging interval in the make_examples step is taking around 70 seconds, compared to the roughly 1 second I've seen in deepvariant.

Perhaps an issue is the 120x diploid coverage, rather than using the 60x trio-binned coverage. That doesn't necessarily explain why make_examples is running slower, but I've already spent more time on deepvariant make_examples alone (which was killed before finishing by the cluster time limit) than the preprint suggests for 75x ONT coverage on 96 vCPUs.

I believe the --hp_tag_for_assembly_polishing "1" flag is unique to the pepper version, so I can't explicitly compare with stock deepvariant. If you have any suggestions, I'll be happy to try them.

ASLeonard commented 3 years ago

Also, before your suggestions, I had already gotten to the run_deepvariant step using only the binned data for this haplotype. That run was also killed by the time limit after 24 hours on 24 CPUs, so it doesn't appear that any configuration with this data is likely to complete within the reported runtimes.

kishwarshafin commented 3 years ago

@ASLeonard ,

Assembly polishing is different from variant calling, so we changed DeepVariant accordingly to solve the problem better. The --hp_tag_for_assembly_polishing "1" flag sorts the images based on their haplotype so you can use all reads while calling on one haplotype. For the other haplotype (HP:2), it'll be --hp_tag_for_assembly_polishing "2".

One thing you can quickly try is to increase partition_size in the make_examples args to 10kb, making it look like: --make_examples_extra_args=partition_size=10000,.... Right now it is set to 1kb, so the process reads the sequencing data many, many times. We have found this issue internally and will hopefully fix it in the next release. Please don't set it too high, though, or you will end up with chunks that take much longer to finish.
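If you are invoking run_deepvariant directly, the flag would be passed roughly like this (a sketch with placeholder file names; keep your existing model and --hp_tag_for_assembly_polishing arguments as they are, they are omitted here):

run_deepvariant \
--ref=/data/Shasta_hap1.fasta \
--reads=/data/all_reads_to_hap1.haplotagged.bam \
--output_vcf=/data/hap1_pepper_deepvariant.vcf.gz \
--num_shards=24 \
--make_examples_extra_args="partition_size=10000"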

Also, if you could kindly provide some information, that'd be very helpful:

1) Which assembler did you use to assemble the genome?
2) How many contigs did you get per haplotype? If collapsed, then how many total contigs?
3) How does the haplotype separation look: how many reads are in the HP1, HP2, and HP0 groups?
4) Did you do the trio-binning, assembly, lift-over-HP-tags method? If yes, can you confirm the bams are now tagged? A simple samtools view should be able to verify this.
5) How many candidates are reported by PEPPER? You can find this with bcftools stats; if not, a simple zcat PEPPER_HP.vcf.gz | grep -v '#' | wc -l would do. (Quick checks for 4) and 5) are sketched below.)
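For reference, quick ways to check 4) and 5) might look like this (the BAM and VCF names are assumptions):

# confirm the HP tags made it into the BAM
samtools view /data/all_reads_to_hap1.haplotagged.bam | grep -o 'HP:i:[12]' | sort | uniq -c

# count the candidates proposed by PEPPER
bcftools stats PEPPER_HP.vcf.gz | grep 'number of records'
zcat PEPPER_HP.vcf.gz | grep -v '#' | wc -l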

ASLeonard commented 3 years ago

I gave the larger partition_size a try; the initial results look like about 40 s per interval, so still over an order of magnitude slower than the worst deepvariant performance I've seen.

For the other information, this particular assembly is:

  1. Canu v2.1
  2. 892 contigs for this haplotype
  3. Pretty even: there were 6,841,607 HP1 and 7,692,837 HP2 reads, just slightly fewer (<0.01%) than the haplotype-binned read counts, meaning almost all reads mapped and were not filtered by the recommended -F 0x904
  4. Yes, this is trio-bin, assemble, align for bam, tagBam using binned reads. The bam file correctly contains HP:i:1 and HP:i:2 tags. Less than 0.02% of reads (accounting for 0.001% of the bases) could not be binned, so I didn't align these reads and thus there are no HP0 reads.
  5. Candidates are below
    SN  0   number of records:  15316066
    SN  0   number of no-ALTs:  0
    SN  0   number of SNPs: 1704847
    SN  0   number of MNPs: 0
    SN  0   number of indels:   13670957
    SN  0   number of others:   0
    SN  0   number of multiallelic sites:   631262
    SN  0   number of multiallelic SNP sites:   10757

kishwarshafin commented 3 years ago

@ASLeonard ,

Thank you! The number of INDELs looks very high. 13670957 would mean you are starting from an assembly that is about Q26/Q27. Which basecaller version are you using?

ASLeonard commented 3 years ago

The merqury QV of the unpolished assembly is Q28, so pretty much bang on your estimate! I believe the data was basecalled with guppy somewhere between versions 3 and 3.5.

kishwarshafin commented 3 years ago

Ok, thanks, that is a bit lower quality than the genome assemblies we generally start with. The best assemblies come from guppy v3.6.0 or higher, as it resolves many known systematic ONT issues and provides decent assembly accuracy.

I may have to tune the parameters for you to reduce the total number of variants reported by PEPPER; that is what is causing the delay in your polishing. The make_examples step itself would take about 3 days with 28 CPUs. Do you have additional resources that you can use? Otherwise, I can give you a modified version with parameters tuned to lower the number of proposed candidates, but I'd worry about the downstream effects of such a change.

kishwarshafin commented 3 years ago

Just for reference: your reads fall somewhere around the green distribution here, so they have many more errors than the reads we trained our models on. PEPPER is therefore picking up a lot of error candidates, which is causing things to be slower. We expect things to be around the purple distribution.

[screenshot: read error-rate distributions compared to the model's training data]

kishwarshafin commented 3 years ago

@ASLeonard, as the issue with your polishing was correctly identified, I am going to close this issue for now. Please feel free to re-open if you face any other issues with the new data.