vibansal / HapCUT2

software tools for haplotype assembly from sequence data
BSD 2-Clause "Simplified" License
206 stars 36 forks source link

hapCUT2 running for excessive amount of time #73

Closed mdem7705 closed 5 years ago

mdem7705 commented 5 years ago

Greetings hapCUT2 team,

I am having an issue running hapCUT2 where the program runs forever and never completes - I don't ever see an error message besides running out of walltime (I have had this running for longer than 4 straight days).

My reference genome and bam file are both from pacbio reads. My reference genome has been haplophased using haploMerger2, and my pacbio reads bam file has been aligned to my reference with minimap2 and sorted with samtools. My vcf file is generated from illumina reads on the same organism with variants called using gatk's haplotypeCaller.

Here is the code I am trying to run:

extractHAIRS --pacbio 1 --ref reference_pacbio_HM2.fa --bam pacbio_mm2_aligned_sorted.bam  --VCF illumina_gatk_variants.vcf --out fragment_file

HAPCUT2 --ea 1 --fragments fragment_file --vcf illumina_gatk_variants.vcf --output haplotype_hc2.phased.vcf

I suspect there might be some incompatibility with my files causing this program to run forever and never complete, but I don't get an error message so I don't know how to diagnose this. Any help would be appreciated!

Cheers!

vibansal commented 5 years ago

I would recommend to test the program on one contig/chromosome to see if runs to completion.

mdem7705 commented 5 years ago

Hey there,

I tried it with my first contig, about 6 Mbp long. I allowed it 3 hours to run, and it never completed. The only meaningful error message I have again is killed due to excessive walltime, similar to my job that ran for 90+ hours. The fragment file is also not empty. Please let me know if I can provide anything to help!

vibansal commented 5 years ago

The '--ea' option may be causing the long runtime. Please run without this option.

mdem7705 commented 5 years ago

Just tried it. It still runs for over 3h on a single contig.

vibansal commented 5 years ago

If you can share the input data for a single contig, I can take a look. My email is vibansal AT ucsd.edu

mdem7705 commented 5 years ago

I have sent you a link to a dropbox folder with the requested data.

vibansal commented 5 years ago

I have looked at the data and the output files. I was also able to successfully run HapCUT2 on a small subset of the data. It appears that your data has a very high density of heterozygous variants. HapCUT2 will run on such datasets but it will be slow since the complexity of HapCUT2 scales quadratically with the number of variants per read. You can reduce the number of iterations (--converge option) to 3 or 4. In my estimate, it should finish within 6 hours on the single contig.

Thanks for sharing your data. We will think about implementing a solution to speed up hapCUT2 on datasets with a very high density of variants.

mdem7705 commented 5 years ago

Another question considering that I have 130+ contigs to get through - if I split the genome up by contig, run hapCUT2 separately on each one, and then combine the results at the end, will that affect the overall result?

vibansal commented 5 years ago

No. It is recommended to run the program independently on each contig/chromosome.