Memory issue on phaser - Githubissues

everestial commented 6 years ago

I have realized that phaser seems to take so much of a ram memory. Of several samples I tried there were two samples where phaser really didn't go through even in 2 days after several tries. I set number of threads to 1. But, still no luck.

I was using a desktop with 16 gb memory and 20 gb swap and that was all taken up.
I am working with A. lyrata genome, vcf and bam files which is relatively small compared to human genome, so with human genome the problem should be even bigger.

Is there a way to fix memory handling in phaser ?

secastel commented 6 years ago

One issue I've seen that causes phASER to take up a lot of memory is if there are many chromosomes. This isn't a problem with Human, but for species that don't have well assembled genomes, and this have many contigs, it drives memory usage up. How many contigs do you have? I'd suggest to try running it per chromosome using the "--chr" argument and then aggregating the results.

everestial commented 6 years ago

Hmm, well I once thought so. Genome of the organism I am working with has main chromosomes but several other scaffolds, but overall genome size isn't big as humans.

Well, it is surprising to see that number of scaffold bloats the memory - why so? since the bam and VCF are going to be the same, rite. Is the program making copies of BAM and VCF as many as the number of scaffolds?
I thought about doing this in chromosome by chromosome mode, but it would require knowledge of the name of available chromosome. Plus, extra work of bash scripting.
So, I wrote some adding extra code on phaser.py. The purpose of the code is to find 1) unique chromosome/contig names in the input VCF file 2) split the vcf by chromosome and then pipe it one by one. To facilitate the process I also added an extra argument variable i.e process_slow.
So, in the process_slow mode the output will be produce all the files as is with an extra suffix (output prefix + contig name).
I have verified the code changes and results with your original phaser.py as phaserOrg.py and the updated phaser.py with the given files and code.

Original phaser.py as phaserOrg.py with code:

$ python phaserOrg.py --bam realigned_ms01e.chr2n3.bam --vcf phased.MySpF1.chr2n3.vcf.gz --sample ms01e --o ResultOrgPhaser/ms01e --mapq 40 --baseq 10 --paired_end 1 --id_separator - --pass_only 0

##################################################
Welcome to phASER v1.0.1
Author: Stephane Castel (scastel@nygenome.org)
##################################################

#1. Loading heterozygous variants into intervals...
processing VCF...
creating variant mapping table...
55931 heterozygous sites being used for phasing (0 filtered, 4492 indels excluded, 60423 unphased)
#2. Retrieving reads that overlap heterozygous sites...
file: realigned_ms01e.chr2n3.bam
minimum mapq: 40
mapping reads to variants...
completed chromosome 3...
completed chromosome 2...
processing mapped reads...
using alignment score cutoff of 158
retrieved 2870876 reads
#3. Identifying connected variants...
calculating sequencing noise level...
sequencing noise level estimated at 0.000462
creating read sets...
generating read connectivity map...
testing variant connections versus noise...
4136 variant connections dropped because of conflicting configurations (threshold = 0.010000)
54890 variants covered by at least 1 read
#4. Identifying haplotype blocks...
#5. Phasing blocks...
#6. Outputting haplotypes...
#7. Outputting phased VCF...
GT field is not being updated with phASER genome wide phase. This can be changed using the --gw_phase_vcf argument.
Compressing and tabix indexing output VCF...
COMPLETED using 2870876 reads in 159 seconds using 1 threads
PHASED 53185 of 55931 all variants (= 0.950904) with at least one other variant
GENOME WIDE PHASED 0 of 60423 unphased variants (= 0.000000)
GENOME WIDE PHASE CORRECTED 0 of 55931 variants (= 0.000000)
Global maximum memory usage: 1145.59 (mb)

Updated phaser.py as phaser.py:

$ python phaser.py --bam realigned_ms01e.chr2n3.bam --vcf phased.MySpF1.chr2n3.vcf.gz --sample ms01e --o ResultUpdatedPhaser/ms01e --mapq 40 --baseq 10 --paired_end 1 --id_separator - --pass_only 0 --process_slow 1

##################################################
Welcome to phASER v1.0.1
Author: Stephane Castel (scastel@nygenome.org)
##################################################

#1. Loading heterozygous variants into intervals...
processing VCF...

Processing the readbackphasing in memory efficient mode.
Splitting the VCF by chromosome/contigs.

creating variant mapping table...
23840 heterozygous sites being used for phasing (0 filtered, 1898 indels excluded, 25738 unphased)
#2. Retrieving reads that overlap heterozygous sites...
file: realigned_ms01e.chr2n3.bam
minimum mapq: 40
mapping reads to variants...
completed chromosome 2...
processing mapped reads...
using alignment score cutoff of 157
retrieved 1120185 reads
#3. Identifying connected variants...
calculating sequencing noise level...
sequencing noise level estimated at 0.000445
creating read sets...
generating read connectivity map...
testing variant connections versus noise...
2109 variant connections dropped because of conflicting configurations (threshold = 0.010000)
23413 variants covered by at least 1 read
#4. Identifying haplotype blocks...
#5. Phasing blocks...
#6. Outputting haplotypes...
#7. Outputting phased VCF...
GT field is not being updated with phASER genome wide phase. This can be changed using the --gw_phase_vcf argument.
Compressing and tabix indexing output VCF...
COMPLETED using 1120185 reads in 98 seconds using 1 threads
PHASED 22679 of 23840 all variants (= 0.951300) with at least one other variant
GENOME WIDE PHASED 0 of 25738 unphased variants (= 0.000000)
GENOME WIDE PHASE CORRECTED 0 of 23840 variants (= 0.000000)
Global maximum memory usage: 514.47 (mb)

creating variant mapping table...
32091 heterozygous sites being used for phasing (0 filtered, 2594 indels excluded, 34685 unphased)
#2. Retrieving reads that overlap heterozygous sites...
file: realigned_ms01e.chr2n3.bam
minimum mapq: 40
mapping reads to variants...
completed chromosome 3...
processing mapped reads...
using alignment score cutoff of 159
retrieved 1749129 reads
#3. Identifying connected variants...
calculating sequencing noise level...
sequencing noise level estimated at 0.000471
creating read sets...
generating read connectivity map...
testing variant connections versus noise...
2038 variant connections dropped because of conflicting configurations (threshold = 0.010000)
31483 variants covered by at least 1 read
#4. Identifying haplotype blocks...
#5. Phasing blocks...
#6. Outputting haplotypes...
#7. Outputting phased VCF...
GT field is not being updated with phASER genome wide phase. This can be changed using the --gw_phase_vcf argument.
Compressing and tabix indexing output VCF...
COMPLETED using 1749129 reads in 194 seconds using 1 threads
PHASED 30499 of 32091 all variants (= 0.950391) with at least one other variant
GENOME WIDE PHASED 0 of 34685 unphased variants (= 0.000000)
GENOME WIDE PHASE CORRECTED 0 of 32091 variants (= 0.000000)
Global maximum memory usage: 740.13 (mb)

You can see that there is not much delay when we use updated phaser.py, but better memory management with process_slow.
I hope this changes are compatible with big bam and vcf inputs too. Additionally, this make phaser tool more laptop friendly.
Also, there must be better way of splitting VCF - leaving it for future.
With process_slow output are reported for each contig separately which may be merged - for future.

I have submitted a pull request which should reflect the proposed changes. All the required input and output result files are shared via dropbox https://www.dropbox.com/sh/swhldim0y2h5r88/AAA9e-Bo471A_hRZV75t8TG1a?dl=0

secastel commented 6 years ago

Wow, thanks for doing this! I need some time to validate the changes with our test cases and make sure everything is good. I'll try to do this within the next few weeks. Thanks again!

everestial commented 6 years ago

You are welcome !

Following the conversation in our earlier issues - I had indicated that I was taking data from phaser and creating further phase improvement tools using the outputs (mainly the PI and PG field values).

I have created phase-Extender https://github.com/everestial/phase-Extender and phase-Stitcher https://github.com/everestial/pHASE-Stitcher and hoping that the combination of these tools provide a good foundation of haplotype phasing in emerging genomic models which have very small genomic resources available.

The tools would need some optimization and code cleaning but is already usable at the moment. I also need to do some phasing quality checkup (mainly switch errors).

Thanks,

secastel / phaser

Memory issue on phaser #44