secastel / phaser

phasing and Allele Specific Expression from RNA-seq
GNU General Public License v3.0
105 stars 36 forks source link

Memory issue on phaser #44

Closed everestial closed 4 years ago

everestial commented 6 years ago

I have realized that phaser seems to take so much of a ram memory. Of several samples I tried there were two samples where phaser really didn't go through even in 2 days after several tries. I set number of threads to 1. But, still no luck.

Is there a way to fix memory handling in phaser ?

secastel commented 6 years ago

One issue I've seen that causes phASER to take up a lot of memory is if there are many chromosomes. This isn't a problem with Human, but for species that don't have well assembled genomes, and this have many contigs, it drives memory usage up. How many contigs do you have? I'd suggest to try running it per chromosome using the "--chr" argument and then aggregating the results.

everestial commented 6 years ago

Hmm, well I once thought so. Genome of the organism I am working with has main chromosomes but several other scaffolds, but overall genome size isn't big as humans.

Original phaser.py as phaserOrg.py with code:

$ python phaserOrg.py --bam realigned_ms01e.chr2n3.bam --vcf phased.MySpF1.chr2n3.vcf.gz --sample ms01e --o ResultOrgPhaser/ms01e --mapq 40 --baseq 10 --paired_end 1 --id_separator - --pass_only 0

##################################################
              Welcome to phASER v1.0.1
  Author: Stephane Castel (scastel@nygenome.org)
##################################################

#1. Loading heterozygous variants into intervals...
     processing VCF...
     creating variant mapping table...
          55931 heterozygous sites being used for phasing (0 filtered, 4492 indels excluded, 60423 unphased)
#2. Retrieving reads that overlap heterozygous sites...
     file: realigned_ms01e.chr2n3.bam
          minimum mapq: 40
          mapping reads to variants...
               completed chromosome 3...
               completed chromosome 2...
          processing mapped reads...
          using alignment score cutoff of 158
          retrieved 2870876 reads
#3. Identifying connected variants...
     calculating sequencing noise level...
     sequencing noise level estimated at 0.000462
     creating read sets...
     generating read connectivity map...
     testing variant connections versus noise...
     4136 variant connections dropped because of conflicting configurations (threshold = 0.010000)
     54890 variants covered by at least 1 read
#4. Identifying haplotype blocks...
#5. Phasing blocks...
#6. Outputting haplotypes...
#7. Outputting phased VCF...
     GT field is not being updated with phASER genome wide phase. This can be changed using the --gw_phase_vcf argument.
     Compressing and tabix indexing output VCF...
COMPLETED using 2870876 reads in 159 seconds using 1 threads
     PHASED  53185 of 55931 all variants (= 0.950904) with at least one other variant
     GENOME WIDE PHASED  0 of 60423 unphased variants (= 0.000000)
     GENOME WIDE PHASE CORRECTED  0 of 55931 variants (= 0.000000)
     Global maximum memory usage: 1145.59 (mb)


Updated phaser.py as phaser.py:

$ python phaser.py --bam realigned_ms01e.chr2n3.bam --vcf phased.MySpF1.chr2n3.vcf.gz --sample ms01e --o ResultUpdatedPhaser/ms01e --mapq 40 --baseq 10 --paired_end 1 --id_separator - --pass_only 0 --process_slow 1

##################################################
              Welcome to phASER v1.0.1
  Author: Stephane Castel (scastel@nygenome.org)
##################################################

#1. Loading heterozygous variants into intervals...
     processing VCF...

     Processing the readbackphasing in memory efficient mode.
     Splitting the VCF by chromosome/contigs. 

     creating variant mapping table...
          23840 heterozygous sites being used for phasing (0 filtered, 1898 indels excluded, 25738 unphased)
#2. Retrieving reads that overlap heterozygous sites...
     file: realigned_ms01e.chr2n3.bam
          minimum mapq: 40
          mapping reads to variants...
               completed chromosome 2...
          processing mapped reads...
          using alignment score cutoff of 157
          retrieved 1120185 reads
#3. Identifying connected variants...
     calculating sequencing noise level...
     sequencing noise level estimated at 0.000445
     creating read sets...
     generating read connectivity map...
     testing variant connections versus noise...
     2109 variant connections dropped because of conflicting configurations (threshold = 0.010000)
     23413 variants covered by at least 1 read
#4. Identifying haplotype blocks...
#5. Phasing blocks...
#6. Outputting haplotypes...
#7. Outputting phased VCF...
     GT field is not being updated with phASER genome wide phase. This can be changed using the --gw_phase_vcf argument.
     Compressing and tabix indexing output VCF...
COMPLETED using 1120185 reads in 98 seconds using 1 threads
     PHASED  22679 of 23840 all variants (= 0.951300) with at least one other variant
     GENOME WIDE PHASED  0 of 25738 unphased variants (= 0.000000)
     GENOME WIDE PHASE CORRECTED  0 of 23840 variants (= 0.000000)
     Global maximum memory usage: 514.47 (mb)

     creating variant mapping table...
          32091 heterozygous sites being used for phasing (0 filtered, 2594 indels excluded, 34685 unphased)
#2. Retrieving reads that overlap heterozygous sites...
     file: realigned_ms01e.chr2n3.bam
          minimum mapq: 40
          mapping reads to variants...
               completed chromosome 3...
          processing mapped reads...
          using alignment score cutoff of 159
          retrieved 1749129 reads
#3. Identifying connected variants...
     calculating sequencing noise level...
     sequencing noise level estimated at 0.000471
     creating read sets...
     generating read connectivity map...
     testing variant connections versus noise...
     2038 variant connections dropped because of conflicting configurations (threshold = 0.010000)
     31483 variants covered by at least 1 read
#4. Identifying haplotype blocks...
#5. Phasing blocks...
#6. Outputting haplotypes...
#7. Outputting phased VCF...
     GT field is not being updated with phASER genome wide phase. This can be changed using the --gw_phase_vcf argument.
     Compressing and tabix indexing output VCF...
COMPLETED using 1749129 reads in 194 seconds using 1 threads
     PHASED  30499 of 32091 all variants (= 0.950391) with at least one other variant
     GENOME WIDE PHASED  0 of 34685 unphased variants (= 0.000000)
     GENOME WIDE PHASE CORRECTED  0 of 32091 variants (= 0.000000)
     Global maximum memory usage: 740.13 (mb)

I have submitted a pull request which should reflect the proposed changes. All the required input and output result files are shared via dropbox https://www.dropbox.com/sh/swhldim0y2h5r88/AAA9e-Bo471A_hRZV75t8TG1a?dl=0

secastel commented 6 years ago

Wow, thanks for doing this! I need some time to validate the changes with our test cases and make sure everything is good. I'll try to do this within the next few weeks. Thanks again!

everestial commented 6 years ago

You are welcome !

Following the conversation in our earlier issues - I had indicated that I was taking data from phaser and creating further phase improvement tools using the outputs (mainly the PI and PG field values).

I have created phase-Extender https://github.com/everestial/phase-Extender and phase-Stitcher https://github.com/everestial/pHASE-Stitcher and hoping that the combination of these tools provide a good foundation of haplotype phasing in emerging genomic models which have very small genomic resources available.

The tools would need some optimization and code cleaning but is already usable at the moment. I also need to do some phasing quality checkup (mainly switch errors).

Thanks,