tfwillems / HipSTR

Genotype and phase short tandem repeats using Illumina whole-genome sequencing data
GNU General Public License v2.0
94 stars 31 forks source link

Memory allocation issue #55

Closed willibry closed 6 years ago

willibry commented 6 years ago

Hi,

I am trying to genotype around 1.2 million STRs across 861 samples.

I'm running HipSTR in parallel (1 CPU per chromosome, total of 23), and it runs fine for some time, but during this step:

Realigning 3567 out of 4057 read pools to polish flanking sequences

when the number of read pools are fairly high, I get this error and HipSTR stops:

terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc

As far as I have understood, this is a memory allocation issue. I have access to 1.5 TB of memory, and each job is using 3-4 GB, so I should be fine memory-wise. Any suggestion on how to solve this?

tfwillems commented 6 years ago

Hi @willibry, Thank you for reporting this issue. Can you do the following to help narrow things down:

  1. Verify you're using v0.6.1 by typing ./HipSTR --version

  2. Post the full command you used during your problematic run

  3. Post the full log messages for the problematic locus. This would just extend from the last instance of Processing region... until the Realigning message you noted above

Hopefully that will clarify whether this was potentially a memory issue (which seems likely ) and we can work from there

Best, Thomas

willibry commented 6 years ago

Thanks for your reply.

  1. I verified that I'm using HipSTR version v0.6.1.

  2. I ran this script, run_hipstr.sh, per chromosome:

#!/bin/bash

# Modules
module load xz

# Variables
HipSTR=/projects/cees/bin/HipSTR/HipSTR/HipSTR
BAMLIST=/node/work2/no_backup/william/analysis/hipstr/fileListBAMs.txt
GENOME=/node/work2/no_backup/william/data/genome.fasta
REGIONS=/node/work2/no_backup/william/scripts/hipstr_genes_only/genome.fasta.phobos.exact.gff.overlaps_genes.123.bed
CHR=$1

$HipSTR \
--chrom $1 \
--bam-files $BAMLIST \
--fasta $GENOME \
--regions $REGIONS \
--str-vcf $1.hipstr.vcf.gz

This is an example of a run with the error:

Processing region LG17 4106721 4106755
9603 reads overlapped region, of which 
    10 were hard clipped
    16 had an 'N' base call
    0 had low base quality scores
    75 did not have a unique mapping
    260 did not have a mate pair
    9242 PASSED ALL FILTERS
Found 9242 fully paired reads and 0 unpaired reads for downstream analyses
Removed 0 sets of PCR duplicate reads
Phased SNPs add info for 0 out of 9242 reads and 0 out of 857 samples
Building EM stutter model
Training EM stutter model
Learned stutter model 
    IN_FRAME [P_GEOM(rep)=0.414979, P_DOWN=0.0440448, P_UP=0.0911816]
    OUT_FRAME[P_GEOM(bp) =0.893676, P_DOWN=0.000335709, P_UP=0.000702465]
Left aligning reads
Generating candidate haplotypes
    TGATTTTCCACCATG...AGAGTGAGAGTG AGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGACCGAC             CGACCGACCGACCCC...ACGCTGTTGACA 
                                   AGAGAGAGAGAGAGAGAGAGAGA                                                                 
                                   AGAGAGAGAGAGAGAGAGAGAGAGAGAC                                                            
                                   AGAGTGAGAGAGAGAGAGAGAGACCGAC                                                            
                                   AGAGTGAGAGAGAGAGAGAGAGAGAGAC                                                            
                                   AGAGTGAGAGAGAGATTGAGAGAGAGAGA                                                           
                                   AGAGTGAGAGAGAGAGAGAGAGAGAGAGAC                                                          
                                   GAGAGAGAGAGAGAGAGAGAGAGAGACCGAC                                                         
                                   AGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAC                                                        
                                   AGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGA                                                       
                                   AGAGTGAGAGAGAGAGAGAGCGAGAGAGAGAGA                                                       
                                   GAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAC                                                       
                                   AGAGTGAGAGAGAGAGAGAGAGAGAGAGACCGAC                                                      
                                   AGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAC                                                      
                                   AGAGTGAGAGTGAGAGAGAGAGAGAGAGAGAGAC                                                      
                                   AGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGA                                                     
                                   AGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGA                                                     
                                   AGAGTGAGAGTGAGAGAGAGAGAGAGAGAGAGAGA                                                     
                                   AGAGTGAGAGAGAGAGAGAGAGAGAGAGAGACCGAC                                                    
                                   AGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGA                                                   
                                   AGAGTGAGAGTGAGAGAGAGAGAGAGAGAGAGAGAGA                                                   
                                   GAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGACCGAC                                                   
                                   AGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGACCGAC                                                  
                                   AGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAC                                                  
                                   AGAGAGAGAGAGAGAGAGAGAGAGATTGAGAGAGAGAGA                                                 
                                   AGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGA                                                 
                                   AGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAC                                                
                                   AGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGACCGAC                                                
                                   AGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAC                                                
                                   AGAGTGAGAGAGAGAGAGAGAGAGAGATTGAGAGAGAGAGA                                               
                                   AGAGTGAGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGA                                               
                                   AGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGACCGAC                                              
                                   AGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGACCGAC                                              
                                   AGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAC                                              
                                   AGAGTGAGAGTGAGAGAGAGAGAGAGAGAGAGAGAGACCGAC                                              
                                   AGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGACCGAC                                            
                                   AGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAC                                            
                                   AGAGAGAGATAGAGAGAGAGAGAGAGAGAGAGAGAGAGACCGAC                                            
                                   AGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAC                                            
                                   AGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAG                                            
                                   AGAGTGAGAGAGAGAGAGAGAGAGAGAGAGATTGAGAGAGAGAG                                            
                                   AGAGTGAGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGACCGAC                                            
                                   AGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGACCGAC                                            
                                   AGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGACCGAC                                          
                                   AGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGATTGAGAGAGAGAG                                          
                                   AGAGTGAGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGACCGAC                                          
                                   AGAGTGAGAGTGAGAGAGAGAGAGAGAGAGAGATTGAGAGAGAGAG                                          
                                   AGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGACCGAC                                        
                                   AGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAC                                        
                                   AGAGTGAGAGTGAGAGAGAGAGAGAGAGAGAGAGAGATTGAGAGAGAGAG                                      
                                   AGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGACCGAC                                    
                                   AGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGACCGAC                                
Aligning reads to each candidate haplotype
Identified 33 additional candidate alleles from stutter artifacts
    AGAGAGAGAGAGAGAGAGAGAGAGA
    AGAGAGAGAGAGAGAGAGAGAGAGACCGAC
    AGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAC
    AGAGAGAGAGAGAGAGAGAGAGAGAGAGAGACCGAC
    AGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGACCGAC
    AGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGACCGAC
    AGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAC
    AGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGACCGAC
    AGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAG
    AGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAC
    AGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAC
    AGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAC
    AGAGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAC
    AGAGGGAGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGACCGAC
    AGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAC
    AGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAG
    AGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAG
    AGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAG
    AGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGACCGAC
    AGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAG
    AGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAC
    AGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAC
    AGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGATTGAGAGAGAGAG
    AGAGTGAGAGAGAGAGAGAGCGAGAGAGAGAGAGA
    AGAGTGAGAGAGAGATTGAGAGAGAGAGAGA
    AGAGTGAGAGTGAGAGAGAGAGAGAGAGAGAGAGAC
    AGAGTGAGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAC
    AGAGTGAGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGA
    AGAGTGAGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGACCGAC
    AGAGTGAGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGACCGAC
    AGAGTGAGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAG
    AGAGTGAGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGATTGAGAGAGAGAG
    AGAGTGAGAGTGAGAGAGAGAGAGAGAGAGAGAGATTGAGAGAGAGAG
Identified 11 additional candidate alleles from stutter artifacts
    AGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGACCGAC
    AGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAG
    AGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAC
    AGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAC
    AGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAG
    AGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAG
    AGAGTGAGAGTGAGAGAGAGAGAGAGAGAGAGAGAG
    AGAGTGAGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAG
    AGAGTGAGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGACCGAC
    AGAGTGAGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAC
    AGAGTGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGACCGAC
Identified 3 additional candidate alleles from stutter artifacts
    AGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAG
    AGAGTGAGAGTGAGAGAGAGAGAGAGAGAGAGAG
    AGAGTGAGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGACCGAG
Recomputing sample posteriors after removing 15 uncalled alleles across 1 blocks
Reassembling flanking sequences
    Pruning low frequency left flank    TGATTCTCCACCATGAGAGAGAGAGAGTGAGAGTG 3
    Pruning low frequency left flank    TGATTCTCCACCATGAGAGTGAGAGAGTGAGAGAG 1
    Pruning low frequency left flank    TGATTTTCCACCATGAGAGTGAGAGAGAGAGAGTG 1
    Pruning low frequency left flank    TGATTTTCCACCATGAGAGTGAGAGAGTGAGAGAG 1
    Pruning low frequency left flank    TGATTTTCCACCATGAGAGTGAGAGAGTGAGAGGG 1
    Pruning low frequency left flank    TGATTTTCCACCATGAGAGTGAGAGAGTGATAGTG 1
    Pruning low frequency left flank    TGATTTTCCACCATGAGAGTGAGAGAGTTAGAGTG 1
    Pruning low frequency left flank    TGATTTTCCACCATTAGAGTGAGAGAGTGAGAGTG 1
Identified 3 new left flank haplotype(s)
    TGATTCTCCACCATGAGAGTGAGAGAGTGAGAGTG 238
    TGATTCTCCACCATTAGAGTGAGAGAGTGAGAGTG 38
    TGATTCTCCACCGTGAGAGTGAGAGAGTGAGAGTG 21
    TGATTTTCCACCATGAGAGTGAGAGAGTGAGAGTG REF_SEQ

    Pruning low frequency right flank   AGACCGACCGACCCCAGGTCCCAACGCTGTTGACA 1
    Pruning low frequency right flank   CAACCGACCGACCCCAGGTCCTAACGCTGTTGACA 7
    Pruning low frequency right flank   CGACCGACCGACCCCAGGTCCCAACGCTGTTGACA 2
    Pruning low frequency right flank   CGACCGACGGACCCCAGGTCCTAACGCTGTTGACA 1
Identified 4 new right flank haplotype(s)
    AGACCGACCGACCCCAGGTCCTAACGCTGTTGACA 60
    AGACCGACCGACCCCGGGTCCTAACGCTGTTGACA 84
    CAACCGACCGACCCCGGGTCCTAACGCTGTTGACA 180
    CGACCGACCGACCCCAGGTCCTAACGCTGTTGACA 239
    CGACCGACCGACCCCGGGTCCTAACGCTGTTGACA REF_SEQ

Realigning 3935 out of 5105 read pools to polish flanking sequences
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
run_hipstr.sh: line 19: 50835 Aborted                 (core dumped) $HipSTR --chrom $1 --bam-files $BAMLIST --fasta $GENOME --regions $REGIONS --str-vcf $1.hipstr.vcf.gz

Best, William

tfwillems commented 6 years ago

Hi @willibry,

Thank you for providing such detailed information. It certainly appears that this is likely a memory allocation issue. This is quite an unusual STR, as there are an incredibly high number of candidate alleles. You can see this clearly by the initial number of candidate haplotypes (>50), followed by the many many additional alleles found after that (another 40ish).

Typically a human STR has a handful of alleles or tens of alleles if it's highly polymorphic. I'm doubtful there are in fact so many alleles in a population of just 800 individuals, so it may be that many reads are mismapped to this locus. Nonetheless, what's causing the crash is essentially the following:

  1. HipSTR looks for candidate STR alleles, of which it found around 100
  2. It then looks for candidate sequences upstream and downstream of the STR, for which it finds 4 candidate left sequences and 5 candidate right sequences.
  3. It then considers all possible haplotype combinations, resulting in H = 4 x 100 x 5 = ~2000 candidate haplotypes. For a typical human STR, H is usually on the order of 5-20.

For realigning reads, memory usage scales in terms of the the number of reads and haplotypes, H x R For calculating genotype probabilities, memory scales in terms of samples and haplotypes, S x H^2. So as the number of candidate haplotypes explodes as it has in this case, either of these steps is likely to require too much memory and result in your error.

Do you see many STRs that result in this type of memory crash, or is this the only example you've found? If you do see many, how good is the reference genome you're aligning to and what aligner are you using? I suspect these types of issues could be caused by many reads misaligning to a single locus and artificially inflating the number of haplotypes. If you don't see many, the simplest fix might be to just remove this pathological locus from your BED file and continue.

Happy to discuss potential fixes we could make within HipSTR to skip such extreme STRs once we discuss the above

Best, Thomas

willibry commented 6 years ago

Hi @tfwillems,

Thank you for the explanation. Yes, there are quite a few STRs like this - it is hard to say exactly how many, as HipSTR stops after the first one per chromosome.

The reference genome has a scaffold N50 of 1.15 Mbp (for more info please see https://doi.org/10.1186/s12864-016-3448-x) and quite a high density of STRs (~ 11 %), which also made assembly a bit troublesome. BWA-MEM was used for mapping and alignment. I guess the high density of STRs can lead to mismapping issues as well and thereby inflate the number of variants. On the other hand, this is a non-model organism and it is perhaps hard to know what STR variant distribution to expect.

Best, William

tfwillems commented 6 years ago

Hi @willibry,

Apologies it has taken me so long to get back to you. If you're still interested in using HipSTR, I've added a --max-haps command line option that should allow you to skip STRs that have too many haplotypes. This should hopefully resolve the memory explosion issues you've seen.

Redownloading HipSTR from github, compiling, and rerunning your scripts with --max-haps 100 would effectively skip over any loci that seem to have more than 100 candidate haplotypes. When these loci are encountered, HipSTR will output a message to the log file: Aborting genotyping of the locus as too many candidate haplotypes were found (# Found = 120, MAX = 100)

If you do try this out, I'd love to hear if 1) it resolves your issues and 2) how many loci seem to have a number of candidate haplotypes above this threshold

Best, Thomas

willibry commented 6 years ago

Hi @tfwillems,

Thank you for adding the --max-haps functionality to HipSTR, it absolutely resolved the issue in that I can successfully genotype all the non-problematic loci.

I have been waiting for HipSTR runs to finish to answer your second question. So far, four chromosomes are done. For these, the line Aborting genotyping of the locus as too many candidate haplotypes were found appears between 2555 and 2846 times, so they appear to be quite common.

Best, William

tfwillems commented 6 years ago

Hi @willibry,

Glad to hear that did the trick! For the chromosomes where ~2500 loci were skipped due to this issue, how many loci were successfully genotyped?

The STR references I've built in the past for HipSTR used TandemRepeatsFinder as described here. In general, we typically avoid STRs that are right next to one another (e.g. ATAG....ATAG next to AC.....AC) as these will be quite tricky to genotype and will likely result in an explosion of haplotypes. The scripts in the referenced repo should make it fairly easy for you to build a new STR reference using TRF. It might be worth generating an STR reference with those scripts and seeing whether that seems to result in fewer skipped STRs.

For now, I'll close this issue, but feel free to shoot me an email!

Best, Thomas

willibry commented 6 years ago

Hi @tfwillems,

This is the results for all but one chromosome. The number after the file name gives the number of problematic loci.

LG17.out 2616 Genotyping succeeded for 44236/57400 loci
LG23.out 2555 Genotyping succeeded for 55172/71100 loci
LG02.out 2846 Genotyping succeeded for 59602/75492 loci
LG13.out 2656 Genotyping succeeded for 60180/75459 loci
LG20.out 2748 Genotyping succeeded for 54216/68437 loci
LG15.out 3517 Genotyping succeeded for 61095/77858 loci
LG10.out 2785 Genotyping succeeded for 53276/68294 loci
LG19.out 3143 Genotyping succeeded for 50738/65770 loci
LG21.out 2877 Genotyping succeeded for 56728/72861 loci
LG22.out 3175 Genotyping succeeded for 58277/75511 loci
LG12.out 2822 Genotyping succeeded for 66587/84580 loci
LG06.out 3004 Genotyping succeeded for 63233/79658 loci
LG05.out 3497 Genotyping succeeded for 59529/76230 loci
LG18.out 3472 Genotyping succeeded for 59180/76570 loci
LG16.out 3089 Genotyping succeeded for 67271/84995 loci
LG08.out 3598 Genotyping succeeded for 65699/83205 loci
LG11.out 3085 Genotyping succeeded for 73401/92947 loci
LG09.out 3289 Genotyping succeeded for 63468/80986 loci
LG14.out 3651 Genotyping succeeded for 64331/82735 loci
LG01.out 3021 Genotyping succeeded for 75540/96192 loci
LG03.out 3947 Genotyping succeeded for 73068/94032 loci
LG07.out 4184 Genotyping succeeded for 76191/97230 loci

I'll try building a new reference using the scripts you suggested. The reason I chose to use Phobos for detecting STRs in the reference genome is that I could limit the output to exact (pure) repeats only, but I see now that the fix_trf_output.py script actually takes care of this.

Best, William

tfwillems commented 6 years ago

Hi @willibry,

Thanks for the update! That's definitely a lot more than I would expect based on my prior experience, but again that's primary been using human data.

I'd love to see whether this holds up if you get to test the other reference. Although those scripts don't restrict to pure repeats, they do require a minimum repeat score to filter out particularly short/interrupted STRs.

If it does hold up, I think this primarily suggests that either i) there are widespread mapping errors to that reference or ii) it's a real reflection of the diversity of repeat alleles in this population. Either way, super interesting!

Best, Thomas