Closed willibry closed 6 years ago
Hi @willibry, Thank you for reporting this issue. Can you do the following to help narrow things down:
Verify you're using v0.6.1 by typing ./HipSTR --version
Post the full command you used during your problematic run
Post the full log messages for the problematic locus. This would just extend from the last instance of Processing region... until the Realigning message you noted above
Hopefully that will clarify whether this was potentially a memory issue (which seems likely ) and we can work from there
Best, Thomas
Thanks for your reply.
I verified that I'm using HipSTR version v0.6.1.
I ran this script, run_hipstr.sh, per chromosome:
#!/bin/bash
# Modules
module load xz
# Variables
HipSTR=/projects/cees/bin/HipSTR/HipSTR/HipSTR
BAMLIST=/node/work2/no_backup/william/analysis/hipstr/fileListBAMs.txt
GENOME=/node/work2/no_backup/william/data/genome.fasta
REGIONS=/node/work2/no_backup/william/scripts/hipstr_genes_only/genome.fasta.phobos.exact.gff.overlaps_genes.123.bed
CHR=$1
$HipSTR \
--chrom $1 \
--bam-files $BAMLIST \
--fasta $GENOME \
--regions $REGIONS \
--str-vcf $1.hipstr.vcf.gz
This is an example of a run with the error:
Processing region LG17 4106721 4106755
9603 reads overlapped region, of which
10 were hard clipped
16 had an 'N' base call
0 had low base quality scores
75 did not have a unique mapping
260 did not have a mate pair
9242 PASSED ALL FILTERS
Found 9242 fully paired reads and 0 unpaired reads for downstream analyses
Removed 0 sets of PCR duplicate reads
Phased SNPs add info for 0 out of 9242 reads and 0 out of 857 samples
Building EM stutter model
Training EM stutter model
Learned stutter model
IN_FRAME [P_GEOM(rep)=0.414979, P_DOWN=0.0440448, P_UP=0.0911816]
OUT_FRAME[P_GEOM(bp) =0.893676, P_DOWN=0.000335709, P_UP=0.000702465]
Left aligning reads
Generating candidate haplotypes
TGATTTTCCACCATG...AGAGTGAGAGTG AGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGACCGAC CGACCGACCGACCCC...ACGCTGTTGACA
AGAGAGAGAGAGAGAGAGAGAGA
AGAGAGAGAGAGAGAGAGAGAGAGAGAC
AGAGTGAGAGAGAGAGAGAGAGACCGAC
AGAGTGAGAGAGAGAGAGAGAGAGAGAC
AGAGTGAGAGAGAGATTGAGAGAGAGAGA
AGAGTGAGAGAGAGAGAGAGAGAGAGAGAC
GAGAGAGAGAGAGAGAGAGAGAGAGACCGAC
AGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAC
AGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGA
AGAGTGAGAGAGAGAGAGAGCGAGAGAGAGAGA
GAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAC
AGAGTGAGAGAGAGAGAGAGAGAGAGAGACCGAC
AGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAC
AGAGTGAGAGTGAGAGAGAGAGAGAGAGAGAGAC
AGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGA
AGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGA
AGAGTGAGAGTGAGAGAGAGAGAGAGAGAGAGAGA
AGAGTGAGAGAGAGAGAGAGAGAGAGAGAGACCGAC
AGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGA
AGAGTGAGAGTGAGAGAGAGAGAGAGAGAGAGAGAGA
GAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGACCGAC
AGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGACCGAC
AGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAC
AGAGAGAGAGAGAGAGAGAGAGAGATTGAGAGAGAGAGA
AGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGA
AGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAC
AGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGACCGAC
AGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAC
AGAGTGAGAGAGAGAGAGAGAGAGAGATTGAGAGAGAGAGA
AGAGTGAGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGA
AGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGACCGAC
AGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGACCGAC
AGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAC
AGAGTGAGAGTGAGAGAGAGAGAGAGAGAGAGAGAGACCGAC
AGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGACCGAC
AGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAC
AGAGAGAGATAGAGAGAGAGAGAGAGAGAGAGAGAGAGACCGAC
AGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAC
AGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAG
AGAGTGAGAGAGAGAGAGAGAGAGAGAGAGATTGAGAGAGAGAG
AGAGTGAGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGACCGAC
AGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGACCGAC
AGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGACCGAC
AGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGATTGAGAGAGAGAG
AGAGTGAGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGACCGAC
AGAGTGAGAGTGAGAGAGAGAGAGAGAGAGAGATTGAGAGAGAGAG
AGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGACCGAC
AGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAC
AGAGTGAGAGTGAGAGAGAGAGAGAGAGAGAGAGAGATTGAGAGAGAGAG
AGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGACCGAC
AGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGACCGAC
Aligning reads to each candidate haplotype
Identified 33 additional candidate alleles from stutter artifacts
AGAGAGAGAGAGAGAGAGAGAGAGA
AGAGAGAGAGAGAGAGAGAGAGAGACCGAC
AGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAC
AGAGAGAGAGAGAGAGAGAGAGAGAGAGAGACCGAC
AGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGACCGAC
AGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGACCGAC
AGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAC
AGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGACCGAC
AGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAG
AGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAC
AGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAC
AGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAC
AGAGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAC
AGAGGGAGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGACCGAC
AGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAC
AGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAG
AGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAG
AGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAG
AGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGACCGAC
AGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAG
AGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAC
AGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAC
AGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGATTGAGAGAGAGAG
AGAGTGAGAGAGAGAGAGAGCGAGAGAGAGAGAGA
AGAGTGAGAGAGAGATTGAGAGAGAGAGAGA
AGAGTGAGAGTGAGAGAGAGAGAGAGAGAGAGAGAC
AGAGTGAGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAC
AGAGTGAGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGA
AGAGTGAGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGACCGAC
AGAGTGAGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGACCGAC
AGAGTGAGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAG
AGAGTGAGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGATTGAGAGAGAGAG
AGAGTGAGAGTGAGAGAGAGAGAGAGAGAGAGAGATTGAGAGAGAGAG
Identified 11 additional candidate alleles from stutter artifacts
AGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGACCGAC
AGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAG
AGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAC
AGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAC
AGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAG
AGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAG
AGAGTGAGAGTGAGAGAGAGAGAGAGAGAGAGAGAG
AGAGTGAGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAG
AGAGTGAGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGACCGAC
AGAGTGAGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAC
AGAGTGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGACCGAC
Identified 3 additional candidate alleles from stutter artifacts
AGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAG
AGAGTGAGAGTGAGAGAGAGAGAGAGAGAGAGAG
AGAGTGAGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGACCGAG
Recomputing sample posteriors after removing 15 uncalled alleles across 1 blocks
Reassembling flanking sequences
Pruning low frequency left flank TGATTCTCCACCATGAGAGAGAGAGAGTGAGAGTG 3
Pruning low frequency left flank TGATTCTCCACCATGAGAGTGAGAGAGTGAGAGAG 1
Pruning low frequency left flank TGATTTTCCACCATGAGAGTGAGAGAGAGAGAGTG 1
Pruning low frequency left flank TGATTTTCCACCATGAGAGTGAGAGAGTGAGAGAG 1
Pruning low frequency left flank TGATTTTCCACCATGAGAGTGAGAGAGTGAGAGGG 1
Pruning low frequency left flank TGATTTTCCACCATGAGAGTGAGAGAGTGATAGTG 1
Pruning low frequency left flank TGATTTTCCACCATGAGAGTGAGAGAGTTAGAGTG 1
Pruning low frequency left flank TGATTTTCCACCATTAGAGTGAGAGAGTGAGAGTG 1
Identified 3 new left flank haplotype(s)
TGATTCTCCACCATGAGAGTGAGAGAGTGAGAGTG 238
TGATTCTCCACCATTAGAGTGAGAGAGTGAGAGTG 38
TGATTCTCCACCGTGAGAGTGAGAGAGTGAGAGTG 21
TGATTTTCCACCATGAGAGTGAGAGAGTGAGAGTG REF_SEQ
Pruning low frequency right flank AGACCGACCGACCCCAGGTCCCAACGCTGTTGACA 1
Pruning low frequency right flank CAACCGACCGACCCCAGGTCCTAACGCTGTTGACA 7
Pruning low frequency right flank CGACCGACCGACCCCAGGTCCCAACGCTGTTGACA 2
Pruning low frequency right flank CGACCGACGGACCCCAGGTCCTAACGCTGTTGACA 1
Identified 4 new right flank haplotype(s)
AGACCGACCGACCCCAGGTCCTAACGCTGTTGACA 60
AGACCGACCGACCCCGGGTCCTAACGCTGTTGACA 84
CAACCGACCGACCCCGGGTCCTAACGCTGTTGACA 180
CGACCGACCGACCCCAGGTCCTAACGCTGTTGACA 239
CGACCGACCGACCCCGGGTCCTAACGCTGTTGACA REF_SEQ
Realigning 3935 out of 5105 read pools to polish flanking sequences
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
run_hipstr.sh: line 19: 50835 Aborted (core dumped) $HipSTR --chrom $1 --bam-files $BAMLIST --fasta $GENOME --regions $REGIONS --str-vcf $1.hipstr.vcf.gz
Best, William
Hi @willibry,
Thank you for providing such detailed information. It certainly appears that this is likely a memory allocation issue. This is quite an unusual STR, as there are an incredibly high number of candidate alleles. You can see this clearly by the initial number of candidate haplotypes (>50), followed by the many many additional alleles found after that (another 40ish).
Typically a human STR has a handful of alleles or tens of alleles if it's highly polymorphic. I'm doubtful there are in fact so many alleles in a population of just 800 individuals, so it may be that many reads are mismapped to this locus. Nonetheless, what's causing the crash is essentially the following:
For realigning reads, memory usage scales in terms of the the number of reads and haplotypes, H x R For calculating genotype probabilities, memory scales in terms of samples and haplotypes, S x H^2. So as the number of candidate haplotypes explodes as it has in this case, either of these steps is likely to require too much memory and result in your error.
Do you see many STRs that result in this type of memory crash, or is this the only example you've found? If you do see many, how good is the reference genome you're aligning to and what aligner are you using? I suspect these types of issues could be caused by many reads misaligning to a single locus and artificially inflating the number of haplotypes. If you don't see many, the simplest fix might be to just remove this pathological locus from your BED file and continue.
Happy to discuss potential fixes we could make within HipSTR to skip such extreme STRs once we discuss the above
Best, Thomas
Hi @tfwillems,
Thank you for the explanation. Yes, there are quite a few STRs like this - it is hard to say exactly how many, as HipSTR stops after the first one per chromosome.
The reference genome has a scaffold N50 of 1.15 Mbp (for more info please see https://doi.org/10.1186/s12864-016-3448-x) and quite a high density of STRs (~ 11 %), which also made assembly a bit troublesome. BWA-MEM was used for mapping and alignment. I guess the high density of STRs can lead to mismapping issues as well and thereby inflate the number of variants. On the other hand, this is a non-model organism and it is perhaps hard to know what STR variant distribution to expect.
Best, William
Hi @willibry,
Apologies it has taken me so long to get back to you. If you're still interested in using HipSTR, I've added a --max-haps command line option that should allow you to skip STRs that have too many haplotypes. This should hopefully resolve the memory explosion issues you've seen.
Redownloading HipSTR from github, compiling, and rerunning your scripts with --max-haps 100 would effectively skip over any loci that seem to have more than 100 candidate haplotypes. When these loci are encountered, HipSTR will output a message to the log file:
Aborting genotyping of the locus as too many candidate haplotypes were found (# Found = 120, MAX = 100)
If you do try this out, I'd love to hear if 1) it resolves your issues and 2) how many loci seem to have a number of candidate haplotypes above this threshold
Best, Thomas
Hi @tfwillems,
Thank you for adding the --max-haps functionality to HipSTR, it absolutely resolved the issue in that I can successfully genotype all the non-problematic loci.
I have been waiting for HipSTR runs to finish to answer your second question. So far, four chromosomes are done. For these, the line Aborting genotyping of the locus as too many candidate haplotypes were found
appears between 2555 and 2846 times, so they appear to be quite common.
Best, William
Hi @willibry,
Glad to hear that did the trick! For the chromosomes where ~2500 loci were skipped due to this issue, how many loci were successfully genotyped?
The STR references I've built in the past for HipSTR used TandemRepeatsFinder as described here. In general, we typically avoid STRs that are right next to one another (e.g. ATAG....ATAG next to AC.....AC) as these will be quite tricky to genotype and will likely result in an explosion of haplotypes. The scripts in the referenced repo should make it fairly easy for you to build a new STR reference using TRF. It might be worth generating an STR reference with those scripts and seeing whether that seems to result in fewer skipped STRs.
For now, I'll close this issue, but feel free to shoot me an email!
Best, Thomas
Hi @tfwillems,
This is the results for all but one chromosome. The number after the file name gives the number of problematic loci.
LG17.out 2616 Genotyping succeeded for 44236/57400 loci
LG23.out 2555 Genotyping succeeded for 55172/71100 loci
LG02.out 2846 Genotyping succeeded for 59602/75492 loci
LG13.out 2656 Genotyping succeeded for 60180/75459 loci
LG20.out 2748 Genotyping succeeded for 54216/68437 loci
LG15.out 3517 Genotyping succeeded for 61095/77858 loci
LG10.out 2785 Genotyping succeeded for 53276/68294 loci
LG19.out 3143 Genotyping succeeded for 50738/65770 loci
LG21.out 2877 Genotyping succeeded for 56728/72861 loci
LG22.out 3175 Genotyping succeeded for 58277/75511 loci
LG12.out 2822 Genotyping succeeded for 66587/84580 loci
LG06.out 3004 Genotyping succeeded for 63233/79658 loci
LG05.out 3497 Genotyping succeeded for 59529/76230 loci
LG18.out 3472 Genotyping succeeded for 59180/76570 loci
LG16.out 3089 Genotyping succeeded for 67271/84995 loci
LG08.out 3598 Genotyping succeeded for 65699/83205 loci
LG11.out 3085 Genotyping succeeded for 73401/92947 loci
LG09.out 3289 Genotyping succeeded for 63468/80986 loci
LG14.out 3651 Genotyping succeeded for 64331/82735 loci
LG01.out 3021 Genotyping succeeded for 75540/96192 loci
LG03.out 3947 Genotyping succeeded for 73068/94032 loci
LG07.out 4184 Genotyping succeeded for 76191/97230 loci
I'll try building a new reference using the scripts you suggested. The reason I chose to use Phobos for detecting STRs in the reference genome is that I could limit the output to exact (pure) repeats only, but I see now that the fix_trf_output.py script actually takes care of this.
Best, William
Hi @willibry,
Thanks for the update! That's definitely a lot more than I would expect based on my prior experience, but again that's primary been using human data.
I'd love to see whether this holds up if you get to test the other reference. Although those scripts don't restrict to pure repeats, they do require a minimum repeat score to filter out particularly short/interrupted STRs.
If it does hold up, I think this primarily suggests that either i) there are widespread mapping errors to that reference or ii) it's a real reflection of the diversity of repeat alleles in this population. Either way, super interesting!
Best, Thomas
Hi,
I am trying to genotype around 1.2 million STRs across 861 samples.
I'm running HipSTR in parallel (1 CPU per chromosome, total of 23), and it runs fine for some time, but during this step:
Realigning 3567 out of 4057 read pools to polish flanking sequences
when the number of read pools are fairly high, I get this error and HipSTR stops:
As far as I have understood, this is a memory allocation issue. I have access to 1.5 TB of memory, and each job is using 3-4 GB, so I should be fine memory-wise. Any suggestion on how to solve this?