tfwillems / HipSTR

Genotype and phase short tandem repeats using Illumina whole-genome sequencing data
GNU General Public License v2.0
94 stars 31 forks source link

non-model organism, hipSTR is terminated at certain loci, doesnt accept multiple bam-samples in the list, doesn't find supplied chromosomes(scaffolds) as targets #15

Closed estolle closed 8 years ago

estolle commented 8 years ago

Hi there

I just ran into some issue using HipSTR to call SSR genotypes for some HiSeq4000 150bp PE reads for an insect non-model organism. Assembly contins >70000 contigs/scaffolds. The sequenced samples are haploid. Coverage is fairly low, but some individuals have something like 20X. I found 3 problems running HipSTR with my dataset. First, I seem to be not able to run multiple bam files. When running a single bamfile, the supplied target scaffold is not found in the SSR-bedfile, although its there. Last, when running on all samples (no specified target), it crashes after a few seconds at some locus - maybe depending on how many reads are left over for that locus after internal filtering (when the map-quality flag (min map quality set to 8) was used, the crash happend at a different locus. Any idea how to format the input lists (bam-samps, bam-libs) to make it work with hipSTR and what the issue with the crashed might be?

Thanks for any suggestions, Eckart

This is my command /home/scratch/programs/HipSTR/HipSTR \ --bams g20160730/bam/710508.bam \ --fasta $REF \ --regions $STRBED \ --bam-samps $BAMF/hipstr/hipstrbamsamples.lst \ --bam-libs $BAMF/hipstr/hipstrbamlib.lst \ --str-vcf $HPSTOUT/test.vcf.gz \ --max-flank-indel 0.25 \ --no-rmdup --max-mate-dist 1000 --viz-out $HPSTOUT/testviz.gz \ --max-str-len 100 --max-reads 100 --min-reads 3 --hide-allreads \ --stutter-out $HPSTOUT/stutter_models.txt

$STRBED (scaffold names contain "." and "_") scaffold00020 2555988 2556007 2 10 TA scaffold00020 2565730 2565752 3 7 ATA scaffold00020 2572977 2573003 3 9 GAC scaffold00020 2575532 2575556 2 12 TG scaffold00020 2580788 2580830 2 21 TA scaffold00020 2580860 2580907 2 24 AT

Problem1: multiple bamfiles / samplelists not recognized: 2 bam files as input do not work, --bams g20160730/bam/710508.bam,g20160730/bam/712506.bam $BAMF/hipstr/hipstrbamsamples.lst Co1B,M6b $BAMF/hipstr/hipstrbamlib.lst lib1,lib2 Detected 2 BAM files ERROR: Number of BAM files in --bams and samples in --bam-samps must match Exiting... --> imo a list (one entry per line) would be much easier to supply / generate than a csv ... naturally fitting to the bam-list input (one entry per line) --> I tried semicolon instead comma, no success, no idea now how else this list should be formatted, the github manual is a bit unclear there

Problem2: limit to specific chromosomes, specify all chromosomes as haploid single bam as input, if no chr / chr-ploidy is set it crashes anyway (see below) input a single bam file works $BAMF/hipstr/hipstrbamsamples.lst Co1B $BAMF/hipstr/hipstrbamlib.lst lib1 --bams g20160730/bam/710508.bam

     --chrom scaffold00020,scaffold00008 \
     --haploid-chrs scaffold00020,scaffold00008 \

--> runs through, reporting 0 loci found in the bed file --> the loci are present (see above) --> an option setting everything to haploid would be great (I work with haploid samples). --> alternative It it would be possible to provide a list.txt containing each scaffolds/contig/chr (one per line) which is haploid, that would be easy to supply. --> right now I would need to supply 70000 contig/scaffold names in the hipSTR command

Problem3: crash after short while

with --min-mapq 8

... ... ... Processing region contig20828 0 52 Found 0 fully paired reads and 0 unpaired reads 28 reads overlapped region, of which 0 had an SA (split alignment) BAM tag 0 had an 'N' base call 25 had too low of a mapping quality 0 had low base quality scores 3 did not span the STR 0 did not have a unique mapping 0 did not have a mate pair 0 PASSED ALL FILTERS

terminate called after throwing an instance of 'std::out_of_range' what(): basic_string::substr: __pos (which is 4294967295) > this->size() (which is 12889) [1] 20006 abort (core dumped) /home/scratch/programs/HipSTR/HipSTR --bams g20160730/bam/710508.ba

run with no map quality filter

... ... ... Processing region contig20531 16624 16645 Found 8 fully paired reads and 0 unpaired reads 33 reads overlapped region, of which 9 had an SA (split alignment) BAM tag 0 had an 'N' base call 0 had too low of a mapping quality 0 had low base quality scores 13 did not span the STR 1 did not have a unique mapping 2 did not have a mate pair 8 PASSED ALL FILTERS

Phased SNPs add info for 0 out of 8 reads and 0 out of 1 samples Building EM stutter genotyper Training EM stutter genotyper Learned stutter model: IN_FRAME [P_GEOM(rep)=0.953462, P_DOWN=0.112761, P_UP=0.112761] OUT_FRAME[P_GEOM(bp) =0.972593, P_DOWN=0.112761, P_UP=0.112761]

Left aligning reads... HipSTR: SeqAlignment/AlignmentOps.cpp:271: void trimAlignment(BamTools::BamAlignment&, int32_t, int32_t, char): Assertion `ltrim+rtrim <= aln.QueryBases.size()' failed. [1] 20055 abort (core dumped) /home/scratch/programs/HipSTR/HipSTR --bams g20160730/bam/710508.ba

tfwillems commented 8 years ago

Hi Eckart,

I'm sorry that you're encountering these issues. In regards to problem 1, it looks like you're providing --bam-samps and --bam-libs with a path to a file. But instead it should be a comma-separated list, just like in --bams. So for example, youe arguments may look something like: --bams sample_1.bam,sample_2.bam,sample_3.bam --bam-samps SAMPLE1,SAMPLE2,SAMPLE3 --bam-libs LIB1,LIB2,LIB3

In regards to your other two issues, debugging them will be a bit difficult if I don't have access to the data myself. Would it be possible for you to send me the STR region file and the BAM that it's crashing with? If so, you can email them to twillems at mit dot edu. Adding an option to treat all chromosomes as haploid should be fairly straightforward once we've resolved these two issues

Best, Thomas

estolle commented 8 years ago

an little additional thing: tried to viz some loci which apparently went through the analysis (before it terminated). The file was tabix indexed and the I used the coordinates fro a SSR present in the bedfile and which should have been processed

VizAln $BAMF/hipstr/testviz.gz scaffold00001 6799480 6799534

got this and Its unclear what this means/how to fix Couldn't get a file descriptor referring to the console

tfwillems commented 8 years ago

VizAln tries to open an HTML file that it generates containing the alignments. But if you're remotely connected to a server without X forwarding, you can't open files and so it'll generate this "Couldn't get a file descriptor referring to the console" error.

tfwillems commented 8 years ago

I've added a --hap-chr-file option in which you can specify a file containg haploid chromosome names, one per line

estolle commented 8 years ago

Hi Thomas

awesome, thanks for the clarification of the samples/libs. Indeed i though it expect a file. I will try to run it again this way. The --hap-chr-file option is great for our purpose! thanks alot!

And thanks alot for the offer to look up the error regarding certain loci. I will prep the bams to send over asap. And yeah, the VizAlign thing also came up when I had the X forwarding (eg xclock works as normal). But I can try to fix this when the other things work out. Maybe it had to do with the data in the end (coming from the terminated run). Thansk alot in advance.

tfwillems commented 8 years ago

Resolved these issues by ignoring STRs very near to contig ends and discarding BAM alignments with empty sequences