jiantao / Tangram

Fast Structural Variation Detection Toolbox
MIT License

Tangram_scan segmentation fault using BWA aligned reads #5

Open zjassaf opened 9 years ago

zjassaf commented 9 years ago

Hi,

I would like to use Tangram to identify the locations of transposable elements in Drosophila, but when I run tangram_scan I get a segmentation fault. I suspect that tangram_bam is not working, as it looks like the ZA tags are empty (I think?). However, I know that my strains should be heterozygous for a number of different transposable elements, and in fact there are already estimates of their locations. I'd rather not run Mosaik, so if there is a way to get tangram_bam to work, that would be nice.

I've put details of what I'm doing below. Thanks! Zoe

As a positive control, I know from previous work that there should be at least 78 heterozygous copies of INE_1 in my strain. For example, I have this data:

te              presence  ch  Upstream_estimate  Downstream_estimate
INE-1:TIR:DNA   yes       2R  2496555            2497124
INE-1:TIR:DNA   yes       4   286454             287918
INE-1:TIR:DNA   yes       3L  17862241           17862700

I can get a copy of INE_1 sequence from flybase (transposon_sequence_set.embl.txt), so I make my moblist file, which contains only:

moblist_INE-1 GA(transposon_sequence_set.embl.txt) SN(Drosophila melanogaster) tatacccgttactagattcgttgaaatgaatgtaacaggcagaaggaagcgtcttagaccatatatagtatatacatacatgtatattcttgatcaggatcaatagccgagtcgatcttgccatatccgtctgtccgtatgaacgtcgagatctcaggaactataaaagctagaaggtttagattcagcatacagagacaaagacgcaagtagccatgcccactctaacgtccacaaacagcgcaaaactatcacgcccacacttttgaaaaatgtgttgttcttttcacattctgattagtcttttacatttctatcgatttccaaaaaaaaactttttgccaacgccctaaaaccgcccaaaactccgacacccacatttgtaaaaaattgttgggaatttttttcataaatttattagtttattatttattataaatttaagtttatatcgatttgccgacaacatattttaattttttttctcattttatcttttatctatcgatatcccagaaaaattgtgcaatttcgcattcacactagctgagtaacgggtatctgatagtcgggaaactcgactatagcattctctctttttgaaattgcgg
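As pasted here, the moblist shows no `>` before the record name, and the sequence follows on the same line. That may just be how the page rendered the file, but it is cheap to rule out. Below is a minimal sketch of a sanity check, assuming the reference is meant to be standard FASTA (a `>` header line followed by sequence lines). `check_fasta` is a hypothetical helper, not part of Tangram, and nothing in this thread confirms that a malformed FASTA is what causes the crash.

```python
import re

# IUPAC nucleotide codes plus gap/stop characters, upper or lower case.
_SEQ_RE = re.compile(r"^[ACGTUNRYKMSWBDHVacgtunrykmswbdhv*-]+$")


def check_fasta(lines):
    """Return a list of human-readable problems found in FASTA text lines.

    An empty list means the basic structure looks fine. This only checks
    layout (headers, sequence characters, empty records), not content.
    """
    problems = []
    name = None      # name of the record currently being read
    seq_len = 0      # bases seen so far for that record
    for n, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None and seq_len == 0:
                problems.append(f"record '{name}' has no sequence")
            parts = line[1:].split()
            name = parts[0] if parts else ""
            seq_len = 0
        elif name is None:
            # Data appears before any header line: not FASTA as expected.
            problems.append(f"line {n}: sequence data before any '>' header")
            return problems
        elif not _SEQ_RE.match(line):
            problems.append(f"line {n}: unexpected characters in sequence")
        else:
            seq_len += len(line)
    if name is None:
        problems.append("no '>' header found")
    elif seq_len == 0:
        problems.append(f"record '{name}' has no sequence")
    return problems
```

Calling `check_fasta(open("moblist_ine_only.fasta"))` should return an empty list for a well-formed file; any message it returns points at the first thing to fix before rerunning tangram_bam.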

I generate my BAM file with bwa, using the option -a to keep reads where only one mate of the pair maps to the genome (since this appears necessary for Tangram?). These are the command-line options I use:

bwa mem -M -a -R

Then I remove duplicates, sort, and index using Picard Tools. I also merge several BAMs together, because I have a single sample that was used to generate several libraries. Then I run tangram_bam on the merged BAM:

mySoftwarePath/Tangram/bin/tangram_bam -i myDataPath/MA_6.merged.dedup.bam -r myDataPath/moblist_ine_only.fasta -o myDataPath/MA_6.merged.dedup.tangram.bam

Then I sort the output:

mySoftwarePath/java -Xmx2g -jar mySoftwarePath/picard-tools-1.105/SortSam.jar INPUT=myDataPath/MA_6.merged.dedup.tangram.bam OUTPUT=myDataPath/MA_6.merged.dedup.sorted.tangram.bam SORT_ORDER=coordinate VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=TRUE

Next I generate my file list, tangramBamList.txt, which contains only:

myDataPath/MA_6.merged.dedup.sorted.tangram.bam

Then I run tangram_scan:

mySoftwarePath/Tangram/bin/tangram_scan -in myDataPath/tangramBamList.txt -dir myDataPath/tangramOut

And I get the error: Segmentation fault (core dumped)

This is a sample of what my BAM file looks like:

D4LHBFN1:293:C3L3LACXX:2:2213:20193:18303 107 YHet 1 60 16S48M2S = 1 38 CTACGGTTGTCTCAGCAGGGTCACGTAATGCTGATCCAGTCTTGTTTTTATTTTCATTCATGTTGT BHGHIIIIG@HGG GDGIIGI:BDFHDFEGGG<FGHGIIIBHHFHCDHIIGHIFEHFHFFEDE?CCE PG:Z:MarkDuplicates RG:Z:140307_PINKERTON_0293_BC3L3LACXX_L2 NM:i:0 AS:i:48 XS:i:0 ZA:Z:<@;60;;;1;;><&;60;;;1;;>
D4LHBFN1:293:C3L3LACXX:2:2213:20193:18303 151 YHet 1 60 28S38M = 1 -38 ATATGGTGTTTCCTACGGTTGTCTCCGCAGGGTCACGTAATGCTGATCCAGTCTTGTTTTTATTTT CDCDDDCADDDBDDDDFFHEH HB;-'GHGGDHDB2HBIGGHCGCEGIJJJJJJIJIJJJJJIHEBA PG:Z:MarkDuplicates RG:Z:140307_PINKERTON_0293_BC3L3LACXX_L2 NM:i:0 AS:i:38 XS:i:0 ZA:Z:&;60;;;1;;><@;60;;;1;;
D4LHBFN1:293:C3L3LACXX:2:2313:15215:43919 147 YHet 10 60 83M = 21 -72 TAATGCTGATCCAGTCTTGTTTTTATTTTCATTCATGTTGTTGCTCTTGCTTTGATTCCGACTTCTAACGTTTAACCTGTGAT DDDDD DDDDDDCCDDEDDDFFFFFFGHHHHHJJJJJJJJJJJIJJJJIJHJIIIIJJJJHHJIJIIJJJJJJJHHJJIJJJII PG:Z:MarkDuplicates.3 RG:Z:140307_PINKERTON_0293_BC3L3LACXX_L2.3 NM:i:3 AS:i:68 XS:i:20 ZA:Z:&;60;;;1;;><@;60;;;1;;
D4LHBFN1:293:C3L3LACXX:2:2314:3464:7166 99 YHet 17 60 82M = 56 122 GATCCAGTCTTGTTTTTATTTTCATTCATGTTGTTGCTCTTGCTTTGATTCCGACTTCTAACGTTTAACCTGTGATCAGACG AEDHGGIEFHHHH HIICAGGIIFE>DFDHHHGEHHIIIG@FGGGGIIIIIG@HIHIIIGHFHGEFFFF@@EECEEA;>CCCC PG:Z:MarkDuplicates.1 RG:Z:140307_PINKERTON_0293_BC3L3LACXX_L2.1 NM:i:3 AS:i:67 XS:i:20 ZA:Z:<@;60;;;1;;><&;60;;;1;;>
D4LHBFN1:293:C3L3LACXX:2:2313:15215:43919 99 YHet 21 60 81M = 10 72 CAGTCTTGTTTTTATTTTCATTCATGTTGTTGCTCTTGCTTTGATTCCGACTTCTAACGTTTAACCTGTGATCAGACGTTT JIJHH
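To test the suspicion that the ZA tags are not being populated, the values can be tallied over the whole file instead of judged from a few records. Below is a minimal sketch, assuming tab-separated SAM text such as the output of samtools view. `summarize_za` is a hypothetical helper, and it deliberately makes no assumption about the internal layout of the ZA string.

```python
from collections import Counter


def summarize_za(sam_lines):
    """Count how often each ZA:Z value occurs in SAM text records.

    Records without a ZA tag are counted under the key None.
    Header lines (starting with '@') and blank lines are skipped.
    """
    counts = Counter()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        za = None
        # In SAM, optional TAG:TYPE:VALUE fields start at column 12.
        for tag in fields[11:]:
            if tag.startswith("ZA:Z:"):
                za = tag[len("ZA:Z:"):]
                break
        counts[za] += 1
    return counts
```

One way to use it is to pipe `samtools view myDataPath/MA_6.merged.dedup.tangram.bam` into a small script that calls `summarize_za(sys.stdin)` and prints `most_common(10)`. If nearly every read carries the same handful of values, or a large share falls under `None`, that supports the idea that tangram_bam is not annotating the reads as intended.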

inti commented 9 years ago

Same here. After running

gkno tangram-bam --in bams/93-968.bam --mobile-element-fasta repeats/test_me.fa --out 93-968.tangram.bam --region Chr19 

I get the segmentation fault error:

sh-4.2$ /home/shared/app/gkno_launcher/tools/Tangram/bin/tangram_scan -in /home/ipedroso/ANALYSES/MEI/Populus/file_list.text -dir tangram_out 
Segmentation fault

From the BAM file header:

@PG     ID:bwa  PN:bwa  VN:0.5.9-r16
@PG     ID:tangram_bam  CL:/home/shared/app/gkno_launcher/tools/Tangram/bin/tangram_bam --ref repeats/test_me.fa --input bams/93-968.bam --target-ref-name Chr19 --output /home/ipedroso/ANALYSES/MEI/Populus/93-968_ZA.bam
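The two reports in this thread come from different BWA setups (`bwa mem` in the first, bwa 0.5.9-r16 in this header), so it may help to record exactly which programs produced each BAM when comparing failures. Below is a minimal sketch that pulls the @PG records out of header text such as the output of `samtools view -H`. `parse_pg` is a hypothetical helper, and it assumes the fields are tab-separated as in a real SAM header.

```python
def parse_pg(header_lines):
    """Return one dict per @PG header line, mapping tag -> value.

    Fields are split on tabs only, so a CL (command line) value that
    contains spaces is kept intact.
    """
    programs = []
    for line in header_lines:
        if not line.startswith("@PG"):
            continue
        record = {}
        for field in line.rstrip("\n").split("\t")[1:]:
            # Split on the first colon only; values may contain colons.
            tag, _, value = field.partition(":")
            record[tag] = value
        programs.append(record)
    return programs
```

Printing the `ID`, `VN`, and `CL` entries for each BAM in the tangram_scan file list gives a compact record of the aligner version and the tangram_bam options to attach to a bug report.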

I have not tried re-aligning this data using MOSAIK.

AlistairNWard commented 9 years ago

I have also observed seg faults running on bwa data and am not sure what the cause of the problem is. If you don't have massive amounts of data, I would recommend aligning with Mosaik, since this is what Tangram was designed to work with. If you need any assistance, please let me know (AlistairNWard@gmail.com) and I can help get the Mosaik alignments and Tangram running. In particular, we have a pipeline system (gkno) that helps with running larger pipelines and also makes it possible to build your own pipelines for repeated or similar analyses.
