mbaughn / rna-star

Automatically exported from code.google.com/p/rna-star

Bug with ISIZE numbers in .sam? #21

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?

1. Run STAR with these parameters:

star --genomeDir $GENOME_DIR \
--readFilesIn $READS1 $READS2 \
--runThreadN $TREADS \
--genomeLoad LoadAndKeep \
--alignIntronMax 500000 \
--alignMatesGapMax 500000 \
--outFileNamePrefix $OUT/ \
--outFilterMultimapNmax 6 \
--outFilterMismatchNmax 3 \
--outFilterMismatchNoverLmax 0.05 \
--outFilterMatchNmin 16 \
--outFilterScoreMinOverLread 0 \
--outFilterMatchNminOverLread 0 \
--outSAMunmapped None \
--outReadsUnmapped Fastx \
--sjdbFileChrStartEnd $GENOME_DIR \
--sjdbOverhang $SJ_DB_OVERHANG \
--chimSegmentMin $CHIM_SEGMENT_MIN \
--chimScoreMin $CHIM_SCORE_MIN \
--clip3pAdapterSeq TCGTATGCCGTCTTCTGCTTG \
--clip3pAdapterMMp 0.1

2. Run htseq-count as the read counter on the resulting alignments. In some 
cases htseq-count fails and reports an error.

What is the expected output? What do you see instead?

htseq-count normally produces a table with counts for each gene ID.
Instead, I got errors similar to this one:

Error occured when processing SAM input (line 3773018 of file 
/III_pREP_Input/star/Aligned.out.sam):
  Python int too large to convert to C long
  [Exception type: OverflowError, raised in _HTSeq.pyx:1313]

That line does not look right at all: the ISIZE field holds a huge number, 
which here is also merged with the read sequence:

HWI-ST1149:193:C4309ACXX:3:1114:1490:42590  355 chr2    27274108    0   28S23M  =   27274082    18446744073709551615CCGGGGGGATTAGCTCCAATGGTAGAGCCTCGCTTGGCTTGCGAGAGGTAG =?=DBDD:0:>>AAA3((383>388(8>=A87:<==<AA?:?:0055=339    NH:i:5  HI:i:4  AS:i:28 nM:i:2
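
A note on the number itself: 18446744073709551615 equals 2^64 - 1, which is how -1 appears when a signed 64-bit value is printed as unsigned. Whether STAR actually computed a small negative value here is an assumption; nothing in this report confirms it. The sketch below (Python, helper name `as_signed_64` is hypothetical) shows that reinterpretation and why the value cannot fit into a C long, assuming a 64-bit Linux system where C long is 64 bits wide.

```python
UINT64_MODULUS = 2**64
# Assumption: 64-bit Linux, where a C long is a signed 64-bit integer.
C_LONG_MAX = 2**63 - 1


def as_signed_64(value: int) -> int:
    """Reinterpret an unsigned 64-bit integer as two's-complement signed."""
    if not 0 <= value < UINT64_MODULUS:
        raise ValueError(f"{value} does not fit in 64 bits")
    return value - UINT64_MODULUS if value > C_LONG_MAX else value


raw = 18446744073709551615  # value from the SAM line above
print(raw > C_LONG_MAX)     # True: too large for a signed 64-bit C long
print(as_signed_64(raw))    # -1
```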

What version of the product are you using? On what operating system?

star231z1

Please provide any additional information below.

See other examples of problematic lines from .sam files below:

HWI-ST1149:193:C4309ACXX:3:1104:17209:8771  339 chr14   70236478    3   28S16M7S    =   69795937 -440557 ATTGCTCTCGTTACCTCGGGAATTGAGGTTCCGAATAAGAGGTCATTGGCG HJJJJIIIJJHHFJIIJJJJJJJJJJJJJJJJJJJJJJHHHHHFFFDDB:B NH:i:2  HI:i:2  AS:i:20 nM:i:0

HWI-ST1149:193:C4309ACXX:3:1109:17901:52090 355 chr2    70230204    0   22S15M14S   =   70230182    18446744073709551611    GGGGCAATACAGAATGTTCGTCGAGTTAAATCCTCTGTAGACGACTTAAAT BBCDFFFFHHHHGIJJIIJJJJJEHGHJIJJHIIIJJIJIGlsHJJJJJIHIJ NH:i:5  HI:i:2  AS:i:16 nM:i:0

HWI-ST1149:193:C4309ACXX:3:1114:1490:42590  355 chr2    27274108    0   28S23M  =   27274082    18446744073709551615CCGGGGGGATTAGCTCCAATGGTAGAGCCTCGCTTGGCTTGCGAGAGGTAG =?=DBDD:0:>>AAA3((383>388(8>=A87:<==<AA?:?:0055=339    NH:i:5  HI:i:4  AS:i:28 nM:i:2

HWI-ST1149:193:C4309ACXX:1:2105:10512:63315 355 chr2    70230204    1   35S15M1S    =   70230182 18446744073709551611    GCGACGATATTTCACCACAATACAGAATGTTGGTCGAGTTAAATCCTCTGT BB@DFFFFHHHHDIJIJJJIJJJJJIJIJHIIIHIJFIFHIJJIIJIIJIC    NH:i:4  HI:i:4  AS:i:16 nM:i:0
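
Lines like these can be located before handing the file to a counter. The following is a minimal sketch, not part of STAR or HTSeq: the function names `tlen_problem` and `scan` are hypothetical, and the check assumes a tab-separated SAM file whose ninth field (ISIZE/TLEN) must be an integer in the range the SAM specification allows, [-2^31+1, 2^31-1].

```python
import re

# TLEN limits as given in the SAM specification.
TLEN_MIN = -(2**31) + 1
TLEN_MAX = 2**31 - 1
_INTEGER = re.compile(r"^-?\d+$")


def tlen_problem(sam_line):
    """Describe what is wrong with field 9 of one SAM line, or return None."""
    if sam_line.startswith("@"):  # header line, nothing to check
        return None
    fields = sam_line.rstrip("\n").split("\t")
    if len(fields) < 11:
        return f"only {len(fields)} fields, expected at least 11"
    tlen = fields[8]
    if not _INTEGER.match(tlen):
        return f"TLEN is not an integer: {tlen[:30]!r}"
    if not TLEN_MIN <= int(tlen) <= TLEN_MAX:
        return f"TLEN out of range: {tlen}"
    return None


def scan(path):
    """Yield (line_number, problem) for every suspicious alignment line."""
    with open(path) as handle:
        for number, line in enumerate(handle, start=1):
            problem = tlen_problem(line)
            if problem:
                yield number, problem
```

Calling `scan("Aligned.out.sam")` and printing each pair would list the line numbers to inspect; both the out-of-range values and the value fused with the read sequence are flagged.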

Thanks for your help.

Vladimir

Original issue reported on code.google.com by vkurys...@yahoo.com on 4 Apr 2014 at 1:09

GoogleCodeExporter commented 8 years ago
Hi Vladimir,

Thanks for reporting this error; it looks like a bug.
Could you please extract the sequences of both mates from the fastq files for
one of the problematic reads and send them to me?
I will try to replicate the problem. Also, please let me know which genome
you are mapping to.

For a faster response, please post your questions in the STAR forum 
https://groups.google.com/d/forum/rna-star, or e-mail me directly at 
dobin@cshl.edu 

Cheers
Alex

Original comment by adobin@gmail.com on 8 Apr 2014 at 6:17

GoogleCodeExporter commented 8 years ago
Hi Alex,

Thanks for your response!

This time I tested with the star231o version. Sorry, I don't have my previous 
outputs at hand right now; I hope to send them later. Mapping was done against hg19.

Here is one problematic line in the SAM file:
HWI-ST1149:193:C4309ACXX:5:2204:20157:21098 99  chr1    26140610    3   34S17M  =   26140585    18446744073709551614    TGTTCTGCAGTTCCTCCAGAAGCTGGCTGGCCCTCACCTGGAGAAGTACAG @@@DDDDABF?AFFHEGEHGFGGH@@EFHE8CFFHEDDHB;BB?DE8BF>?   NH:i:2  HI:i:2  AS:i:20 nM:i:2

Here are the two corresponding reads from the input fastq files:

@HWI-ST1149:193:C4309ACXX:5:2204:20157:21098 1:N:0:GCCAAT
TGTTCTGCAGTTCCTCCAGAAGCTGGCTGGCCCTCACCTGGAGAAGTACAG
+
@@@DDDDABF?AFFHEGEHGFGGH@@EFHE8CFFHEDDHB;BB?DE8BF>?
@HWI-ST1149:193:C4309ACXX:5:2204:20157:21098 2:N:0:GCCAAT
TGGAAGCTGTACTTCTCCAGGTGAGGGCCAGCCAGCTTCTGGAGGAACTGC
+
+:=A?BDDF?DFFGIIFIFIIFFEFFFIBGEGGF?BGIIIIIGIEIIIIII
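
Records like the two above can be pulled out of a fastq file by read name. This is a minimal sketch under stated assumptions: the function name `find_fastq_record` is hypothetical, the file is uncompressed, and every record occupies exactly four lines. Reading in fixed four-line steps, rather than searching for lines that start with "@", matters because quality strings may themselves begin with "@" or "+", as both records above show.

```python
def find_fastq_record(path, read_name):
    """Return (header, sequence, quality) for the first record whose name
    matches read_name, or None if the file contains no such record.

    Assumes an uncompressed FASTQ file with exactly four lines per record.
    """
    with open(path) as handle:
        while True:
            header = handle.readline()
            if not header:  # end of file reached
                return None
            sequence = handle.readline().rstrip("\n")
            handle.readline()  # the '+' separator line
            quality = handle.readline().rstrip("\n")
            # The read name is the first whitespace-delimited token after '@'.
            parts = header[1:].split()
            if parts and parts[0] == read_name:
                return header.rstrip("\n"), sequence, quality
```

Running it once on each of the two mate files with the same read name returns both mates of the pair.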

You can find a few more examples in the attachment.

I hope it helps.

Vladimir

Original comment by kurysh...@gmail.com on 9 Apr 2014 at 5:23


GoogleCodeExporter commented 8 years ago
Dear Alex,

Is there any chance of moving forward with this issue? Sorry to be impatient; I 
hope you will have time to take a look at the problem.

Thanks.

Vladimir

Original comment by vkurys...@yahoo.com on 15 Apr 2014 at 9:56

GoogleCodeExporter commented 8 years ago
Dear Alex,

Your new patch (STAR_2.3.1y1) has solved the problem, and I now get all my read 
counts from STAR-generated alignments with HTSeq-count without any errors.

Thanks a lot for your great work and your kind and quick support!

Vladimir

Original comment by vkurys...@yahoo.com on 17 Apr 2014 at 9:09