gpertea / stringtie

Transcript assembly and quantification for RNA-Seq
MIT License
361 stars 76 forks source link

I wonder 'else' clause should be commented out (or deleted) in prepDE.py3, in the block of badGenes check #400

Open mickeykawai opened 1 year ago

mickeykawai commented 1 year ago

I wonder 'else' clause (at l.172-173 of version v2.2.1 Latest on Jan 26, 2022) should be commented out (or deleted) in the block of badGenes check in prepDE.py3.

for s in samples:
    badGenes=[] #list of bad genes (just ones that aren't MSTRG)
    try:
...
    except StopIteration:
        warnings.warn("Didn't get a GTF in that directory. Looking in another...")
    else: #we found the "bad" genes!
        break

The present version with the else clause results in setting geneIDs[t_id]=g_id in the for loop only for the first sample but not the rest of the samples. When there are expression differences among alternative splicing variants among samples, the number of the gene of some samples can be fewer in gene_count_matrix.csv than the sum of splicing variants seen in transcript_count_matrix.csv. Then at l.279,

          geneDict[geneIDs[i]][s[0]]+=v[s[0]]

when the geneIDs does not have corresponding record of g_id for a t_id, <class 'str'> will be set (and appear at the last line of gene_count_matrix.csv.

Example of my observation for a gene is as follows.

With the present version:

$ grep '\bSR34\b' gene_count_matrix.csv 
MSTRG.110|SR34,99,29,0,50,0,0,24,96,24   # <- the value of the last 2 columns, i.e. 96 and 24, is not the sum of the count of 3 transcripts. 
$ grep AT1G02840 transcript_count_matrix.csv 
AT1G02840.1,44,29,0,50,0,0,24,96,24
AT1G02840.2,55,0,0,0,0,0,0,0,0
AT1G02840.3,,0,0,0,0,0,0,18,32
$ 

After comment out of the else block:

$ grep '\bSR34\b' gene_count_matrix.csv 
MSTRG.110|SR34,99,29,0,50,0,0,24,114,56
$ grep AT1G02840 transcript_count_matrix.csv 
AT1G02840.1,44,29,0,50,0,0,24,96,24
AT1G02840.2,55,0,0,0,0,0,0,0,0
AT1G02840.3,,0,0,0,0,0,0,18,32
$