gpertea / stringtie

Transcript assembly and quantification for RNA-Seq
MIT License

How does Prepde.py extract count #208

Open qinqin1995 opened 5 years ago

qinqin1995 commented 5 years ago

Hi, thank you for reading my question. I am confused about how prepDE.py extracts read counts from coverage. The StringTie manual says it "generates two CSV files containing the count matrices for genes and transcripts, using the coverage values found in the output of stringtie -e". I therefore assumed that it copies the coverage value directly from the StringTie output, but the values in the count matrix are completely different from the coverage values. I then checked the prepDE.py script and found that it computes the count this way: "transcriptList.append((g_id, t_id, int(ceil(coverage*transcript_len/read_len))))"

For example:

The output of StringTie is shown below (the sequencing read length was 75):

t_id | chr | strand | start | end | t_name | num_exons | length | gene_id | gene_name | cov | FPKM
4 | LK031787 | + | 14626 | 16350 | CDX67400 | 7 | 895 | MSTRG.1 | BnaA07g14400D | 3.194972 | 1.396387
5 | LK031787 | - | 16350 | 18172 | MSTRG.2.1 | 8 | 1262 | MSTRG.2 | . | 3.873593 | 1.692984

gene_id | sample output
MSTRG.1 | 39
MSTRG.2 | 66

These output values are consistent with "int(ceil(coverage*transcript_len/read_len))", but I don't understand why the count is calculated this way. Do other read-counting methods calculate it the same way?
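To make the arithmetic concrete, here is a minimal sketch that applies the quoted expression to the two example rows. The function name `estimate_read_count`, the `rows` list, and the `READ_LEN` constant are hypothetical names used only for illustration; they are not taken from prepDE.py, and only the formula itself comes from the script line quoted above.

```python
from math import ceil


def estimate_read_count(coverage, transcript_len, read_len):
    """Apply the expression quoted from the script:
    int(ceil(coverage * transcript_len / read_len)).
    """
    return int(ceil(coverage * transcript_len / read_len))


# (gene_id, cov, length) copied from the two example rows in the question.
rows = [
    ("MSTRG.1", 3.194972, 895),
    ("MSTRG.2", 3.873593, 1262),
]

# Read length stated in the question (75).
READ_LEN = 75

counts = {
    gene_id: estimate_read_count(cov, length, READ_LEN)
    for gene_id, cov, length in rows
}

for gene_id, count in counts.items():
    print(gene_id, count)
```

For MSTRG.1 this gives 3.194972 * 895 / 75 ≈ 38.13, which rounds up to 39. For MSTRG.2 it gives 3.873593 * 1262 / 75 ≈ 65.18, which rounds up to 66. Both match the sample output.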

I am totally new to RNA-seq analysis and would really appreciate any help.