Hi, thank you for reading my question. I am confused about how prepDE.py extracts read counts from coverage. The StringTie manual says it "generates two CSV files containing the count matrices for genes and transcripts, using the coverage values found in the output of stringtie -e". I therefore assumed it takes the coverage value directly from the StringTie output, but the counts it produces are completely different from the coverage values. When I checked the prepDE.py script, I found it computes the count like this: "transcriptList.append((g_id, t_id, int(ceil(coverage*transcript_len/read_len))))"
For example:
The output of stringtie is (the read length was set to 75 at sequencing):
t_id chr strand start end t_name num_exons length gene_id gene_name cov FPKM
4 LK031787 + 14626 16350 CDX67400 7 895 MSTRG.1 BnaA07g14400D 3.194972 1.396387
5 LK031787 - 16350 18172 MSTRG.2.1 8 1262 MSTRG.2 . 3.873593 1.692984
gene_id | sample output
MSTRG.1 | 39
MSTRG.2 | 66
The output is consistent with "int(ceil(coverage*transcript_len/read_len))", but I don't understand why it is calculated this way. Do other read-counting methods also calculate counts like this?
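For reference, here is the quick sanity check I did, plugging the cov and length values from the table above into the prepDE.py formula (read length 75); it reproduces the sample output exactly:

```python
from math import ceil

READ_LEN = 75  # read length used at sequencing

# (gene_id, coverage, transcript length) taken from the stringtie -e output above
transcripts = [
    ("MSTRG.1", 3.194972, 895),
    ("MSTRG.2", 3.873593, 1262),
]

for gene_id, cov, length in transcripts:
    # same expression prepDE.py uses to convert coverage to a read count
    count = int(ceil(cov * length / READ_LEN))
    print(gene_id, count)

# prints:
# MSTRG.1 39
# MSTRG.2 66
```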
I am totally new to RNA-seq analysis. I really appreciate any help.