gpertea / stringtie

Transcript assembly and quantification for RNA-Seq
MIT License

Multiple ref_gene_name for a given gene_id in merged GTF #286

Open dantaki opened 4 years ago

dantaki commented 4 years ago
stringtie --merge stringtie_gtf.list -G Mus_musculus.GRCm38.84.gtf -o stringtie_merged.gtf

I ran this command on HISAT2-aligned RNA-seq data from mouse and noticed that many gene_id values in the merged GTF are associated with more than one reference gene name.

For example:

grep "MSTRG.13821" stringtie_merged.gtf  | grep "transcript\t" -P 
19  StringTie   transcript  4907229 4928016 1000    -   .   gene_id "MSTRG.13821"; transcript_id "MSTRG.13821.1"; 
19  StringTie   transcript  4907229 4928287 1000    -   .   gene_id "MSTRG.13821"; transcript_id "MSTRG.13821.2"; 
19  StringTie   transcript  4907229 4928287 1000    -   .   gene_id "MSTRG.13821"; transcript_id "ENSMUST00000025851"; gene_name "Dpp3"; ref_gene_id "ENSMUSG00000063904"; 
19  StringTie   transcript  4907232 4923318 1000    -   .   gene_id "MSTRG.13821"; transcript_id "MSTRG.13821.4"; 
19  StringTie   transcript  4907233 4943127 1000    -   .   gene_id "MSTRG.13821"; transcript_id "MSTRG.13821.5"; 
19  StringTie   transcript  4929746 4943127 1000    -   .   gene_id "MSTRG.13821"; transcript_id "MSTRG.13821.6"; 
19  StringTie   transcript  4930651 4943127 1000    -   .   gene_id "MSTRG.13821"; transcript_id "ENSMUST00000120475"; gene_name "Peli3"; ref_gene_id "ENSMUSG00000024901"; 
19  StringTie   transcript  4931856 4943127 1000    -   .   gene_id "MSTRG.13821"; transcript_id "ENSMUST00000025834"; gene_name "Peli3"; ref_gene_id "ENSMUSG00000024901"; 
19  StringTie   transcript  4934986 4938628 1000    -   .   gene_id "MSTRG.13821"; transcript_id "ENSMUST00000139436"; gene_name "Peli3"; ref_gene_id "ENSMUSG00000024901"; 
19  StringTie   transcript  4935012 4943127 1000    -   .   gene_id "MSTRG.13821"; transcript_id "ENSMUST00000146289"; gene_name "Peli3"; ref_gene_id "ENSMUSG00000024901"; 
19  StringTie   transcript  4935016 4943127 1000    -   .   gene_id "MSTRG.13821"; transcript_id "ENSMUST00000133254"; gene_name "Peli3"; ref_gene_id "ENSMUSG00000024901"; 
19  StringTie   transcript  4935023 4943127 1000    -   .   gene_id "MSTRG.13821"; transcript_id "ENSMUST00000133504"; gene_name "Peli3"; ref_gene_id "ENSMUSG00000024901"; 
19  StringTie   transcript  4941600 4943127 1000    -   .   gene_id "MSTRG.13821"; transcript_id "ENSMUST00000143930"; gene_name "Peli3"; ref_gene_id "ENSMUSG00000024901"; 

grep "Dpp3" founder_stringtie_merged.gtf  | grep "transcript\t" -P 
19  StringTie   transcript  4907229 4928287 1000    -   .   gene_id "MSTRG.13821"; transcript_id "ENSMUST00000025851"; gene_name "Dpp3"; ref_gene_id "ENSMUSG00000063904"; 

grep "Peli3" founder_stringtie_merged.gtf  | grep "transcript\t" -P 
19  StringTie   transcript  4930651 4943127 1000    -   .   gene_id "MSTRG.13821"; transcript_id "ENSMUST00000120475"; gene_name "Peli3"; ref_gene_id "ENSMUSG00000024901"; 
19  StringTie   transcript  4931856 4943127 1000    -   .   gene_id "MSTRG.13821"; transcript_id "ENSMUST00000025834"; gene_name "Peli3"; ref_gene_id "ENSMUSG00000024901"; 
19  StringTie   transcript  4934986 4938628 1000    -   .   gene_id "MSTRG.13821"; transcript_id "ENSMUST00000139436"; gene_name "Peli3"; ref_gene_id "ENSMUSG00000024901"; 
19  StringTie   transcript  4935012 4943127 1000    -   .   gene_id "MSTRG.13821"; transcript_id "ENSMUST00000146289"; gene_name "Peli3"; ref_gene_id "ENSMUSG00000024901"; 
19  StringTie   transcript  4935016 4943127 1000    -   .   gene_id "MSTRG.13821"; transcript_id "ENSMUST00000133254"; gene_name "Peli3"; ref_gene_id "ENSMUSG00000024901"; 
19  StringTie   transcript  4935023 4943127 1000    -   .   gene_id "MSTRG.13821"; transcript_id "ENSMUST00000133504"; gene_name "Peli3"; ref_gene_id "ENSMUSG00000024901"; 
19  StringTie   transcript  4941600 4943127 1000    -   .   gene_id "MSTRG.13821"; transcript_id "ENSMUST00000143930"; gene_name "Peli3"; ref_gene_id "ENSMUSG00000024901"; 

So there are no other annotations for Dpp3 or Peli3, and if I use the merged GTF for analysis, I cannot distinguish between the two genes because they share the same gene_id.
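
To see how widespread this is beyond MSTRG.13821, the following is a minimal Python sketch that lists every gene_id whose transcripts point to more than one reference gene. It assumes a tab-separated, nine-column GTF with attributes written as `key "value";`, as in the output above. The helper names (`parse_attributes`, `genes_per_locus`, `ambiguous_loci`) are hypothetical and not part of StringTie.

```python
import re
from collections import defaultdict

# Matches attribute pairs such as: gene_id "MSTRG.13821";
ATTR_RE = re.compile(r'(\w+) "([^"]*)"')


def parse_attributes(field):
    """Turn the GTF attribute column into a dict of key -> value."""
    return dict(ATTR_RE.findall(field))


def genes_per_locus(lines):
    """Map each gene_id to the set of (ref_gene_id, gene_name) pairs
    found on its transcript records."""
    loci = defaultdict(set)
    for line in lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "transcript":
            continue  # only transcript records are inspected
        attrs = parse_attributes(cols[8])
        if "ref_gene_id" in attrs:
            pair = (attrs["ref_gene_id"], attrs.get("gene_name", ""))
            loci[attrs["gene_id"]].add(pair)
    return loci


def ambiguous_loci(lines):
    """Keep only the gene_ids that cover more than one reference gene."""
    return {g: refs for g, refs in genes_per_locus(lines).items() if len(refs) > 1}


# Three transcript records taken from the output above (tab-separated here).
example = [
    '19\tStringTie\ttranscript\t4907229\t4928016\t1000\t-\t.\t'
    'gene_id "MSTRG.13821"; transcript_id "MSTRG.13821.1";',
    '19\tStringTie\ttranscript\t4907229\t4928287\t1000\t-\t.\t'
    'gene_id "MSTRG.13821"; transcript_id "ENSMUST00000025851"; '
    'gene_name "Dpp3"; ref_gene_id "ENSMUSG00000063904";',
    '19\tStringTie\ttranscript\t4930651\t4943127\t1000\t-\t.\t'
    'gene_id "MSTRG.13821"; transcript_id "ENSMUST00000120475"; '
    'gene_name "Peli3"; ref_gene_id "ENSMUSG00000024901";',
]

for gene_id, refs in ambiguous_loci(example).items():
    print(gene_id, sorted(refs))
```

To run it on a real file, pass the open file handle instead of the list, for example `ambiguous_loci(open("stringtie_merged.gtf"))`.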

Is the solution simply to use the ref_gene_id in place of the gene_id?
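
For concreteness, this is roughly what that replacement could look like, as a hedged Python sketch rather than a confirmed fix. It assumes the same tab-separated layout as above and uses hypothetical helper names (`attr`, `relabel_gene_ids`). Novel transcripts without a ref_gene_id keep their MSTRG gene_id, so a transcript that spans both genes (such as MSTRG.13821.5) would still not be assigned to either one.

```python
import re


def attr(field, key):
    """Return the value of one attribute from a GTF attribute column, or None."""
    m = re.search(rf'\b{re.escape(key)} "([^"]*)"', field)
    return m.group(1) if m else None


def relabel_gene_ids(lines):
    """Replace gene_id with ref_gene_id for every record whose transcript
    has a reference gene; leave all other records unchanged."""
    lines = [line.rstrip("\n") for line in lines]

    # Pass 1: remember which transcript_id belongs to which reference gene.
    tx_to_ref = {}
    for line in lines:
        cols = line.split("\t")
        if len(cols) == 9:
            tx = attr(cols[8], "transcript_id")
            ref = attr(cols[8], "ref_gene_id")
            if tx and ref:
                tx_to_ref[tx] = ref

    # Pass 2: rewrite gene_id on all records of those transcripts
    # (looked up by transcript_id, so exon records are covered too).
    out = []
    for line in lines:
        cols = line.split("\t")
        if len(cols) == 9:
            ref = tx_to_ref.get(attr(cols[8], "transcript_id"))
            if ref:
                # \b keeps the pattern from matching inside "ref_gene_id".
                cols[8] = re.sub(
                    r'\bgene_id "[^"]*"',
                    lambda m: f'gene_id "{ref}"',
                    cols[8],
                    count=1,
                )
        out.append("\t".join(cols))
    return out


# Three transcript records taken from the output above (tab-separated here).
example = [
    '19\tStringTie\ttranscript\t4907229\t4928016\t1000\t-\t.\t'
    'gene_id "MSTRG.13821"; transcript_id "MSTRG.13821.1";',
    '19\tStringTie\ttranscript\t4907229\t4928287\t1000\t-\t.\t'
    'gene_id "MSTRG.13821"; transcript_id "ENSMUST00000025851"; '
    'gene_name "Dpp3"; ref_gene_id "ENSMUSG00000063904";',
    '19\tStringTie\ttranscript\t4930651\t4943127\t1000\t-\t.\t'
    'gene_id "MSTRG.13821"; transcript_id "ENSMUST00000120475"; '
    'gene_name "Peli3"; ref_gene_id "ENSMUSG00000024901";',
]

for record in relabel_gene_ids(example):
    print(record)
```

Whether downstream tools accept a GTF that mixes MSTRG and Ensembl gene_id values is not something this sketch checks.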

Thank you