Merged.gtf file issue with identical gene_id for two different gene_name

JFourquet commented 7 years ago

Hi,

I want to use Stringtie for transcriptome assembly. I have 33 samples (3 for each condition) and the pipeline used is:

HISAT2 2.0.5 with the commande line: hisat2 -p 2 --no-softclip -q --rna-strandness RF --dta -x mm10/genome -1 trimmed_lib1_R1.fastq.gz -2 trimmed_lib1_R2.fastq.gz -S trimmed_lib1.sam"

Samtools view: samtools view -bS trimmed_lib1.sam > lib1.bam

Samtools sort: samtools sort lib1.bam -o lib1.sorted.bam

Stringtie 1.3.3: stringtie-1.3.3b.Linux_x86_64/stringtie lib1.sorted.bam -o lib1.gtf -p 2 -m 150 -G gencode.vM13.annotation.gtf --rf

I have tested here cuffmerge and stringtie --merge to create a merged.gtf file of all .gtf files.

Cuffmerge command line: cuffmerge -o denovo_Stringtie_200nt -g gencode.vM13.annotation.gtf -s Genome/Bowtie-index/mm10.fa assembly_GTF_list_Stringtie_200nt_170522.txt -p 3" with assembly_GTF_list_Stringtie_200nt_170522.txt the list of all of 33 .gtf files generated with Stringtie.

Stringtie --merge command line: stringtie-1.3.3b.Linux_x86_64/stringtie --merge (33 .gtf files) -o merged_stringtie200nt_170808.gtf -p 2 -G gencode.vM13.annotation.gtf

The issue is in the two merged.gtf files generated separately with cuffmerge and stringtie --merge I have genes which have the same gene_id but which are not the same genes, for example :

115 chr1 4807830 4841286 33457 + StringTie transcript 1000 NA MSTRG.17 ENSMUST00000150971.7 Lypla1 ENSMUSG00000025903.14 116 chr1 4807830 4807982 153 + StringTie exon 1000 NA MSTRG.17 ENSMUST00000150971.7 Lypla1 ENSMUSG00000025903.14 1 140 chr1 4807892 4886770 78879 + StringTie transcript 1000 NA MSTRG.17 ENSMUST00000155020.1 Gm37988 ENSMUSG00000104217.1 141 chr1 4807892 4807982 91 + StringTie exon 1000 NA MSTRG.17 ENSMUST00000155020.1 Gm37988 ENSMUSG00000104217.1 1

I don't known if I have made a mistake in the command lines or in options but finally with cuffmerge I have 128532 gene_id and 3356 of these 128532 are linked to two or more gene_names (known or unknown). It is really a problem when I use prepDE.py to quantify number of reads aligned to each gene because I have one line for each gene_id so for two or more genes which have not the same name and the same coordinates but which have the same gene_id I don't know how prepDE.py have count the number of reads aligned to it and I haven't the number for each gene separately. In the .gtf files generated with the first run of stringtie (for each sample), I already have this issue: I have the same gene_id "STRG. ..." for different gene names.

Do you know where this issue comes from ?

To solve this issue, we used the oId column where there is written "ENSMUST..." when the exon is an exon of a known transcript whereas it is written "CUFF..." when the exon comes from an unknown transcript. When the exon is known (first case), the gene_id is replaced by the gene_name. Do you know if it is a good solution ?

Thanks a lot in advance for yours answers !

brightbio commented 6 years ago

what if analysing differential gene expression following the protocol "An alternate, faster differential expression analysis workflow can be pursued if there is no interest in novel isoforms (i.e. assembled transcripts present in the samples but missing from the reference annotation), or if only a well known set of transcripts of interest are targeted by the analysis. This simplified protocol has only 3 steps (depicted below) as it bypasses the individual assembly of each RNA-Seq sample and the "transcript merge" step. "? avoid the case that One internally generated locus IDs like MSTRG.18605 with several gene names ? qq 20180308181127

JFourquet commented 6 years ago

Thanks a lot for your answer !

gpertea / stringtie

Merged.gtf file issue with identical gene_id for two different gene_name #153