gpertea / stringtie

Transcript assembly and quantification for RNA-Seq
MIT License
381 stars 78 forks source link

Same gene id for multiple genes #270

Open m-waqas opened 4 years ago

m-waqas commented 4 years ago

I am interested to perform the differential gene expression (DE) and alternative splicing analysis of my RNA-seq dataset using Texudo2 approach as recommended in Pertea et al., 2016 (doi:10.1038/nprot.2016.095). I also want to use Deseq2 for DE analysis.

we have built a genome index using the command:

hisat2-build \ --ss Brassica_napus.AST_PRJEB5043_v1.43.ss \ --exon Brassica_napus.AST_PRJEB5043_v1.43.exons \ -f Brassica_napus.AST_PRJEB5043_v1.dna.toplevel.fa \ -p 4 \ napus

Then we aligned the reads using the following command: hisat2 -x genome/napus -1 fastq/${R1[$i]} -2 fastq/${R2[$i]} -S S${sample[$i]}.sam -p 4

Here I forget to use --dta parameter

Next, we sort and index the files:

samtools sort -@ 4 -o S${sample[$i]}.bam S${sample[$i]}.sam samtools index S${sample[$i]}.bam S${sample[$i]}.bai

The bam files were assembled using command: stringtie S${sample[$i]}.bam -G genome/Brassica_napus.AST_PRJEB5043_v1.43.gtf -o S${sample[$i]}.bam -l S${sample[$i]} -p 4

Next, We merged the assembles using stringtie --merge:

stringtie --merge -G genome/Brassica_napus.AST_PRJEB5043_v1.43.gtf -o Bnapus.gtf -p 4 merge_Bnapus_gtf.names

Merged assembly was compared using gffcomapre:

gffcompare -r genome/Brassica_napus.AST_PRJEB5043_v1.43.gtf -o Bnapus_compare Bnapus.gtf

This merged assembly was used to quantify expression using the command:

stringtie -e -B -G Bnapus.gtf -o ballgown/$sampleName/$sampleName.gtf -p 4 bam/$sampleName.bam

Now the problem is that I am getting the same gene id for different genes, I have checked different threads and noticed that this issue is raised several times (#159, #118, #153, #190, #217). It addressed well in some threads and is quite clear from this statement:

StringTie makes assembly decisions based primarily on the read alignment data. Reference annotation is often imperfect and lacking, and in order to allow for the discovery of novel isoforms, StringTie always uses the read alignments as the basis of transcript assembly. Unfortunately read alignments can also be wrong/imperfect and may actually "bridge" neighboring genes, as it seems to happen in our case where we are getting the same gene id for different genes.

I believe this issue can be sorted by three main approaches as mentioned in previous threads:

1) Using a better or more stringent read alignment strategy may help with this problem. Or some post-alignment filtering can be applied to the alignment data in order to eliminate large, low scoring alignments which seem to spuriously "connect" neighboring genes. This could be very challenging and time-consuming.

2) We can try to use stringent parameters for assembly. E.g. increasing value for -f (fraction of isoforms) -j (junction coverage) -c (coverage allowed for the predicted transcripts). This approach will help us up to some extent as some merged ref will be separated but there will still remain some merged together.

3) An alternate, faster differential expression analysis workflow can be pursued if there is no interest in novel isoforms (i.e. assembled transcripts present in the samples but missing from the reference annotation), or if only a well known set of transcripts of interest are targeted by the analysis. This simplified protocol has only 3 steps as it bypasses the individual assembly of each RNA-Seq sample and the "transcript merge" step to avoid the case that One internally generated locus IDs like MSTRG with several gene names.

Keeping in mind all these points, I still need your suggestions for some points to proceed further with downstream analysis.

1) I didn't use --dta/--downstream-transcriptome-assembly (Report alignments tailored for transcript assemblers including StringTie. With this option, HISAT2 requires longer anchor lengths for de novo discovery of splice sites. This leads to fewer alignments with short-anchors, which helps transcript assemblers improve significantly in computation and memory usage.) which can be helpful for the downstream analysis. So my question is that do I really need to use it and have to do the alignment again or am I good to proceed for downstream analysis?

2) Is it a good idea to proceed with just uniquely mapped reads, maybe this help to reduce the false-positive and assist to sort out the problem of same gene id for multiple genes

3) If I used the alternate approach (which has only 3 steps) can I still detect the alternative splicing using ASTALAVISTA?

4) If I keep my focus on just known genes and perform DE analysis based on it, can we still publish good quality stuff?

5) How about performing analysis using following two complementary approaches to quantify expression and then perform DE analysis based on both approaches:

a) Stringtie b) Salmon

Looking forward to hearing from you. please let me know if you need any further details.

JessicaHayward commented 4 years ago

Hi, I had the exact same problem - different genes being tagged with the same unique MSTRG identifier from stringtie --merge command. My work-around was to use this python script (https://gist.github.com/e2b3e89fc0220e4363765fadc8c1dd73), which modifies the merged.gtf output file by adding a suffix onto MSTRG identifiers for each different gene_name, and if there is no gene_name, then it adds a suffix onto MSTRG identifiers for each different transcript_id. Then I used this modified gtf file in stringtie -e command. This isn't a solution to the problem for novel isoforms but thought I'd put it here in case it is helpful to others.

m-waqas commented 4 years ago

Hi Jessica,

I hope you are doing well. I believe this script will be of much help, can you please tell me how I need to use this script, means do I just need to provide merged gtf as input or also need to provide merged gtf and reference gtf as input, sorry I am not good in python so any help will be highly appreciated.

Best Regards,

Muhammad Waqas Khokhar


From: Jessica Hayward notifications@github.com Sent: 30 September 2020 17:32 To: gpertea/stringtie stringtie@noreply.github.com Cc: Khokhar, Muhammad (w.w.khokhar40@canterbury.ac.uk) w.w.khokhar40@canterbury.ac.uk; Author author@noreply.github.com Subject: Re: [gpertea/stringtie] Same gene id for multiple genes (#270)

Hi, I had the exact same problem - different genes being tagged with the same unique MSTRG identifier from stringtie --merge command. My work-around was to use this python script (https://gist.github.com/e2b3e89fc0220e4363765fadc8c1dd73), which modifies the merged.gtf output file by adding a suffix onto MSTRG identifiers for each different gene_name, and if there is no gene_name, then it adds a suffix onto MSTRG identifiers for each different transcript_id. Then I used this modified gtf file in stringtie -e command. This isn't a solution to the problem for novel isoforms but thought I'd put it here in case it is helpful to others.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHubhttps://github.com/gpertea/stringtie/issues/270#issuecomment-701503389, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ACQRPYOWTVME3YBBKLCPGVLSINMS7ANCNFSM4MVVP7FA.

JessicaHayward commented 4 years ago

You just need your merged gtf file and then to run the script just use: python unique_gene_id.py merged_file.gtf The output just has ".unique_gene_id.gtf" on the end. Good luck!

m-waqas commented 4 years ago

Many thanks Jessica

Best Regards,

Muhammad Waqas Khokhar

PhD (Student)

School of Human and Life Sciences

Canterbury Christ Church University (CCCU)

Canterbury, Kent, UK.


From: Jess_Hay notifications@github.com Sent: 02 October 2020 15:05 To: gpertea/stringtie stringtie@noreply.github.com Cc: Khokhar, Muhammad (w.w.khokhar40@canterbury.ac.uk) w.w.khokhar40@canterbury.ac.uk; Author author@noreply.github.com Subject: Re: [gpertea/stringtie] Same gene id for multiple genes (#270)

You just need your merged gtf file and then to run the script just use: python unique_gene_id.py merged_file.gtf The output just has ".unique_gene_id.gtf" on the end. Good luck!

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHubhttps://github.com/gpertea/stringtie/issues/270#issuecomment-702753876, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ACQRPYKJT2WVDNXBQLCVRCLSIXM35ANCNFSM4MVVP7FA.