gpertea / stringtie

Transcript assembly and quantification for RNA-Seq
MIT License
365 stars 76 forks source link

Non-existing gene names being assigned to known genes #249

Closed wyoibc closed 4 years ago

wyoibc commented 4 years ago

This problem is unrelated to the insertion of MSTRG IDs for transcripts that only partially overlap a known gene. I am using stringtie v1.3.4a.

The A. thaliana reference GTF I am using was obtained from here.

Let's consider the case of the gene AT1G27300

grep "AT1G27300"  TAIR10_GFF3_genes.gff 

Chr1    TAIR10  gene    9483157 9484368 .       +       .       ID=AT1G27300;Note=protein_coding_gene;Name=AT1G27300
Chr1    TAIR10  mRNA    9483157 9484368 .       +       .       ID=AT1G27300.1;Parent=AT1G27300;Name=AT1G27300.1;Index=1
Chr1    TAIR10  protein 9483324 9484194 .       +       .       ID=AT1G27300.1-Protein;Name=AT1G27300.1;Derives_from=AT1G27300.1
Chr1    TAIR10  exon    9483157 9483464 .       +       .       Parent=AT1G27300.1
Chr1    TAIR10  five_prime_UTR  9483157 9483323 .       +       .       Parent=AT1G27300.1
Chr1    TAIR10  CDS     9483324 9483464 .       +       0       Parent=AT1G27300.1,AT1G27300.1-Protein;
Chr1    TAIR10  exon    9483733 9484368 .       +       .       Parent=AT1G27300.1
Chr1    TAIR10  CDS     9483733 9484194 .       +       0       Parent=AT1G27300.1,AT1G27300.1-Protein;
Chr1    TAIR10  three_prime_UTR 9484195 9484368 .       +       .       Parent=AT1G27300.1

This gene, located on chromosome 1, stretches from position 9483157 through 9484368 (1211 bp) (NCBI Link; TAIR Link).

When Stringtie outputs the sample-wise GTF files, this ID is retained as expected. However, in the step where individual GTF files are merged into a Master GTF and then subsequently the merged GTF is used to prepare count tables for ballgown, this ID is replaced with something entirely different. Case in point:

grep "AT1G27300" sample1.gtf

Chr1    StringTie   transcript  9483324 9484194 1000    +   .   gene_id "AT1G01010230"; transcript_id "AT1G27300.1-Protein"; cov "91.692024"; FPKM "25.861988"; TPM "27.150040";
Chr1    StringTie   exon    9483324 9483464 1000    +   .   gene_id "AT1G01010230"; transcript_id "AT1G27300.1-Protein"; exon_number "1"; cov "42.144100";
Chr1    StringTie   exon    9483733 9484194 1000    +   .   gene_id "AT1G01010230"; transcript_id "AT1G27300.1-Protein"; exon_number "2"; cov "106.813782";
Chr1    StringTie   transcript  9483157 9484368 1000    +   .   gene_id "AT1G01010230"; transcript_id "AT1G27300.1"; ref_gene_name "AT1G27300"; cov "13.733804"; FPKM "3.873657"; TPM "4.066584";
Chr1    StringTie   exon    9483157 9483464 1000    +   .   gene_id "AT1G01010230"; transcript_id "AT1G27300.1"; exon_number "1"; ref_gene_name "AT1G27300"; cov "4.586631";
Chr1    StringTie   exon    9483733 9484368 1000    +   .   gene_id "AT1G01010230"; transcript_id "AT1G27300.1"; exon_number "2"; ref_gene_name "AT1G27300"; cov "18.163567";

The gene_id being assigned here is AT1G01010230, which when searched either on TAIR or NCBI, results in no hits. The ID also can't be found in the reference GFF.

This is just one example of the numerous instances where real gene_ids are being replaced by manufactured IDs. Perhaps the most troubling aspect of this is that the new gene ids are being named in the A. thaliana fashion making them seem like known ids when they are anything but.

I was hoping @gpertea could shed some light on this. The code being used is identical to that described in Pertea et al (2016) Nature Protocols paper.

gpertea commented 4 years ago

It does seem weird. At that stage (running stringtie -e with -G set to the merged GTF), all the output IDs (transcript's and gene's) should be the same with the IDs in the merged GTF. The name of that merged GTF file can be found by looking at the command line that was used to generate that sample1.gtf (which should be the 1st line of that output file). Can you locate that merged GTF file and look at the 1st line in that file to see what was the full command line for the stringtie --merge stage?

I would suggest focusing on that merged GTF -- the same "manufactured" IDs should be there too, so can you grep "AT1G01010230" in there as well? Also, looking at that command line for stringtie --merge, what reference was used for -G there ? I guess it was the one you mentioned above, TAIR10_GFF3_genes.gff. That file in itself is a bit "unusual" for me - they decided to add the redundant "protein" features and the CDS segments are doubly parented by both the "mRNA" and "protein" features which may have caused some trouble for the GFF parser.. (still, no excuse, I should've handled that). I'll check if my parser didn't stumble into those extraneous "protein" features and somehow triggered a strange bug that messed up the gene IDs (?).. I don't see how exactly that could've happened but I'll investigate.

It is also possible that v1.3.4a could've been plagued by some GFF3 parsing bugs that were later fixed (though unlikely..) - could you perhaps re-run the same stringtie --merge step using the latest version of stringtie (v2.0.4) ? And then check if the same strange "made-up" gene IDs are there..

wyoibc commented 4 years ago

Thanks for your speedy and very helpful response.

Indeed I checked the merged GTF along with the sample-wise GTFs and they all have the same manufactured IDs as mentioned above. In looking back at my scripts, it appears that I used only the -G flag set to reference GTF in the merge script, and then flags -e and -G (this time set to merged GTF) in the subsequent script.

I agree that the reference GTF looks non-standard posing its own challenges. But before you spend time on troubleshooting the parser in 1.3.4a, let me rerun the analysis with the latest version to make sure the problem remains. More on that soon.

wyoibc commented 4 years ago

As it turns out, this problem is not of stringtie's making. I have been replacing the MSTRG IDs with corresponding AT gene IDS using sed and because of the way replacement was set up, it was performing unnecessary operations due to extended match. This resulted in additional numbers after the gene_ids.

Apologies for the false alarm. One must use extreme caution when doing global replacements using sed. I have since give up this approach and fixed it using another methods.

As for stringtie v2.x, it generates output identical in format to that of v1.3.4a, with the exception that the total number of entries in the merged GTF is substantially lower. Since that is a different issue, I am closing this thread.