gpertea / stringtie

Transcript assembly and quantification for RNA-Seq
MIT License
361 stars 76 forks source link

duplication of gene level data from prepde #328

Open jcohn42 opened 3 years ago

jcohn42 commented 3 years ago

Hi, I am analyzing a dataset with several samples that I plan to use for downstream analysis of both gene level and transcript level analyses. I noticed an issue with the gene count matrix generated by prepde.py3 (thanks for the new version!) that is puzzling - there are apparently several duplications of single genes. A few details: I am running stringtie-2.1.5 on bam files output from hisat2 2.2.1, using 150bp PE illumina RNASeq reads aligned to the maize genome AGPv4 (https://www.ebi.ac.uk/ena/browser/view/GCA_000005005.6) with the gene model annotation file converted to gtf from the hista2 index to capture known (well, predicted) splice sites. Stringtie --merge was then used to merge the gtf files, which were relabeled to merge the mstrg and gene_id using the code that you have shared (mstrg_prep.py). The renamed output file seems to be fine. Next, I rerun stingtie using the -e and -B options using the renamed merged.gtf file. These files are then fed to prepde.py3. The output is almost as expected; the transcript count matrix seems okay, but the gene matrix has this odd duplication;

-bash-4.2$ cut -f 1,2,3,4,5 -d , prepde.gene_count_matrix.csv | head -n 5 gene_id,sample1,sample2,sample3,sample4 MSTRG.2|Zm00001d027230|Zm00001d027230,0,0,3,17 MSTRG.2|Zm00001d027230,357,1306,458,289 MSTRG.26,0,26,0,12 MSTRG.27,16,18,15,4

There is no such transcript "MSTRG.2|Zm00001d027230|Zm00001" in the merged .gtf file (or in the new gtf files run with the -e -B option for that matter). It seems the gene_id was duplicated here for some reason? -bash-4.2$ grep "Zm00001d027230" stringtie/_merged_rename_mstrg.gtf | awk -F '\t' '$3 == "transcript"' 1 StringTie transcript 44288 49837 1000 + . gene_id "MSTRG.2|Zm00001d027230"; transcript_id "MSTRG.2.1"; 1 StringTie transcript 44289 49837 1000 + . gene_id "MSTRG.2|Zm00001d027230"; transcript_id "Zm00001d027230_T001"; gene_name "Zm00001d027230"; ref_gene_id "Zm00001d027230"; 1 StringTie transcript 46223 49837 1000 + . gene_id "MSTRG.2|Zm00001d027230"; transcript_id "MSTRG.2.3"; 1 StringTie transcript 46280 49837 1000 + . gene_id "MSTRG.2|Zm00001d027230"; transcript_id "MSTRG.2.4"; 1 StringTie transcript 46500 49837 1000 + . gene_id "MSTRG.2|Zm00001d027230"; transcript_id "MSTRG.2.5";

example of one of the sample files output from rerunning stringtie with -e -B calls for prepde, I also don't find any of the 'duplicated' names (in any of the files).

(rnaexpress_env) -bash-4.2$ grep "Zm00001d027230" stringtie_old/stringtie.prepde.gtf | awk -F '\t' '$3 == "transcript"' 1 StringTie transcript 44289 49837 . + . gene_id "MSTRG.1|Zm00001d027230"; transcript_id "Zm00001d027230_T001"; ref_gene_name "Zm00001d027230"; cov "0.0"; FPKM "0.000000"; TPM "0.000000"; 1 StringTie transcript 46223 49837 . + . gene_id "MSTRG.1|Zm00001d027230"; transcript_id "MSTRG.1.3"; cov "0.0"; FPKM "0.000000"; TPM "0.000000"; 1 StringTie transcript 46499 49837 . + . gene_id "MSTRG.1|Zm00001d027230"; transcript_id "MSTRG.1.5"; cov "0.0"; FPKM "0.000000"; TPM "0.000000"; 1 StringTie transcript 44288 49837 1000 + . gene_id "MSTRG.1|Zm00001d027230"; transcript_id "MSTRG.1.1"; cov "15.004131"; FPKM "1.087434"; TPM "2.530371"; 1 StringTie transcript 46280 49837 1000 + . gene_id "MSTRG.1|Zm00001d027230"; transcript_id "MSTRG.1.4"; cov "47.853241"; FPKM "3.468193"; TPM "8.070207"; 1 StringTie transcript 46527 49837 1000 + . gene_id "MSTRG.1|Zm00001d027230"; transcript_id "MSTRG.1.6"; cov "8.036685"; FPKM "0.582464"; TPM "1.355

I'm wondering why prepde is creating these new gene cluster names, and why an overlapping region (MSTRG.2) is output twice; with very different gene count results? The same pipeline has been run on different genomes; sometimes this is seen, but not always. If I understand what prepde is doing, it should cluster overlapping transcripts to a single gene, which doesn't seem to be happening in this case. shouldn't the gene names in the renamed merged.gtf file be preserved in the output of prepde?

Here is the annotation for that gene in the gtf file used; as you can see this one should be simple as there is only one predicted transcript: 1 gramene exon 44289 44947 . + . transcript_id "Zm00001d027230_T001"; gene_id "Zm00001d027230"; gene_name "Zm00001d027230"; 1 gramene exon 45666 45803 . + . transcript_id "Zm00001d027230_T001"; gene_id "Zm00001d027230"; gene_name "Zm00001d027230"; 1 gramene exon 45888 46133 . + . transcript_id "Zm00001d027230_T001"; gene_id "Zm00001d027230"; gene_name "Zm00001d027230"; 1 gramene exon 46229 46342 . + . transcript_id "Zm00001d027230_T001"; gene_id "Zm00001d027230"; gene_name "Zm00001d027230"; 1 gramene exon 46451 46633 . + . transcript_id "Zm00001d027230_T001"; gene_id "Zm00001d027230"; gene_name "Zm00001d027230"; 1 gramene exon 47045 47262 . + . transcript_id "Zm00001d027230_T001"; gene_id "Zm00001d027230"; gene_name "Zm00001d027230"; 1 gramene exon 47650 48111 . + . transcript_id "Zm00001d027230_T001"; gene_id "Zm00001d027230"; gene_name "Zm00001d027230"; 1 gramene exon 48200 49247 . + . transcript_id "Zm00001d027230_T001"; gene_id "Zm00001d027230"; gene_name "Zm00001d027230"; 1 gramene exon 49330 49837 . + . transcript_id "Zm00001d027230_T001"; gene_id "Zm00001d027230"; gene_name "Zm00001d027230";

Thanks for any hints you can provide to help troubleshoot this issue.

jcohn42 commented 3 years ago

update - after some troubleshooting; I think I may understand what is happening: if "gene_name" is available, this will be added to "gene_id", so prepde output will be "gene_id|gene_name". If the renaming step was used, this will produce "transcript_id|gene_id" ="gene_id" in the renamed .gtf file. So, if this renamed file is fed to prepde, you may have effectively "transcript_id|gene_id|gene_name", and if gene_id and _gene_name are identical, it will be redundant and appear as a duplicate as in the example.