I've been doing RNAseq analyses with the HISAT/Stringtie pipeline for the last few years. Our computing centre updated their hardware a few weeks ago and subsequently uninstalled some older versions of programs. Ever since then my pipeline runs 'correctly' but when I get the gene_count_matrix.csv file out there are 'duplicate' genes. Example below:
The problem appears to be that Stringtie is interpreting MSTRG.12946 transcripts as additional genes when there are already predicted FhHiC23_g15386 transcripts. See below for lines from gtf:
This is slightly perplexing to me as I have instances of this in gtf files I processed a few years ago but this issue never arose. At this point I've tried changing stringtie version, python version, tried both prepDE and prepDE.py3, tried with the MSTRG_prep.pl renaming script and without it. So after 3 weeks of bashing my head off a wall I was wondering if anyone has come across this problem before and if so what is the solution? For reference I have used the following scripts (extracts) to process the data:
Hi,
I've been doing RNAseq analyses with the HISAT/Stringtie pipeline for the last few years. Our computing centre updated their hardware a few weeks ago and subsequently uninstalled some older versions of programs. Ever since then my pipeline runs 'correctly' but when I get the gene_count_matrix.csv file out there are 'duplicate' genes. Example below:
MSTRG.12946 gene:FhHiC23_g15386 FhHiC23_g15386 646 262 395 469 323 248 MSTRG.12946 gene:FhHiC23_g15386 8239 7744 7878 7112 6907 6842
The problem appears to be that Stringtie is interpreting MSTRG.12946 transcripts as additional genes when there are already predicted FhHiC23_g15386 transcripts. See below for lines from gtf:
FhHiC23_scaffold_8 StringTie transcript 44358590 44394716 1000 - . gene_id "MSTRG.12946|gene:FhHiC23_g15386"; transcript_id "MSTRG.12946.1"; FhHiC23_scaffold_8 StringTie transcript 44358590 44382847 1000 - . gene_id "MSTRG.12946|gene:FhHiC23_g15386"; transcript_id "MSTRG.12946.2"; FhHiC23_scaffold_8 StringTie transcript 44358654 44380005 1000 - . gene_id "MSTRG.12946|gene:FhHiC23_g15386"; transcript_id "MSTRG.12946.3"; FhHiC23_scaffold_8 StringTie transcript 44358866 44382847 1000 - . gene_id "MSTRG.12946|gene:FhHiC23_g15386"; transcript_id "MSTRG.12946.4"; FhHiC23_scaffold_8 StringTie transcript 44359269 44393180 1000 - . gene_id "MSTRG.12946|gene:FhHiC23_g15386"; transcript_id "transcript:FhHiC23_g15386.t2"; gene_name "FhHiC23_g15386"; ref_gene_id "gene:FhHiC23_g15386"; FhHiC23_scaffold_8 StringTie transcript 44359269 44393180 1000 - . gene_id "MSTRG.12946|gene:FhHiC23_g15386"; transcript_id "transcript:FhHiC23_g15386.t1"; gene_name "FhHiC23_g15386"; ref_gene_id "gene:FhHiC23_g15386";
Not every gene count is like this and some have a single line for each gene in cases where no additional MSTRG transcripts were predicted.
MSTRG.12945 gene:FhHiC23_g15385 FhHiC23_g15385 212 319 268 247 376 448
This is slightly perplexing to me as I have instances of this in gtf files I processed a few years ago but this issue never arose. At this point I've tried changing stringtie version, python version, tried both prepDE and prepDE.py3, tried with the MSTRG_prep.pl renaming script and without it. So after 3 weeks of bashing my head off a wall I was wondering if anyone has come across this problem before and if so what is the solution? For reference I have used the following scripts (extracts) to process the data:
Stringtie module load apps/samtools/1.9/gcc-14.1.0 module load stringtie/1.3.4b
stringtie -p 8 -G /mnt/scratch2/users/3049580/genome_index/fasciola_hepatica.PRJEB58756.WBPS19.canonical_geneset.gtf -o /mnt/scratch2/users/3049580/pi3k_analysis/Stringtie1/PN0526_0001_S1_L001.gtf /mnt/scratch2/users/3049580/pi3k_analysis/BAM_files/PN0526_0001_S1_L001.bam
Stringtie Merge
module load apps/samtools/1.9/gcc-14.1.0 module load stringtie/1.3.4b module load gffcompare/0.10.3 module load apps/cufflinks/2.2.2.20190706/gcc-8.5.0+boost-1.55.0+samtools-0.1.19+eigen-3.4.0 module load apps/perl/5.40.0/gcc-14.1.0
stringtie --merge -p 8 -G /mnt/scratch2/users/3049580/genome_index/fasciola_hepatica.PRJEB58756.WBPS19.annotations.gff3 -o /mnt/scratch2/users/3049580/pi3k_analysis/stringtiemerge/pi3k_RNAi.gtf mergelist.txt gffcompare -r /mnt/scratch2/users/3049580/genome_index/fasciola_hepatica.PRJEB58756.WBPS19.canonical_geneset.gtf -G -o /mnt/scratch2/users/3049580/pi3k_analysis/stringtiemerge/gffcompare_pi3k_ds /mnt/scratch2/users/3049580/pi3k_analysis/stringtiemerge/pi3k_RNAi.gtf perl mstrg_prep.pl /mnt/scratch2/users/3049580/pi3k_analysis/stringtiemerge/pi3k_RNAi.gtf > /mnt/scratch2/users/3049580/pi3k_analysis/stringtiemerge/named_pi3k_RNAi.gtf gffread -w /mnt/scratch2/users/3049580/pi3k_analysis/stringtiemerge/named_pi3k_RNAi.fasta -g /mnt/scratch2/users/3049580/genome_index/fasciola_hepatica.PRJEB58756.WBPS19.genomic.fa /mnt/scratch2/users/3049580/pi3k_analysis/stringtiemerge/named_pi3k_RNAi.gtf
Stringtie_estimate_Abundance
module load apps/samtools/1.9/gcc-4.8.5 module load stringtie/1.3.4b
stringtie -e -B -p 8 -G /mnt/scratch2/users/3049580/pi3k_analysis/stringtiemerge/named_pi3k_RNAi -o /mnt/scratch2/users/3049580/pi3k_analysis/Stringtie2_noname/rfp1_dsRNA.gtf -A /mnt/scratch2/users/3049580/pi3k_analysis/Stringtie2_noname/abundances/rfp1_dsRNA /mnt/scratch2/users/3049580/pi3k_analysis/BAM_files/PN0526_0001_S1_L001.bam
Prep for DESeq2
module load apps/python/2.7.17/gcc-14.1.0
python prepDE.py -i deseqtransfer.txt -g ctrl_ds_v_pi3k_ds_gene_counts.csv -t ctrl_ds_v_pi3k_ds_transcript_counts.csv
Many thanks anyone who can help!