gpertea / stringtie

Transcript assembly and quantification for RNA-Seq
MIT License
365 stars 76 forks source link

Overlapped gene ids in the gene abundance tab file #363

Closed dahun73 closed 2 years ago

dahun73 commented 2 years ago

Hello :)

I tried to use stringtie which is one of the most famous quantification tools! I want to get tpm or fpkm profiles in gene abundance file. However, in my result, there are some overlapped genes(actually overlapped 'gene ids').

Actually, I used this command:

for FILE in $(ls 02.Align/*Aligned.sortedByCoord.out.bam)
do
    if [ ! -s "stringtie/$(basename ${FILE%Aligned.sortedByCoord.out.bam})/$(basename ${FILE%Aligned.sortedByCoord.out.bam}).gtf" ]

    stringtie --rf -p 16  \
    -A stringtie/$(basename ${FILE%Aligned.sortedByCoord.out.bam})/gene_id.tab \
    -G /home/BioResource/GeneSet/mm10/gencode.vM24.annotation.gtf \
    -e -B -o stringtie/$(basename ${FILE%Aligned.sortedByCoord.out.bam})/$(basename ${FILE%Aligned.sortedByCoord.out.bam}).gtf \
    02.Align/$(basename ${FILE%})
    else    echo "StringTie output already exist"
    fi     
done

When I checked the total gene id in gene abundance tab file, there are a few overlapped gene here.

cat gene_id.tab | awk -F"\t" '{print $1}' | sed '1,1d' | sort -n | uniq -c | sort | tail -n 20
      1 ENSMUSG00000118631.1
      1 ENSMUSG00000118632.1
      1 ENSMUSG00000118633.1
      1 ENSMUSG00000118634.1
      1 ENSMUSG00000118635.1
      1 ENSMUSG00000118636.1
      1 ENSMUSG00000118637.1
      1 ENSMUSG00000118638.1
      1 ENSMUSG00000118639.1
      1 ENSMUSG00000118640.1
      2 ENSMUSG00000001027.7
      2 ENSMUSG00000026162.7
      2 ENSMUSG00000030228.18
      2 ENSMUSG00000056050.13
      2 ENSMUSG00000056856.13
      2 ENSMUSG00000097497.1
      2 ENSMUSG00000100826.6
      2 ENSMUSG00000104544.3
      4 ENSMUSG00000025515.15
      4 ENSMUSG00000093803.3

In my thought, it should not show any overlapped gene ids. Also, I tried to check my gtf file whether there are overlapped genes or not. However, there is no overlapped gene id.... Or is there any something missing option in my command..? Surprisingly, in prepDE.py result, there are only unique gene ids...I could not understand why this happened.....