gpertea / stringtie

Transcript assembly and quantification for RNA-Seq
MIT License

Issue: could not locate transcript gene #381

Open carlmed00 opened 2 years ago

carlmed00 commented 2 years ago
Hello!

I am using v2.2.1 but encounter a similar error.

CH_RNA-ZF_BPATCS/CH/EC_output$ prepDE.py -i sequence_mapping.txt -g genematrix.csv -t transcriptmatrix.csv
Error: could not locate transcript gene-b0731 entry for sample Chl1-2
Traceback (most recent call last):
  File "/home/ryan/stringtie/prepDE.py", line 284, in <module>
    geneDict.setdefault(geneIDs[i],{}) #gene_id
KeyError: 'gene-b0731'

I rechecked and made sure that I used -e when generating all of my files, but I still get the same error.

Originally posted by @carlmed00 in https://github.com/gpertea/stringtie/issues/337#issuecomment-1281851293

BernadetteBiology commented 1 year ago

Hello! I do not have a solution for you; however, I have had a similar issue and found out what might be causing it. I am also running StringTie 2.2.1, and I have previously run StringTie 2.1.3b successfully.

For clarity, I ran this pipeline (based on Pertea et al., 2016):

  1. Aligned RNA-seq reads to the genome using hisat2
  2. Sorted and converted SAM to BAM
  3. Assembled and quantified expressed genes and transcripts (I did NOT use a genome annotation, i.e. did not use the -G option)
  4. Merged transcripts from all samples
  5. Estimated transcript abundances and created table counts for Ballgown (using the -B and -e options)
  6. Used prepDE.py3 to generate a gene counts matrix and a transcript counts matrix

In the resulting files, this is what I noticed was different from my previous results with StringTie 2.1.3b:

  1. gene_counts_matrix.csv:
     - There were many more 0 values for genes than when using StringTie 2.1.3b.
     - Also, the last row of the spreadsheet has a gene named "<class 'str'>" with high values.

  2. transcript_counts_matrix.csv:
     - For some transcript rows, many samples have empty cells while other samples have values. There is no placeholder value in those cells, not even a 0.
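To locate these empty cells without scrolling through the spreadsheet, something like the following may help. It is only a minimal sketch: it assumes the matrix is a plain comma-separated file whose first column holds the transcript ID and whose header row holds the sample names. The function name `find_incomplete_rows` and the transcript IDs in the example are made up for illustration.

```python
import csv
import io


def find_incomplete_rows(csv_text):
    """Return {row_id: [sample columns whose cell is empty or missing]}."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    incomplete = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        # Pad short rows so that missing trailing cells are reported too.
        padded = row + [""] * (len(header) - len(row))
        empty = [header[i] for i in range(1, len(header)) if padded[i].strip() == ""]
        if empty:
            incomplete[row[0]] = empty
    return incomplete


# Hypothetical two-sample matrix; the second transcript has no value for Chl1-2.
example = "transcript_id,Chl1-1,Chl1-2\nrna-b0001,12,7\nrna-b0002,5,\n"
print(find_incomplete_rows(example))
```

For a real file, the text could be read with `open("transcript_counts_matrix.csv").read()` and passed to the same function.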

I also tried Python 2.7 with prepDE.py and got the same error as you. After going through all of the previous output files, I can confirm that the error message is accurate. In the per-sample merged.gtf output files (the ones that contain the gene_id, transcript_id, FPKM and TPM values), certain samples do not have all of the transcript names. In other words, a given transcript ID is present in some of my sample merged.gtf files and absent from others.
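For anyone who wants to check their own output for this, here is a rough sketch of the comparison. It assumes standard GTF layout (nine tab-separated columns, with `transcript_id "..."` in the attribute column) and treats the union of IDs across all samples as the expected set. The function names and the example IDs are made up; this is not part of prepDE.py.

```python
import re

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def transcript_ids(gtf_lines):
    """Collect the transcript_id of every 'transcript' feature line."""
    ids = set()
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue
        match = TRANSCRIPT_ID.search(fields[8])
        if match:
            ids.add(match.group(1))
    return ids


def missing_per_sample(samples):
    """samples: {sample_name: iterable of GTF lines}.

    Returns {sample_name: sorted IDs seen in other samples but not this one}.
    """
    per_sample = {name: transcript_ids(lines) for name, lines in samples.items()}
    all_ids = set().union(*per_sample.values())
    return {
        name: sorted(all_ids - ids)
        for name, ids in per_sample.items()
        if all_ids - ids
    }


# Hypothetical in-memory example with two samples.
line = 'chr1\tStringTie\ttranscript\t1\t100\t1000\t+\t.\tgene_id "g1"; transcript_id "{}";\n'
samples = {
    "sampleA": [line.format("t1"), line.format("t2")],
    "sampleB": [line.format("t1")],
}
print(missing_per_sample(samples))
```

With real data, each value in `samples` could be an open file handle for that sample's GTF, since the functions only iterate over lines.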

Something may be going wrong in the step that merges transcripts from all samples, but I am not certain.

Update: I ran the same reads and parameters with StringTie 2.1.3b and there were no issues. I looked into other forums, and it seems that similar issues are occurring specifically with version 2.2.1. Maybe @gpertea would be able to help.