gpertea / stringtie

Transcript assembly and quantification for RNA-Seq
MIT License
378 stars 78 forks source link

Error when running prepDE.py #238

Open ms-gx opened 5 years ago

ms-gx commented 5 years ago

I get the following error when running prepDE.py on stringtie estimate output:

Traceback (most recent call last):
  File "./prepDE.py", line 255, in <module>
    geneDict.setdefault(geneIDs[i],{}) #gene_id
KeyError: 'STRG.337.1'

I only get this for one sample out of fifteen. I run the HISAT2 > Stringtie workflow as described in Pertea et al. (2016). I want to feed this into edgeR. Is there another option in case I can not resolve this problem?

Additional question: What is the difference between MSTRG and STRG?

ttgump commented 5 years ago

It seems this issue appears in the new version. There is no problem for Stringtie 1.3.x.

gpertea commented 5 years ago

@ttgump, that is incorrect (I hope!), prepDE.py should work with version 2.0 just the same, I tested it on the original RNA-Seq protocol data (chrX) and it worked as expected.

[ 10/23 EDIT ] @ttgump you were in fact correct, apologies for the above, it turns out that the chrX data set was too simple to expose the problem in stringtie 2.0 which indeed outputs additional STRG entries when the -e option is given -- which should not happen. A fix is in the works.

Admittedly it might be tricky to get it running properly, especially when a parent folder is provided as input (instead of a space delimited table with sampleID and GTF_path), as it pulls in other .gtf files in that directory tree that you might not have wanted to use for prepDE.py.

For the regular, full stringtie->prepDE pipeline, prepDE is expecting to see a MSTRG prefix for most transcripts - because this prefix represents the transcripts "merged" from multiple samples, and the estimation run should be performed against that common set of "merged" transcripts. The fact that you get an error for a STRG prefix there seems to indicate that prepDE.py is trying to process one of the "initial" (1st phase ) output files of stringtie instead.

Perhaps in the estimation phase you forgot to provide the -e option ? That would explain why your transcript has a STRG prefix - because that prefix is assigned when the -e option is not given and thus StringTie assembled the read alignments into newly reconstructed transcripts, and those get the STRG prefix.

For further diagnosing prepDE running issues I'd suggest to take a look at https://github.com/gpertea/stringtie/issues/234#issuecomment-541494630 where I posted a modified version of the prepDE.py script which should help with finding potential issues with the input GTF files (if they were not all generated as expected). Let me know if you encounter any problems running that version on your data.

yan-cheng-lin commented 5 years ago

Hi, I got the same error when using the new version 2.0.3, but version 1.3.6 is OK. My data is RNA-Seq of a tomato bi-parental population. Therefore, I only used the RNA-Seq data of the parental lines (parents) to generate merged gtf, and used it to estimate the gene expression of the progenies (offspring). Then, I got the same problem when prepDE.py running through some progenies. Should I generate and merge gtfs for all samples? In addition, I also found that the gtf generate from these two versions were quite different, could you explain the main difference? Thank you !!

ttgump commented 5 years ago

@gpertea When I use new version 2.0, the prepDE.py print the error message: Traceback (most recent call last): File "prepDE.py", line 255, in geneDict.setdefault(geneIDs[i],{}) #gene_id KeyError: 'ENST00000647043.1' Then I transfer to use version 1.3.7, it works well. I have spent one afternoon to figure this out.😂

gpertea commented 5 years ago

Then that's a problem we'd surely like to fix. Stringtie v2 did improve assembly accuracy in our tests (to answer @andy3404's question), but there were sections of the code which were a complete rewrite compared to v1.3 (some of them due to adding support for long reads) and that accounts for the differences in the output and occasionally, it seems that the output when -e option is used (which means "only estimate the abundance, do not assemble"), in v2 may sometimes "drop" (i.e. not report) some redundant (or uncovered) reference transcripts which in turn makes prepDE.py fail.

But that should not happen and we would like to fix it. It would be very helpful if any of you can provide us with such an example, where running stringtie v2 with -e option seems to have dropped (not report) a transcript from the merge file that was given with -G option.

As an example: @ttgump, before that error message you should have seen the sample name printed there, right? (it would be helpful if you had shown the entire prepDE message here) Was that the 2nd sample which was printed there as being processed by prepDE ? Because that means that the first sample is actually the one missing that ENST00000647043.1, so the problem would be with the stringtie -e run for that first sample (even though prepDE does not complain about that one), because that was the one dropping ENST00000647043.1 from the output of stringtie -e, even though that transcript is in the -G reference file. Do you think you can identify that first sample and perhaps share with me the corresponding .bam file and the file that you used with -G when you ran stringtie -e on that sample? That would allow us to properly debug this issue.

ttgump commented 5 years ago

Yes. The error happens on the second sample. How could I share the bam file with you? It is several Gbs. Thanks.

On Oct 16, 2019, at 8:07 AM, Geo Pertea notifications@github.com wrote:

 Then that's a problem we'd surely like to fix. Stringtie v2 did improve assembly accuracy in our tests (to answer @andy3404's question), but there were sections of the code which were a complete rewrite compared to v1.3 (some of them due to adding support for long reads) and that accounts for the differences in the output and occasionally, it seems that the output when -e option is used (which means "only estimate the abundance, do not assemble"), in v2 may sometimes "drop" (i.e. not report) some redundant (or uncovered) reference transcripts which in turn makes prepDE.py fail.

But that should not happen and we would like to fix it. It would be very helpful if any of you can provide us with such an example, where running stringtie v2 with -e option seems to have dropped (not report) a transcript from the merge file that was given with -G option.

As an example: @ttgump, before that error message you should have seen the sample name printed there, right? (it would be helpful if you had shown the entire prepDE message here) Was that the 2nd sample which was printed there as being processed by prepDE ? Because that means that the first sample is actually the one missing that ENST00000647043.1, so the problem would be with the stringtie -e run for that first sample (even though prepDE does not complain about that one), because that was the one dropping ENST00000647043.1 from the output of stringtie -e, even though that transcript is in the -G reference file. Do you think you can identify that first sample and perhaps share with me the corresponding .bam file and the file that you used with -G when you ran stringtie -e on that sample? That would allow us to properly debug this issue.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or unsubscribe.

Gin-Wang commented 5 years ago

i met same question using prepDE.py updated 8 days ago. Here is the error printed in the screen.

Error: could not locate transcript STRG.2470.1 entry for sample neg2 Traceback (most recent call last): File "6.prepDE.py", line 281, in geneDict.setdefault(geneIDs[i],{}) #gene_id KeyError: 'STRG.2470.1'

maybe the length of gtfs are not equal to each other?

gpertea commented 5 years ago

@Gin-Wang did you run the new prepDE.py with the -v option ? That should show the name/path of each sample as it is processed. I'd be interested in the BAM file corresponding to the entry which appears to be successfully processed by prepDE before the sample that caused the error. For example, in your case, the sample neg2 seems to have made prepDE fail, but assuming you have a sample neg1 just before that -- it would be great if you could share the .bam file corresponding to that neg1 sample (not neg2 !). Additionally, I'd need the -G reference file that was used for all those samples with stringtie -e , before you ran prepDE. The name of that file can be found in the first line of the GTF being processed by prepDE for each sample.

@ttgump and others willing/able to share such BAM file, could you upload the file to Google Drive or a similar service which allows sharing the download link (so you can email/share the download link with geo.pertea@gmail.com). Alternatively, send me an email so I can send you instructions for uploading the files to our FTP server, though lately we've had some problems with large uploads on that ftp server, but we could give it a try.

Gin-Wang commented 5 years ago

@gpertea thanks for your answer, i tried it with "-v" just now and met the similar question and here is the log:

python2 6.prepDE.py -i 5.ballgown -v

processing sample neg1 from file 5.ballgown/neg1/output_merge.gtf processing sample pos1 from file 5.ballgown/pos1/output_merge.gtf Error: could not locate transcript STRG.10345.1 entry for sample pos1 Traceback (most recent call last): File "6.prepDE.py", line 281, in geneDict.setdefault(geneIDs[i],{}) #gene_id KeyError: 'STRG.10345.1'

i'd like to share with you the files as you mentioned above, but it may take a while. Actually it printed similar error when i put my pos1 and neg1 files and those bam files are smaller than others, so i will upload those files. Thanks for your answers and i will share you the links to download the files as soon as possible.

gpertea commented 5 years ago

Thank you for your help with this problem. From that log it looks like the files that I would need should be shown in the first line of the file: ./5.ballgown/neg1/output_merge.gtf That line should show the stringtie command line that was used to generate this output_merge.gtf file. That should show the .bam file that I would need in order to investigate this, and also the reference (guide file) that follows the -G option on that line -- I would need that too.

To repeat and clarify: I would NOT need multiple .bam files, just one should suffice (the one for the neg1 sample in your example here, the one for which prepDE did NOT fail).

LunaPatton commented 5 years ago

@gpertea today I ran the new script and met the simliar problem like before. Attach the code below:

!/bin/sh

PBS -q long

PBS -l nodes=1:ppn=1

cd /media/disk1/sijf/yanze/RNA-Seq/stringtie_file/abundances python2.7 /media/disk1/sijf/yanze/RNA-Seq/software/stringtie/prepDE.py -i count_list.txt -v 2>&1 | tee prepDE.log

And the error is :

processing sample 120076-1_a from file 120076-1_a.gtf processing sample 120076-2_a from file 120076-2_a.gtf processing sample 130490-1_a from file 130490-1_a.gtf processing sample 130490-2_a from file 130490-2_a.gtf processing sample 140191-1_a from file 140191-1_a.gtf processing sample 140191-2_a from file 140191-2_a.gtf processing sample 140195-1_a from file 140195-1_a.gtf processing sample 140195-2_a from file 140195-2_a.gtf processing sample 140386-1_a from file 140386-1_a.gtf processing sample 140386-2_a from file 140386-2_a.gtf Error: could not locate transcript MSTRG.2936.2 entry for sample 140386-2_a Traceback (most recent call last): File "/media/disk1/sijf/yanze/RNA-Seq/software/stringtie/prepDE.py", line 281, in geneDict.setdefault(geneIDs[i],{}) #gene_id KeyError: 'MSTRG.2936.2'

Then I view the prepDE.log and nothing in this file. By the way, when I ran the old verison of this script, the error was same as I ran today. Could you tell me reason and solve it? Thanks a lot : )

gpertea commented 5 years ago

Thank you so much @Gin-Wang I took a look at the data you uploaded and I can confirm now that the problem is real and correctly reported by users in this thread (and in #234), stringtie 2.0 with the -e option outputs a few STRG entries which should NOT be there, as the output should only contain the MSTRG entries that were given in the -G file.

We will fix this -e problem soon and report back here, until then please use stringtie v1.3.6 for quantification via prepDE (for short reads there shouldn't be a significant difference).

Gin-Wang commented 5 years ago

i'm glad that it may help, and thanks again for the great software stringtie and your work

gpertea commented 5 years ago

The main author fixed this issue in v2.0.4 which was just released (see https://github.com/gpertea/stringtie/releases/tag/v2.0.4). Now stringtie v2.0.4 should behave as expected with the -e option (just like v1.3.x) so the quantification pipeline using prepDE.py should no longer stumble at the 2nd file.

MarinaMann commented 5 years ago

Hi, I thought I'd add my 2 cents to this discussion since it's still listed as "open".

I was getting the same error messages that everyone has been showing above. As suggested, I counted lines in each of my ballgown/sample/sample.gtf files using wc -l. Each of these files should have had the same number of lines as the "stringtie_merged.gtf" file created during stringtie --merge, but did not. I was also following the protocol outlined by Pertea et al., 2016, which used stringtie v1.2.2 for their analysis.

However, I am using two versions of stringtie: v1.3.5, and v2.0.3. I found a solution to the problem of the following: In Pertea et al., they use 'stringtie –e –B -p 8 -G stringtie_merged.gtf -o ballgown/sample1/sample1.gtf sample1.sorted.bam'.

BUT, for both of the stringtie versions I have been using, this command only works as expected if the elements are shuffled: 'stringtie sample1.sorted.bam -G stringtie_merged.gtf -l sample1 -o ballgown/sample1/sample1.gtf -p 8 -e -B'.

It seems that following the published command line causes the newer versions of the program to miss some element of the input, which allows the output to not reply on -e as it should. Even though the man/help page of later versions of stringtie clearly shows the proper organization of input files being different than that used by Pertea et al., I think this is not something people would typically think to look at if they're just following the published protocol verbatim. @gpertea do you think you could build in an error that tells people the organization of their input order is wrong if their ballgown directory file line counts don't match the merged-files line count?

I am super happy that this program exists at all, and I also very happy with the amount of feedback and help available on this github issues page. Please keep up the good work, and I'd love to know if my solution is consistent on your end. Thanks.

gpertea commented 5 years ago

@MarinaMann, thank you for the nice words but that problem is actually unrelated to the stringtie -e issue in this thread. In fact the order of the command line parameters should never matter for StringTie (due to the way that argument parser is designed). If it did make a difference as it seemed to have happened in your case, that would have been an ugly bug as we've never intended it! But this was something else, a bit more insidious.. see below.

Not sure if it's visible in your browser, but do you notice any difference between these two lines below?

–e –B
-e -B

(in my browser it looks like the dashes in the first line are a bit longer and closer to the letter that follows them, compared to those on the second line).

There seems to have been a editing/typographical (?) error which affected our protocol paper when it was submitted for publishing, where sometimes the plain 'dash' or 'minus' character ('-') in our command lines got unexpectedly replaced with the "em dash" typographical symbol ('–'). I guess some word processing/publishing software like Microsoft Word even makes this kind of substitution automatically in some contexts. The issue was previously reported when users copied gffcompare commands from the protocol paper (see gpertea/gffcompare#3) but now I see it affected other command lines too.. Sorry about that confusion. Unfortunately this means that one should not be using simple copy&paste from the paper (or if one does, one should edit the line before running it in order to delete all the dashes and replace them with the regular '-' character)

nurie05 commented 1 year ago

Hello, I'm finding the same issue when using version 2.2.1., the obtained GTF files when using -e -B had less transcripts than the file use as -G and give problems when using prepDE.py . However, version 2.1.4 seems to work just fine.

gpertea commented 1 year ago

@nurie05, could you please tell me:

nurie05 commented 1 year ago

@gpertea regarding you questions:

MoonyScotch commented 1 year ago

Hello, I am getting the same problem when trying to use prepDE.py3 on assembly with -L.

My error is: Error: could not locate transcript ENST00000662089.1 entry for sample <sample_name> Traceback (most recent call last): File "/<path_to_file>/prepDE.py3", line 282, in <module> geneDict[geneIDs[i]][s[0]]+=v[s[0]] KeyError: '<sample_name>'

My estimation command is stringtie2 -o $assembled -p 2 -m 100 -L -e -b $stats -G /<path_to_reference>/gencode.v36.primary_assembly.annotation.gtf $BAM My quantification command is python /<path_to_file>/prepDE.py3 -i ${SAMPLE}

Could you please offer suggestions for how to troubleshoot please? I have successfully used prepDE.py3 on short read data from the same sample without an issue. Is this error specific to -L option?

MoonyScotch commented 1 year ago

Hi @gpertea, I just want to update that I figured out the issue with my problem. It turns out, when I used stringtie 2.2.1 from git clone on the github page, it installed a version that is not fully updated. When I reinstalled from your website, or downloaded directly from the latest release, the error I reported above is resolved. It is slightly misleading that both ways of installations gave stringtie 2.2.1. Hopefully this can help others solve their errors too.

gpertea commented 1 year ago

Thank you @MoonyScotch for the update - actually the github version is a couple of commits ahead of the "release" 2.2.1 version, and those changes made the -e option work more consistently for long/hybrid reads quantification, but (re-)introduced this issue for short reads.. as you and @nurie05 reported. Still haven't had the time to look more closely into that problem to solve it.
But indeed, as you observed, using the older 2.2.1 "official" release works OK for short reads (but creates a similar issue for long/hybrid input..).

smallfishcui commented 5 months ago

Hi, I am using the v2.2.3, and the problem persists. Should I switch back to 2.2.1? I m using only short reads.

thanks, Cui

gpertea commented 5 months ago

@smallfishcui indeed unfortunately I can confirm that this issue is back in v2.2.3 -- when -e is used with short reads, not all reference transcripts are reported in the output.
If you are using only short reads, I advise switching back to v2.2.1 until this short-read quantification issue is resolved. Sorry about the inconvenience.