gpertea / stringtie

Transcript assembly and quantification for RNA-Seq
MIT License

prepDE.py stops at the second file #234

Closed: AishMandya closed this issue 4 years ago

AishMandya commented 5 years ago

@tsznxx @nongbaoting @gpertea Hi, I have modified the sample list so that each line has the form "label<space>file path", but the error persists. Each GTF file is inside a subdirectory of the ballgown directory generated by stringtie -B -e.
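Before suspecting the script itself, it can help to rule out problems in the sample list. The following is a minimal sketch, not part of prepDE.py, that parses "label path" lines and reports duplicate labels or missing files. The function names and the example paths are hypothetical, and skipping blank and `#` lines is an assumption of this sketch.

```python
from pathlib import Path


def read_sample_list(text):
    """Parse 'label<whitespace>path' lines into (label, Path) pairs."""
    samples = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # assumption: blank lines and comments are ignored
        fields = line.split(None, 1)
        if len(fields) != 2:
            raise ValueError(f"line {lineno}: expected 'label path', got {line!r}")
        label, path = fields
        samples.append((label, Path(path.strip())))
    return samples


def check_samples(samples):
    """Return human-readable problems: duplicate labels and missing files."""
    problems = []
    seen = set()
    for label, path in samples:
        if label in seen:
            problems.append(f"duplicate label: {label}")
        seen.add(label)
        if not path.is_file():
            problems.append(f"missing GTF for {label}: {path}")
    return problems


# Hypothetical sample list; the paths are placeholders, not taken from the thread.
example = "A1_S1 ballgown/A1_S1/A1_S1.gtf\nA2_S2 ballgown/A2_S2/A2_S2.gtf\n"
for label, path in read_sample_list(example):
    print(label, path)
```

With a real list, `check_samples(read_sample_list(open("sample_1st.txt").read()))` would return an empty list when every label is unique and every file exists.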

$ prepDE.py -i sample_1st.txt

output:
0 A1_S1
1 A2_S2
Traceback (most recent call last):
  File "prepDE.py", line 257, in <module>
    geneDict[geneIDs[i]][s[0]]+=v[s[0]]
KeyError: 'A2_S2'

Similar to issue #232.

lelesama commented 5 years ago

Hi, I have the same problem. Have you solved it?

AishMandya commented 5 years ago

No, not yet. It seems to stop working at the second iteration, no matter which file it is. So it may be a glitch in the code or in the way I used stringtie to generate the GTF files. Also, I don't fully understand how the code works, so it's definitely inconclusive.

lelesama commented 5 years ago

Hi AishMandya, it was actually driving me crazy! But when I use an older version, it works! Maybe you can try that. I hope it helps.

e-lerat commented 5 years ago

Hi everyone. Sorry, I can't help: I have the exact same problem. I even re-ran everything, since at first I thought that I didn't have the right genome GTF file. Unfortunately, it still does not work. If you get the answer, I am really interested!

Emmanuelle

AishMandya commented 5 years ago

Thanks, I'll try that!


AishMandya commented 5 years ago

Hey, so I haven't exactly found the answer for prepDE.py. What I did instead was use tximport to read the ctab files that stringtie generated for each sample, and use that output for DESeq2 and edgeR.


SofiaZhangtj commented 4 years ago

Did you use --merge during the stringtie step?

AishMandya commented 4 years ago

Hi, yes, I did use --merge. All --merge does is merge the GTF files to give the commonly hit genes, with no TPM info. The statistics also relate to the common genes found, not to their TPM or log fold change. I ended up writing code that gets the TPM values, averages them across samples, and calculates the log fold change using the common genes output by --merge. I don't think the method is very reliable, because it's not recommended to do any kind of DE on TPM values. But stringtie also outputs unannotated genes (or sequences), which may be very useful.
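The code described above was not posted, so the following is only a sketch of what such a calculation might look like: a log2 fold change between the mean TPM of two groups. The function names, the pseudocount (added so that zero means do not break the logarithm), and the TPM values are all assumptions. As noted above, this is not a substitute for count-based DE analysis.

```python
import math


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def log2_fold_change(tpm_a, tpm_b, pseudocount=1.0):
    """log2 of (mean TPM of group B + pseudocount) over (mean TPM of group A + pseudocount)."""
    return math.log2((mean(tpm_b) + pseudocount) / (mean(tpm_a) + pseudocount))


# Hypothetical TPM values for one gene in two groups of three samples.
control = [3.0, 3.0, 3.0]
treated = [7.0, 7.0, 7.0]
print(log2_fold_change(control, treated))  # (7 + 1) / (3 + 1) = 2, so this prints 1.0
```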

Finally, I ended up doing most of my analysis with salmon, tximport and edgeR/DESeq2.

Thanks, Aish


SofiaZhangtj commented 4 years ago

Hi Aish, thank you very much for your kind answer. I think I finally found my problem. This prepDE.py script is supposedly based on version 1.2; the software has been updated many times since, but the scripts have not. I changed my stringtie version from 2 to 1.3.3, and then the script worked.

coreyscipione commented 4 years ago

Hi @SofiaZhangtj and @AishMandya, I am using stringtie 1.3.3 with prepDE.py to generate files for DESeq2, and I keep getting stuck at this error:

Traceback (most recent call last):
  File "/cluster/home/cscipion/scripts/prepDE", line 257, in <module>
    geneDict[geneIDs[i]][s[0]]+=v[s[0]]
KeyError: 'B1pr_S13_L002'

I have successfully used the sample 'B1pr_S13_L002' in several other comparisons, but this set of samples is rejecting it for some reason.

Any thoughts? Maybe @gpertea can help.

SofiaZhangtj commented 4 years ago


Hi, I found that even the latest version, 1.3.6, works for me. I think I ran into the same problem as you at the beginning. At that time I wasn't using the "-e" parameter in stringtie.

gpertea commented 4 years ago

I'll investigate the possibility that some changes in StringTie v2 may have affected compatibility with prepDE.py, but in the past there were a lot of "errors" alleged by users of prepDE.py which were mainly caused by incorrect usage of the script. To reiterate and clarify: prepDE.py can only be used on a set of stringtie GTF outputs if stringtie was run, for all those outputs:

  • with the -e option
  • with the same file for the -G option.

Also, make sure that no other GTF files (like the reference annotation file) are present in those sub-directories. Only the stringtie output GTF files should be found there, because the default mode of operation for prepDE is to scan all the sub-directories there for .gtf files, which are all expected to have been produced by stringtie following the requirements above (-e option, same -G file).
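One way to test these requirements before running prepDE.py is to compare the transcript IDs across the per-sample GTF files. The sketch below is not part of StringTie. It assumes that files produced with -e and the same -G file should all list the same set of transcript IDs, and the helper names and in-memory example lines are hypothetical.

```python
import re

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def transcript_ids(gtf_lines):
    """Collect transcript_id values from the 'transcript' feature lines of a GTF."""
    ids = set()
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue
        match = TRANSCRIPT_ID.search(fields[8])
        if match:
            ids.add(match.group(1))
    return ids


def compare_samples(ids_by_sample):
    """Map each sample to the IDs it lacks relative to the union over all samples."""
    all_ids = set().union(*ids_by_sample.values())
    return {
        name: sorted(all_ids - ids)
        for name, ids in ids_by_sample.items()
        if all_ids - ids
    }


# Hypothetical two-sample example; sample B lacks transcript T2.
sample_a = [
    'chr1\tStringTie\ttranscript\t1\t100\t1000\t+\t.\tgene_id "G1"; transcript_id "T1";\n',
    'chr1\tStringTie\ttranscript\t200\t300\t1000\t+\t.\tgene_id "G2"; transcript_id "T2";\n',
]
sample_b = sample_a[:1]
print(compare_samples({"A": transcript_ids(sample_a), "B": transcript_ids(sample_b)}))
# {'B': ['T2']}
```

With real data, an open file handle can be passed as `gtf_lines`. An empty result means every sample lists the same transcripts.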

SofiaZhangtj commented 4 years ago

@gpertea Hi Pertea, thank you for your reply. The output GTF files of stringtie v2 have different numbers of lines, whereas in the previous version the numbers were the same. The t_data.ctab files, however, remain the same as in the older version. I think that's why prepDE.py doesn't work with version 2.0 in my case.

Line counts for the older version (GTF files; two pairs of identical sequencing data): 1 2
Line counts for the new version: 3 4

Any suggestions will be helpful. Thank you.

coreyscipione commented 4 years ago

I am using the -e and -B options, and there is only one .gtf in each directory. The oddity is that I have 23 samples (7 triplicates + 2 others). Samples 1-3 have been compared against 4-6, 7-9, and 16-18 with no issues. The only time I get the previously mentioned error is when I compare 1-3 vs 10-12. Since I am sure everything is set up correctly in this case, I am tempted to think that it's a stringtie issue and not a syntax problem. Any further suggestions? Thank you all!

The error is:

Traceback (most recent call last):
  File "/cluster/home/cscipion/scripts/prepDE", line 257, in <module>
    geneDict[geneIDs[i]][s[0]]+=v[s[0]]
KeyError: 'B1pr_S13_L002'

gpertea commented 4 years ago

I've added some consistency checking to the prepDE.py script when reading the input data; it should catch some common usage errors. Could you please download the latest prepDE.py script (https://raw.githubusercontent.com/gpertea/stringtie/master/prepDE.py), place it in your working directory, make sure it's executable, and then run it again with the same parameters you used before, but this time add the -v option and capture the output in a file, with a command like this:

./prepDE.py (your parameters here) -v 2>&1 | tee prepDE.log

(Use the link above to get this updated script, or you can also download the attached prepDE.py.gz (https://github.com/gpertea/stringtie/files/3722825/prepDE.py.gz), copy it into your working directory, gunzip it and make it executable, then make sure you run it with ./prepDE.py)

You can then show the prepDE.log here or email it to me.

AishMandya commented 4 years ago

Thank you, @gpertea!


nalcala commented 4 years ago

Hi,

I am having the same issue (error at the second file). The log file using the "new" script is:

processing sample S001_T from file ./S001_T/S001_T_ST.gtf
processing sample S002_T2 from file ./S002_T2/S002_T2_ST.gtf
Error: could not locate transcript S001_T.20797.1 entry for sample S002_T2
Traceback (most recent call last):
  File "prepDE.py", line 283, in <module>
    geneDict[geneIDs[i]][s[0]]+=v[s[0]]
KeyError: 'S002_T2'

I don't really understand because the previous line in the script geneDict[geneIDs[i]].setdefault(s[0],0) should have created a key for s[0]...
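One possible reading of this traceback: `setdefault` only guarantees the key on the left-hand side, while the statement also reads `v[s[0]]` on the right. If `v` holds per-sample values for one transcript and has no entry for the second sample, the `KeyError` would come from that lookup, which matches the "could not locate transcript ... entry" message. The sketch below reproduces this with hypothetical stand-ins for the script's structures. What `v` holds in prepDE.py is an assumption here, as is the gene ID.

```python
# Hypothetical stand-ins for the structures named in the traceback.
gene_ids = {"S001_T.20797.1": "S001_T.20797"}           # transcript -> gene (gene ID invented)
transcript_counts = {"S001_T.20797.1": {"S001_T": 12}}  # transcript -> {sample: count}
samples = ["S001_T", "S002_T2"]

gene_counts = {}
failed = None
for transcript, per_sample in transcript_counts.items():
    gene = gene_ids[transcript]
    gene_counts.setdefault(gene, {})
    for sample in samples:
        gene_counts[gene].setdefault(sample, 0)  # the left-hand key now exists
        try:
            # The right-hand lookup can still fail if this transcript
            # has no entry for the sample.
            gene_counts[gene][sample] += per_sample[sample]
        except KeyError as err:
            failed = err.args[0]
            print("KeyError:", repr(failed))  # KeyError: 'S002_T2'
```

Reading `per_sample.get(sample, 0)` instead would avoid the exception, but it would also hide the underlying inconsistency between the samples' GTF files.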

Thanks!

nourelislam commented 4 years ago

Quoting @gpertea: "prepDE.py can only be used on a set of stringtie GTF outputs if stringtie was run, for all those outputs:

  • with the -e option
  • with the same file for the -G option."

Is it mandatory to use the -e option? As a matter of fact, I am working on detecting novel splice sites, so I should leave out the -e option. @gpertea

gpertea commented 4 years ago

This (the OP) seems to be the same issue as #232, so it should be fixed in the v2.0.4 release.

gpertea commented 4 years ago

It is also the same as #238. I'll leave only that issue open for a while, for user confirmation that the problem was fixed in v2.0.4.

nalcala commented 4 years ago

@gpertea I am all good now with the new versions, on my side you can close the issue. Thanks a lot!

ElzaFosneca commented 1 year ago

Hi @gpertea,

I am running version v2.2.1 and I'm getting the same error.

prepDE.log

RJEGR commented 1 year ago

Hi everyone,

Same error with StringTie v2.2.1.

Using the diagnostic version of prepDE.py that @gpertea made, the output is:

prepDE.py -i samples.txt -v 2>&1 | tee prepDE.log

processing sample SRR8956796 from file /home/rvazquez/RNA_SEQ_ANALYSIS/ASSEMBLY/STRINGTIE/QUANTIFICATION/DENOVO_MODE/SRR8956796_eB_dir/SRR8956796_eB.gtf
processing sample SRR8956797 from file /home/rvazquez/RNA_SEQ_ANALYSIS/ASSEMBLY/STRINGTIE/QUANTIFICATION/DENOVO_MODE/SRR8956797_eB_dir/SRR8956797_eB.gtf
Error: could not locate transcript MSTRG.31643.1 entry for sample SRR8956797
Traceback (most recent call last):
  File "/home/rvazquez/RNA_SEQ_ANALYSIS/stringtie/prepDE.py", line 284, in <module>
    geneDict.setdefault(geneIDs[i],{}) #gene_id
KeyError: 'MSTRG.31643.1'

Although this issue is closed, no one has mentioned that the StringTie v2.2.1 problem is solved by using prepDE.py3:

prepDE.py3 -i samples.txt -v 2>&1
...
..writing transcript_count_matrix.csv
..writing gene_count_matrix.csv
All done.

jubiology commented 1 year ago

I also encounter the same problems with all tested versions of StringTie. When I use the prepDE.py3 script, it gives me a very weird gene count matrix, where samples 2-x show massive zero inflation while sample 1 looks normal. Also, the last line does not look as expected:

image

If anybody has any hints on how to solve this please let me know.

Edit: The error disappeared when I ran stringtie without the -x option. I'm not sure why this option caused the error, but now everything works.
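To put a number on the kind of zero inflation described above, the fraction of zero counts per sample can be computed from the count matrix. This is a minimal sketch, assuming a comma-separated matrix whose first column holds the gene ID and whose remaining columns are samples. The function name and demo data are hypothetical, and rows with an unexpected number of fields are skipped.

```python
import csv
import io


def zero_fraction_per_sample(csv_file):
    """Return {sample: fraction of genes with a zero count} for a count matrix."""
    reader = csv.reader(csv_file)
    header = next(reader)
    samples = header[1:]  # assumption: the first column is the gene ID
    zeros = [0] * len(samples)
    rows = 0
    for row in reader:
        if len(row) != len(header):
            continue  # skip malformed lines, e.g. a truncated last line
        rows += 1
        for index, value in enumerate(row[1:]):
            if float(value) == 0:
                zeros[index] += 1
    return {
        name: (count / rows if rows else 0.0)
        for name, count in zip(samples, zeros)
    }


# Hypothetical four-gene matrix in which the second sample is mostly zeros.
demo = io.StringIO("gene_id,S1,S2\nG1,10,0\nG2,5,0\nG3,0,7\nG4,3,0\n")
print(zero_fraction_per_sample(demo))
# {'S1': 0.25, 'S2': 0.75}
```

With real output, the same function can be called on `open("gene_count_matrix.csv", newline="")`. A large gap between the first sample and the rest would match the pattern reported here.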