peterjc / pico_galaxy

Galaxy tools and wrappers for sequence analysis

rxlr_motifs or rxlr_venn_workflow? #37

Closed Neato-Nick closed 3 years ago

Neato-Nick commented 3 years ago

Hi, I'm exploring galaxy, just got my own instance spun up.

In the public toolshed, I found rxlr_venn_workflow. Does this execute the same things as rxlr_motifs.py in this repository? Or, are the outputs slightly different somehow?

peterjc commented 3 years ago

The workflow calls the RXLR tool and other tools to plot a Venn diagram. It was more a proof of principle for how we might share workflows on the Galaxy Tool Shed than something very practical.

You probably want just the RXLR tool.

Neato-Nick commented 3 years ago

Thanks. The venn_workflow is taking a while to install through the toolshed, so I did the manual install of these tools in parallel. Running my data, I got an error. If you want me to post it as a separate issue that's fine, but I was hoping it was a minor thing you see all the time:

File "~/opt/galaxy/database/shed_tools/toolshed.g2.bx.psu.edu/repos/peterjc/tmhmm_and_signalp/7de64c8b258d/tmhmm_and_signalp/tools/protein_analysis/rxlr_motifs.py", line 273
    print "%s for %i sequences:" % (mode

Edit: This was actually an issue with the toolshed version, running Whisson

peterjc commented 3 years ago

I suspect that's a Python 2 bit of code being run under Python 3, although I'd like to see more of the error context to be sure. The RXLR tool has been updated to work under Python 3, but the workflow is most likely requesting an older Python 2 only version of the tool.

Tricky.

The workflow ought to be fine if you install the latest version of the RXLR tool. But perhaps I should update the workflow...
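
For anyone hitting the same SyntaxError: the line in the traceback uses the Python 2 print statement, which Python 3 rejects before the script even starts. A minimal sketch of the Python 3 form follows; the function name and the values passed to it are hypothetical and are not taken from rxlr_motifs.py.

```python
# Python 2 only (a SyntaxError under Python 3):
#     print "%s for %i sequences:" % (label, count)

def summary_line(label, count):
    """Build the message with %-formatting, which behaves the same on Python 2 and 3."""
    return "%s for %i sequences:" % (label, count)

# Hypothetical values, for illustration only.
print(summary_line("Example mode", 5))
```

Calling print as a function with a single pre-formatted string works on both interpreters, which is probably the least invasive way to port a line like this.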

Neato-Nick commented 3 years ago

Interesting, I think I was running an older version of the rxlr script. I updated it and all of the dependencies from toolshed to the most current version. Now I get a different error stemming from signalp, I guess that's progress. It looks less like a Python 2 vs 3 error but I could be wrong. Is there a way to force all python scripts called by rxlr_motifs.py to run under python 2.7?

Traceback (most recent call last):
  File "~/opt/galaxy/database/shed_tools/toolshed.g2.bx.psu.edu/repos/peterjc/tmhmm_and_signalp/a19b3ded8f33/tmhmm_and_signalp/tools/protein_analysis/signalp3.py", line 175, in <module>
     n=FASTA_CHUNK, truncate=truncate, max_len=MAX_LEN)
  File "~/opt/galaxy/database/shed_tools/toolshed.g2.bx.psu.edu/repos/peterjc/tmhmm_and_signalp/a19b3ded8f33/tmhmm_and_signalp/tools/protein_analysis/seq_analysis_utils.py", line 125, in split_fasta
     records.append(iterator.next())
AttributeError: 'generator' object has no attribute 'next'
Error 256 from SignalP:
python /~/opt/galaxy/database/shed_tools/toolshed.g2.bx.psu.edu/repos/peterjc/tmhmm_and_signalp/a19b3ded8f33/tmhmm_and_signalp/tools/protein_analysis/signalp3.py euk 0 1 /~/opt/galaxy/database/objects/d/4/6/dataset_d46651f7-b4d1-45d0-a057-e8978f37ee77.dat.fasta.tmp ~/opt/galaxy/database/objects/d/4/6/dataset_d46651f7-b4d1-45d0-a057-e8978f37ee77.dat.tabular.tmp

peterjc commented 3 years ago

Progress indeed. Sadly another Python 2 to 3 pain point, this one was fixed 3 years ago in 85915a57c568b213589db00037585b11ff52bf42

So again, hopefully all you need to do is update the signalp wrapper?

I'm unsure if there is a hack to specify Python 2.7, but I really don't want to go that route since the current versions of these wrappers should all work under Python 3
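
For reference, the AttributeError in the traceback comes from the Python 2 only `iterator.next()` method; the built-in `next(iterator)` works on both Python 2.6+ and Python 3. A minimal sketch of collecting records that way is below. Note that `take_batch` and the sample records are hypothetical and are not the actual `split_fasta` code.

```python
def take_batch(iterator, n):
    """Collect up to n items from iterator using the built-in next()."""
    records = []
    for _ in range(n):
        try:
            records.append(next(iterator))  # not iterator.next()
        except StopIteration:
            break  # input exhausted; return a short (possibly empty) batch
    return records

# Hypothetical (title, sequence) pairs standing in for parsed FASTA records.
pairs = iter([("seq1", "MKT"), ("seq2", "MRL"), ("seq3", "MAA")])
first = take_batch(pairs, 2)
second = take_batch(pairs, 2)
```

Here `first` holds the first two records and `second` the remaining one; a further call returns an empty list.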

Neato-Nick commented 3 years ago

Ah, indeed I see the tool ID in my error is toolshed.g2.bx.psu.edu/repos/peterjc/tmhmm_and_signalp/rxlr_motifs/0.0.14, and the commit you linked brought it up to 0.0.16. And yet my install shows I should be at signalp3 0.0.19. I think I'm having Galaxy learning curve issues.

peterjc commented 3 years ago

Whoops. I hadn't updated the main Tool Shed since 2017-09-21, so it didn't have the Python 3 fix for the "next" problem you ran into. Done now.

My apologies - there was a time when these wrappers were getting lots of small fixes, so doing an update for every change was over the top, but my Galaxy work has trailed off since then.

Neato-Nick commented 3 years ago

Thank you! Got past the python errors. Now it's actually SignalP... the dreaded error running HOW.

Very strange, I can run on the SignalP test data fine (~/opt/signalp-3.0/test/test.seq), but when it's my own proteins I get the 'error running HOW'.

Neato-Nick commented 3 years ago

So - this does not help your development project, but all three models are working for me now! I'm using the script from this repo rather than the galaxy toolshed, and I'm just running it using command-line usage rather than through galaxy like so:

    python2.7 rxlr_motifs.py ~/opt/signalp-3.0/test/test5.seq 2 Win2007 test5.win.out

Even though I called py2.7 directly, I had to fix one more python error near line 111 of rxlr_motifs.py, but I just followed this solution and it was no sweat.

I cannot reproduce the signalp errors. Even just running signalp directly from the dir I installed it in, I was getting that HOW error. But magically, it works when running my proteins via your script's signalp call.

Neato-Nick commented 3 years ago

Last thing, can I just ask about the Whisson output? I assumed that the union(hmm+re)=Y, but it's not quite adding up. Here are my numbers: Y = 290, hmm = 251, neither = 149423, re = 27.

Edit: Oh, is it union(hmm+re)=Y, and then hmm labels are genes only found with hmm and re are genes only found with regex?

peterjc commented 3 years ago

Could you expand on what you had to change in rxlr_motifs.py about StopIteration? A pull request would be even better of course.

I never did get to the bottom of "error running HOW", any insights are welcome on #24.

As to the Whisson output, I think you've got it now. Adding those four numbers should match the total sequence count.
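
As a quick check of that rule with the counts quoted above (a sketch only; it assumes each sequence receives exactly one of the four labels):

```python
# Counts reported above for the Whisson output.
counts = {"Y": 290, "hmm": 251, "re": 27, "neither": 149423}

# If the labels are mutually exclusive, their sum is the number of input sequences.
total = sum(counts.values())
print(total)  # 149991
```

If `total` differs from the number of records in the input FASTA file, then the four labels are not partitioning the input the way this sketch assumes.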

Neato-Nick commented 3 years ago

I'm glad you asked me to expand. I went back and found the change was actually in seq_analysis_utils.py (originally at line 111), not in rxlr_motifs.py. Here's what mine looks like now; lines 111-114 are what I added:

105         if max_len and len(seq) > max_len:
106             raise ValueError(
107                 "Sequence %s is length %i, max length %i"
108                 % (title.split()[0], len(seq), max_len)
109             )
110         #yield title, seq
111         try:
112             yield title, seq
113         except StopIteration:
114             return
115 #    raise StopIteration
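
Some background on why the commented-out `raise StopIteration` matters: under PEP 479 (the default behaviour from Python 3.7), a StopIteration raised inside a generator body is converted into a RuntimeError, so generators are expected to finish with a plain `return` or by falling off the end. The sketch below illustrates this with two hypothetical generators; it is not the code from seq_analysis_utils.py and assumes Python 3.7 or later.

```python
def records_old_style(pairs):
    """Generator ending with an explicit raise StopIteration (pre-PEP 479 habit)."""
    for title, seq in pairs:
        yield title, seq
    raise StopIteration  # converted to RuntimeError on Python 3.7+

def records_new_style(pairs):
    """Same generator, ending with a plain return instead."""
    for title, seq in pairs:
        yield title, seq
    return

# Hypothetical (title, sequence) pairs, for illustration only.
data = [("seq1", "MKT"), ("seq2", "MRL")]

try:
    list(records_old_style(data))
    old_style_error = None
except RuntimeError as err:
    old_style_error = str(err)

new_style_result = list(records_new_style(data))
```

On that reading, dropping the final `raise StopIteration` is likely the part of the change that matters; the try/except around the `yield` should be harmless either way.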

Part of me wonders if the "error running HOW" is related to file size or number of sequences. Their test5.seq works, and so does an input set of my proteins reduced by the regex & HMM searches and further reduced by splitting up the tasks as you described in https://github.com/peterjc/pico_galaxy/issues/24. You first thought it was related to your temp files, but ruled out 'user error' on your part in splitting them up. Maybe it is the number of sequences per input file? That being said, I also tried running all this in WSL on my Windows machine, and my SignalP install was giving me "error running HOW" even on their test.seq, which is one sequence, so it could be some combination of environment & input file.

Edit: I made a pull request implementing the code I pasted above https://github.com/peterjc/pico_galaxy/pull/38

peterjc commented 3 years ago

Thanks for #38, hopefully that's the Python 3 stuff dealt with.

I doubt we'll solve #24 and the "error running HOW" today :(

Can we close this issue, or do you think the workflow needs updating?

Neato-Nick commented 3 years ago

Nope we can close it! Thank you for all the responses. I doubt I would've gotten through the python stuff if you had asked me to open a new issue with each new error. I really appreciate you working closely with me.