ryanmelnyk / PyParanoid

Rapid and scalable homolog identification for bacterial genomes
MIT License

Empty Directories after Running BuildGroups.py #2

Closed: elizabethmcd closed this issue 6 years ago

elizabethmcd commented 6 years ago

Thanks for the awesome tool!

I have 70 Deltaproteobacteria genomes that I've run the BuildGroups.py script on, and I think I have that working. However, some of the directories in the master outfolder, such as the aligned and hmms directories, are empty after running this script. Are these empty until the PropagateGroups.py script is run? I was trying to just go straight from BuildGroups.py to pulling out orthologs with IdentifyOrthologs.py, but that depends on the hmms directory, which is currently empty.

Thanks!

ryanmelnyk commented 6 years ago

I think I found the issue. I'm guessing you were using BuildGroups.py without the --use_MP option?

I'm testing the fix right now and will push a new version hopefully later today.

elizabethmcd commented 6 years ago

I could be wrong, but I'm pretty sure I used the --use_MP option. I have another colleague that's been testing the pipeline and I know he used the --use_MP option and he also gets some empty directories from just running BuildGroups.py.

ryanmelnyk commented 6 years ago

I think there are two different issues going on. Right now, if "--use_MP" is not flagged, the function that builds HMMs is never called. If it is flagged, it does build the HMMs, but some directories will be empty if you use the "--clean" option.

Is there anything in the "all_groups.hmm" file for either you or your colleague?

elizabethmcd commented 6 years ago

That file is empty as well. I don't think we used the --clean option either.

ryanmelnyk commented 6 years ago

I pushed a new version to GitHub and PyPI with a bugfix for BuildGroups.py so upgrade to v0.2.2 and try running again.

Please use the --verbose argument and attach the output.

You will also have to make an empty file called prop_strainlist.txt in your output directory before running IdentifyOrthologs.py if you don't run PropagateGroups.py - otherwise that part should work ok.
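A minimal sketch of that workaround in Python, assuming the output directory already exists. The helper name `ensure_prop_strainlist` is hypothetical and not part of PyParanoid; only the file name prop_strainlist.txt comes from the comment above.

```python
from pathlib import Path


def ensure_prop_strainlist(outdir):
    """Create an empty prop_strainlist.txt in outdir if it is missing.

    Hypothetical helper, not part of PyParanoid. outdir is assumed to be
    the existing output directory written by BuildGroups.py.
    """
    path = Path(outdir) / "prop_strainlist.txt"
    # touch() creates the file when absent and never alters existing contents
    path.touch(exist_ok=True)
    return path
```

Calling `ensure_prop_strainlist("my_outdir")` (with your own output directory in place of the placeholder name) before IdentifyOrthologs.py should be equivalent to creating the file by hand.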

I'll leave this issue open but I may not have time to follow up until later in the week

elizabethmcd commented 6 years ago

Hi Ryan,

I updated to v0.2.2 and reran the BuildGroups.py script. I am still getting empty directories. I've attached the output file, pypar-output.txt, that I got with the --verbose flag. I'm planning on running the PropagateGroups.py script with a few new genome files that I have, so hopefully I can keep going with the analysis. I'll let you know if I have any problems with that.

ryanmelnyk commented 6 years ago

Hi Elizabeth,

Thanks for attaching the output - it looks like everything goes well until the final chunk of the pipeline, which depends on the executables for several programs being installed.

Can you double-check that all of the executables required work correctly and are accessible in your $PATH? They would be:

cd-hit muscle hmmemit hmmbuild

If you used Homebrew, be particularly careful about cd-hit. It only works on some versions of Mac OS X. Just try to run cd-hit from bash and see if you get an error message.
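One quick way to run that check for all four programs at once is the standard-library `shutil.which`, which looks a name up on `$PATH` the same way the shell does. This is only a sketch: the function name `missing_executables` is made up for illustration, and it only confirms that each name resolves to an executable, not that the program runs correctly.

```python
import shutil

# Executable names exactly as listed above
REQUIRED = ("cd-hit", "muscle", "hmmemit", "hmmbuild")


def missing_executables(names=REQUIRED):
    """Return the names that cannot be found on PATH (hypothetical helper)."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_executables()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All required executables were found on PATH.")
```

A name reported as missing may still be installed under a different spelling or in a directory that is not on `$PATH`.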

elizabethmcd commented 6 years ago

Ah I think that's it. I'm on a linux machine and cd-hit was put in a weird place. I'll try running it again.

elizabethmcd commented 6 years ago

Problem solved! There was a problem with cd-hit and my path, but it was a bit more interesting than it just not being in my path. Apparently when you install cd-hit with apt-get, a hard link to the cd-hit executables is created as cdhit, without the dash. Everything is placed in my path, but since PyParanoid calls cd-hit with a dash, it couldn't find it. I suspect other Linux users who install things with apt-get might hit this issue, and it takes a while to figure out (at least it did for me). The fix was just creating a soft link named cd-hit, with a dash, and everything works perfectly. I ran IdentifyOrthologs.py without propagating new groups and just renamed my strainlist as prop_strainlist, and everything looks good there as well.
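For anyone who wants to script the soft-link fix, here is a rough Python sketch. The function name `add_dashed_alias` and the `~/bin` directory in the usage comment are assumptions for illustration; the link has to live in some directory that is already on your `$PATH`.

```python
import os


def add_dashed_alias(target, link_dir, alias="cd-hit"):
    """Create link_dir/alias as a symlink to target (hypothetical helper).

    target is the path of the installed executable (e.g. the one named
    cdhit), and link_dir is assumed to be a directory on PATH.
    """
    link = os.path.join(link_dir, alias)
    # lexists() is also true for a dangling symlink, so nothing is overwritten
    if not os.path.lexists(link):
        os.symlink(os.path.abspath(target), link)
    return link


# Possible usage, assuming cdhit is on PATH and ~/bin is too:
#   import shutil
#   add_dashed_alias(shutil.which("cdhit"), os.path.expanduser("~/bin"))
```

This does the same thing as running `ln -s` by hand, which is what I actually did.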

ryanmelnyk commented 6 years ago

Awesome! Thanks for your detailed description and patience - I'll add more robust checking and error reporting for non-python dependencies in a future update.