vlasmirnov / MAGUS

Graph Clustering Merger
MIT License

Example fails invoking mafft #12

Open rcedgar opened 3 years ago

rcedgar commented 3 years ago

Cloned the git repo today onto a clean Ubuntu install (AWS c5a.4xlarge instance with Ubuntu 20.04). Installed the dendropy dependency.

cd example
python3 ../magus.py -d outputs -i unaligned_sequences.txt -o magus_result.txt

# ...some output deleted...

subprocess.CalledProcessError: 
Command '/home/ubuntu/magus/MAGUS-master/tools/mafft/mafft --localpair --maxiterate 1000 --ep 0.123 --quiet --thread 16 --anysymbol /home/ubuntu/magus/MAGUS-master/example/outputs/decomposition/initial_tree/skeleton_sequences.txt > /home/ubuntu/magus/MAGUS-master/example/outputs/decomposition/initial_tree/temp_initial_align.txt' 
returned non-zero exit status 126.

vlasmirnov commented 3 years ago

Thanks a lot for writing. Regarding your issue, two possibilities come to mind:
1) Make sure the execute permission is set on MAFFT (and on the other tools that are packaged with MAGUS).
2) You might need to replace the packaged MAFFT executable with one built for your system (https://mafft.cbrc.jp/alignment/software/).

Please let me know if any of this helps. Also, there might be more information in the error log.
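Editor's note: exit status 126 is the shell's conventional code for "command found but could not be executed", which is consistent with the permission explanation above. The following is a minimal sketch that reproduces that status on a POSIX system. It assumes the command is run through the shell (the `>` redirect in the failing command suggests so); the script name `tool.sh` and the function name are made up for illustration and are not part of MAGUS.

```python
import os
import stat
import subprocess
import tempfile


def exit_status_without_exec_bit() -> int:
    """Run a script that lacks the execute bit via the shell; return its exit status."""
    with tempfile.TemporaryDirectory() as tmp:
        script = os.path.join(tmp, "tool.sh")  # hypothetical stand-in for tools/mafft/mafft
        with open(script, "w") as fh:
            fh.write("#!/bin/sh\necho hello\n")
        # Mode rw-------: readable and writable, but no execute bit for anyone.
        os.chmod(script, stat.S_IRUSR | stat.S_IWUSR)
        # shell=True hands the path to /bin/sh, which finds the file but cannot execute it.
        result = subprocess.run(script, shell=True, capture_output=True)
        return result.returncode


if __name__ == "__main__":
    print(exit_status_without_exec_bit())
```

A missing execute bit is only one cause of this status. A binary built for a different system can fail in similar ways, so a passing permission check does not rule out the second possibility.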

rcedgar commented 3 years ago

Vlad -- Thanks for the quick reply.

The install instructions do not mention setting permissions.

The execute bit was not set for the main mafft script, but setting it did not fix the problem: I get the same error after the execute bit is set.

FYI, this is for comparative validation against other MSA methods, and I have limited patience for troubleshooting buggy code / buggy install instructions here. If you can provide complete instructions for setting up MAGUS on a clean Ubuntu 20.04, I will be glad to include MAGUS in the comparison. A simple way for you to fix the install instructions is to install on a clean Ubuntu 20.04 AWS t2.micro instance; this is free tier, so it will not cost anything.

vlasmirnov commented 3 years ago

Sounds good, I'll take a look at what's going on with AWS when I get the chance. I apologize for the inconvenience. My guess is that this MAFFT distribution was built for Debian, although it seems to work on my home Ubuntu.

In the meantime, if you were planning to include MAFFT in your comparison, the easiest thing to do would be to overwrite MAGUS's packaged MAFFT with your working MAFFT copy (the mafft script goes into tools/mafft/mafftdir/bin, and the binaries go into tools/mafft/mafftdir/libexec). Or, in configuration.py, change the "mafftPath" line to wherever the mafft script is installed.
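Editor's note: the second option (pointing "mafftPath" at a working install) could be automated roughly as follows. This is a sketch under assumptions: `resolve_mafft` is a hypothetical helper, not a MAGUS function, and the exact form of the "mafftPath" line in configuration.py is not shown in this thread. It prefers the packaged script when it is executable and otherwise falls back to a `mafft` found on the PATH.

```python
import os
import shutil


def resolve_mafft(packaged_path):
    """Return the packaged mafft script if it is executable, else a mafft found on PATH.

    Returns None when neither is available.
    """
    if os.path.isfile(packaged_path) and os.access(packaged_path, os.X_OK):
        return packaged_path
    # shutil.which searches PATH and returns None if no executable named mafft exists.
    return shutil.which("mafft")


if __name__ == "__main__":
    # Path taken from the layout described above, relative to the MAGUS checkout.
    print(resolve_mafft(os.path.join("tools", "mafft", "mafft")))
```

The executable check only looks at permissions. It cannot tell whether the binaries under libexec were built for the current system, so a system-wide MAFFT may still be the safer choice.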

Alternatively, if you were planning to include PASTA in your comparison, MAGUS copies PASTA's directory structure for MAFFT, so you can just copy PASTA's MAFFT installation directly over.

rcedgar commented 3 years ago

I do plan to include stand-alone MAFFT, but I don't see the relevance -- I would install it and run it on a totally separate machine (i.e. separate AWS instance) without MAGUS. What is PASTA? Maybe I should ask which Warnow lab method(s) I should be testing for large input datasets?

vlasmirnov commented 3 years ago

PASTA (https://github.com/smirarab/pasta) is an alignment method for large datasets, which grew out of a previous method called SATe II. MAGUS grew out of PASTA in turn. In a sense, PASTA is "SATe III" and MAGUS is "SATe IV". For very large datasets, another method to consider is UPP (https://github.com/smirarab/sepp). It tends to be faster than PASTA/MAGUS, but accuracy tends to suffer.

The best choice of method depends on how large your datasets are. MAGUS and PASTA both use MAFFT L-INS-i internally, so if your dataset is a few hundred sequences, then standalone MAFFT L-INS-i should give about the same result. For larger and more heterogeneous datasets, the other methods tend to give better results.

rcedgar commented 3 years ago

Great feedback, thanks. My main interest is in aligning 140k RdRP sequences for novel RNA virus species recently discovered by mining the SRA (https://www.biorxiv.org/content/10.1101/2020.08.07.241729v2). This has got me interested in MSA methods again, and I'm working on a new algorithm and a new benchmark. The RdRPs are an ideal real-world case for applying and validating methods like MAGUS because there is an independent check on the alignments: identifying conserved motifs (https://github.com/rcedgar/palmscan).

vlasmirnov commented 3 years ago

I see, that makes sense. I'd be very curious to see how well MAGUS performs on biological datasets different from those that we used to test it. If you'd like to obtain a MAGUS alignment but are having issues getting it to work in your environment, I'd be happy to try aligning your dataset on our campus cluster.

MinhyukPark commented 2 years ago

Bumping an old thread, but I encountered the same issue while running MAGUS in a VM. Instead of setting the individual scripts as executable, running chmod -R +x ./tools/mafft/ seemed to do the trick for me.
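Editor's note: for setups where the fix has to happen from Python (for example inside a wrapper script), the sketch below does roughly what the recursive chmod does. It is an approximation, not an exact equivalent: it touches regular files only, skips symlinks, and sets the execute bit for user, group, and other regardless of umask. The function name `add_execute_bits` is made up for illustration.

```python
import os
import stat


def add_execute_bits(root):
    """Set the execute bits on every regular file under root; return the changed paths."""
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Leave symlinks and special files alone.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            mode = stat.S_IMODE(os.stat(path).st_mode)
            new_mode = mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH
            if new_mode != mode:
                os.chmod(path, new_mode)
                changed.append(path)
    return sorted(changed)


if __name__ == "__main__":
    for path in add_execute_bits(os.path.join("tools", "mafft")):
        print("made executable:", path)
```

The returned list also doubles as a diagnostic: any path it reports was missing an execute bit, which would explain why fixing only the main mafft script was not enough.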

rmukaila commented 2 years ago

I had the same issue recently, but it turns out MAGUS has problems with special characters in sequence headers. A friend reported that preprocessing the sequence headers to look as simple as those in the example sequences file in the MAGUS repo fixed it. That applies if you are able to run the example sequences without problems.
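Editor's note: one way to do that kind of preprocessing is sketched below. The comment does not say which characters cause trouble, so this sketch makes the conservative assumption that plain sequential IDs are safe and replaces every FASTA header outright, keeping a mapping so the original names can be restored afterwards. The function name and the `seqN` naming scheme are made up for illustration.

```python
def simplify_fasta_headers(lines):
    """Replace each FASTA header with a plain ID (seq1, seq2, ...).

    Returns (new_lines, mapping), where mapping goes from new ID to original header text.
    """
    mapping = {}
    new_lines = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            new_id = f"seq{len(mapping) + 1}"
            mapping[new_id] = line[1:].strip()  # original header without the leading '>'
            new_lines.append(">" + new_id)
        else:
            new_lines.append(line)  # sequence data is passed through unchanged
    return new_lines, mapping


if __name__ == "__main__":
    records = [">sp|P12345|RDRP virus (strain A/1)", "ACGT", ">x y", "GG"]
    cleaned, names = simplify_fasta_headers(records)
    print("\n".join(cleaned))
    print(names)
```

Sequential IDs avoid the collisions that can occur when special characters are simply stripped or replaced with underscores.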

lrauschning commented 1 year ago

Ran into a similar issue while writing an nf-core module for MAGUS. I just chatted with @mashehu at the nf-core hackathon, and we figured out that the issue is caused by the BusyBox version of chown (which runs on the AWS machines) not supporting the --from parameter. Posting here in case anyone else comes across this in the future; it took quite a while to figure out.