LSA for Genomes and Metagenomes...it's fun! The pipeline is still under active development, so expect things to change quickly. We are also working on a preprint with lots of details and usage examples. Stay tuned!
The LSA pipeline has a fair number of dependencies, so we recommend that you use the Docker image. Follow this link to install Docker on your computer. See the section on using LSA for how to pull an image and run LSA with Docker.
If you are adventurous, you can try to install the LSA pipeline from source. (Again, we strongly suggest just using the Docker image.)
The pipeline has the following dependencies.
You'll need to have interpreters for the following languages installed on your machine.
Python is needed for the waf script to install redsvd. (Note: We have found that Miniconda does not work with waf, but the system Python works fine.)
If you need to install RubyGems, please go here.
These will be installed with Bundler. See the section on installation.
See the documentation for how to install them, but you can likely run something like this:
$ Rscript -e "install.packages('gplots', repos = 'http://cran.us.r-project.org')"
$ Rscript -e "install.packages('ape', repos = 'http://cran.us.r-project.org')"
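Before moving on, it can help to confirm that the command-line tools used in this README are actually on your PATH. The following is a minimal sketch, not part of the pipeline: `check_tools` is a hypothetical helper, and the tool list is an assumption drawn from the commands shown in this document, so adjust it to your setup.

```shell
# Hypothetical helper: report which of the given tools are on the PATH.
# Returns the number of missing tools as its exit status (0 means all found).
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing: $tool"
            missing=$((missing + 1))
        fi
    done
    return "$missing"
}

# Tool names assumed from the commands used in this README.
check_tools git make tar ruby gem bundle Rscript python \
    || echo "Install the missing tools before continuing."
```

This only checks that each command exists. It does not check versions, and it does not tell you whether `python` is the system Python or a Miniconda one.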
Assuming you have met all the dependencies, you need to get the pipeline code.
You can clone the repository like this:
$ git clone --recursive https://github.com/mooreryan/lsa_for_genomes.git
Note: don't forget the recursive flag!
This will give you the master branch, which is continuously updated. If you want a stable release, things are a bit more complicated since the git repo has submodules.
First, go and download the latest release.
Then untar the file.
$ tar xzf lsa_for_genomes-0.11.4.tar.gz
Next, you need to get the code for calculating the SVD, because the release doesn't contain the redsvd submodule.
$ cd lsa_for_genomes-0.11.4/vendor
$ git clone https://github.com/mooreryan/redsvd.git redsvd/
LSA needs specific versions of certain Ruby gems. First, you must have the bundler gem installed. Install it like so:
$ gem install bundler
Then in the source directory run
$ bundle install
which will manage the ruby gems for you.
Note: If you have other versions of the required ruby gems installed, this may break other ruby programs you have. In this case, I recommend you use something like RVM to manage Ruby and various gemsets.
Some parts of the pipeline require compiling. You can do that using make. From the source directory, run
$ make
Test out the pipeline with our toy data set to make sure everything works okay!
$ make test_lsa
If everything goes well, this will output a folder called output in the source directory. Check it out!
The first thing you need to do is to call ORFs on your genomes or contigs. You should have a file of ORFs for the most granular level of analysis that you want to do. For example, one file of ORFs per genome, one file of ORFs per sample, or one file of ORFs per contig.
Then you can make a metadata mapping file to do higher level groupings of data. The mapping file is a tab-delimited text file with a header line. The first column must be "file name" and match the file names of your input files (without the directory part, e.g., if your file is /home/mooreryan/orfs.faa, then only put the orfs.faa part in column 1). You can have as many additional columns as you wish.
You can see an example here. This mapping file is for the three test genomes.
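As a rough illustration of that layout, the sketch below writes a two-column mapping file from a list of ORF file paths, keeping only the base name of each path in column 1. The helper name `make_mapping`, the output name `mapping.txt`, the `group` column, and the `unassigned` placeholder are all hypothetical. The `file name` header label follows the description above, so compare it against the example mapping file before relying on it.

```shell
# Hypothetical helper: write a tab-delimited mapping file.
# Usage: make_mapping OUTPUT_FILE ORF_FILE [ORF_FILE ...]
make_mapping() {
    out=$1
    shift
    # Header line; "group" is an illustrative extra column.
    printf 'file name\tgroup\n' > "$out"
    for path in "$@"; do
        # Column 1 holds the file name without its directory part.
        printf '%s\tunassigned\n' "$(basename "$path")" >> "$out"
    done
}

make_mapping mapping.txt /home/mooreryan/orfs.faa
cat mapping.txt
```

After generating the file, replace the placeholder values with your real groupings and add more columns as needed.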
If you have the Docker version, you can use our little helper script to make things easy to run. Here is an example of running the test genomes we provide.
$ bin/run_lsa -i test_files/*.faa.gz -o output
This will pull the latest Docker image so that you're up to date, and then run the pipeline using the Docker image. If you don't want to update your Docker image, please use the docker run command as described in the Docker tutorials.
If you installed from source, here is an example of running the pipeline on the three test files.
$ ./lsa.rb -m `which mmseqs` -i test_files/*.faa.gz -o output
Note: In this case mmseqs was already on my path, so I could use the little backtick trick: the shell runs which mmseqs first and passes the resulting path to -m.
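If mmseqs is not on your path, the backtick substitution expands to nothing and the -m flag ends up with the wrong argument. One way to fail early is sketched below. `resolve_tool` is a hypothetical helper, and the sketch assumes you run it from the source directory where lsa.rb lives.

```shell
# Hypothetical helper: print the full path of a tool, or fail with a message.
resolve_tool() {
    command -v "$1" 2>/dev/null || {
        echo "$1 not found on PATH; pass its full path instead" >&2
        return 1
    }
}

# Only launch the pipeline when mmseqs resolves and lsa.rb is present.
if mmseqs_path=$(resolve_tool mmseqs) && [ -x ./lsa.rb ]; then
    ./lsa.rb -m "$mmseqs_path" -i test_files/*.faa.gz -o output
fi
```

The `$(...)` form does the same job as backticks but is easier to quote and nest.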
And you'll get a whole bunch of output.