mskwark / PconsC3

Faster, more accurate and entirely open source method for predicting contacts in proteins
GNU General Public License v2.0

PconsC3

If you use PconsC3 please cite:

Prerequisites

If Julia, Python and CD-HIT are in your search path, you are set to go. Otherwise, you need to either add them to the path, or modify the necessary scripts.

The scripts to modify are rungdca.py and runplm.py for Julia, and alignmentstats.py for CD-HIT.
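To confirm that the prerequisites are reachable before going further, you can look each one up on the search path. The following is a minimal sketch, not part of PconsC3: `missing_tools` is a hypothetical helper, and the executable names `julia`, `python` and `cd-hit` are assumptions that may differ on your system (for example `python3`).

```python
import shutil


def missing_tools(names=("julia", "python", "cd-hit")):
    """Return the executables from `names` that are not on the search path.

    The default names are assumptions; adjust them to your installation.
    """
    # shutil.which returns None when an executable cannot be found on PATH.
    return [name for name in names if shutil.which(name) is None]
```

Calling `missing_tools()` returns an empty list when everything was found; any name it does return needs to be added to the path or hard-coded in the corresponding script.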

Installation

  1. Check out PconsC3 from GitHub
    git clone https://github.com/mskwark/PconsC3.git
  2. Install Julia packages. Start Julia and install:
    • NLopt.jl.
      julia> Pkg.add("NLopt")
    • GaussDCA
      julia> Pkg.clone("https://github.com/carlobaldassi/GaussDCA.jl")
      julia> Pkg.clone("https://github.com/carlobaldassi/ArgParse.jl")
    • PlmDCA. Install a (very slightly) modified version of PlmDCA.jl
      julia> Pkg.clone("https://github.com/mskwark/PlmDCA")
  3. Make sure the prerequisites are installed.

Running the software

Before the first run, download and unpack the trained Random Forests into the same directory as the PconsC3 code. You should end up with six subdirectories named tforest0, tforest1, ..., tforest5. The archive is available from Google Drive or a local mirror (378 MiB). If you just want to give PconsC3 a try, or for some other reason need a smaller archive, a mini-version is available from Google Drive or a local mirror (38 MB). Be advised that the mini-version may not perform as well as the full one, but it will be roughly 10x faster.

> tar -xJf pconsc3-forests.tar.xz

You may want to put them on a fast filesystem. On a relatively recent Linux machine /dev/shm/ is a good choice, and by default PconsC3 will look for them there (i.e. it will check whether /dev/shm/tforest0 etc. exist and are sane). As a fallback, it will look in the directory where ./predict.py is located. To change this, modify the forestlocation variable at the top of ./predict.py.
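The lookup order described above can be sketched as follows. This is an illustration under stated assumptions, not the code in ./predict.py: `find_forest_dir` is a hypothetical helper, and it only checks that the six tforest subdirectories exist, whereas the actual sanity check may be stricter.

```python
import os


def find_forest_dir(candidates, n_forests=6):
    """Return the first directory in `candidates` that contains
    tforest0 .. tforest{n_forests - 1}, or None if no candidate does.

    Assumption: "present" means each subdirectory exists; file contents
    are not inspected.
    """
    for base in candidates:
        if all(
            os.path.isdir(os.path.join(base, "tforest%d" % i))
            for i in range(n_forests)
        ):
            return base
    return None
```

For example, `find_forest_dir(["/dev/shm", "."])` tries the fast filesystem first and falls back to the current directory.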

To run PconsC3, you need to have at hand an alignment, a file of contact priors, a secondary structure prediction and an RSA prediction.

You can name these files any way you want. Assuming your alignment is named myprotein.fas, your contact priors external.RR, your secondary structure prediction psipred.ss2 and your RSA file netsurf.rsa, run the prediction as follows.

  1. Infer evolutionary couplings with GaussDCA:

    ./rungdca.py myprotein.fas

    It will produce a file named myprotein.gdca

  2. Infer evolutionary couplings with plmDCA.jl:

    ./runplm.py myprotein.fas

    It will produce a file named myprotein.0.02.plm20

  3. Compute alignment statistics:

    ./alignmentstats.py myprotein.fas

    It will produce a file named myprotein.stats

  4. Run PconsC3:

    ./predict.py myprotein.gdca myprotein.0.02.plm20 external.RR netsurf.rsa psipred.ss2 myprotein.stats myprotein.fas outputfile

    This will run for a while, but it will provide you with estimates of the running time. It generates a number of intermediate files (outputfile.l0, outputfile.l1, ..., outputfile.l5) and a file outputfile.RR containing the final predictions in RR format (by default only non-local predictions are output).
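The four steps above can be chained in a small wrapper. This is a sketch, not something shipped with PconsC3: `pconsc3_commands` and `run_pconsc3` are hypothetical helpers. It assumes that the intermediate file names are derived by replacing the alignment's extension, as in the myprotein.fas example, and that the wrapper itself runs under Python 3.5 or later (it only launches the scripts, which run under whatever interpreter they declare).

```python
import os
import subprocess


def pconsc3_commands(alignment, priors, rsa, ss2, output):
    """Build the command lines for steps 1-4, in order.

    Assumption: output names follow the example in the text, i.e.
    "myprotein.fas" yields "myprotein.gdca", "myprotein.0.02.plm20"
    and "myprotein.stats".
    """
    stem = os.path.splitext(alignment)[0]  # "myprotein.fas" -> "myprotein"
    gdca = stem + ".gdca"
    plm = stem + ".0.02.plm20"
    stats = stem + ".stats"
    return [
        ["./rungdca.py", alignment],
        ["./runplm.py", alignment],
        ["./alignmentstats.py", alignment],
        ["./predict.py", gdca, plm, priors, rsa, ss2, stats, alignment, output],
    ]


def run_pconsc3(alignment, priors, rsa, ss2, output):
    """Run each step in turn, stopping at the first failure."""
    for cmd in pconsc3_commands(alignment, priors, rsa, ss2, output):
        # check=True raises CalledProcessError on a non-zero exit status.
        subprocess.run(cmd, check=True)
```

Run from the PconsC3 directory, `run_pconsc3("myprotein.fas", "external.RR", "netsurf.rsa", "psipred.ss2", "outputfile")` would issue the same commands as the manual steps.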

Parallel version and HDF5 support (recommended, even for non-parallel usage)

The parallel version with HDF5 support drastically reduces I/O and computation time without changing the output in any way. To set it up, make sure h5py and Cython are in your PYTHONPATH. You can install the packages via pip:

pip install h5py
pip install Cython
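To confirm that both packages are visible to the interpreter you intend to use, you can query the import system without importing them. This is a minimal sketch; `missing_modules` is a hypothetical helper, not part of PconsC3.

```python
import importlib.util


def missing_modules(names=("h5py", "Cython")):
    """Return the top-level module names from `names` that cannot be found."""
    # find_spec returns None when a top-level module is not importable.
    return [name for name in names if importlib.util.find_spec(name) is None]
```

An empty list from `missing_modules()` means both packages were found.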

Then you need to convert the forest data in your PconsC3 root directory into HDF5-files:

cd <PconsC3 root directory>
python convert_to_hdf5.py .

After successful conversion you can safely remove the folders containing the forest data:

find tlayer* ! -name '*.hdf5' -type d -exec rm -r {} +

And finally compile the Cython script:

python setup.py build_ext -i

After that you can run the fast version of PconsC3:

./predict-parallel-hdf5.py myprotein.gdca myprotein.0.02.plm20 external.RR netsurf.rsa psipred.ss2 myprotein.stats myprotein.fas outputfile [NumberThreads]

Making PconsC3 run faster

There are a few parameters in ./predict.py that can be tweaked, notably:

Help and Support

If you run into any problems with the software or observe it performing worse than expected, we would appreciate an email to Marcin J. Skwark (firstname@lastname.pl or firstname.middleinitial.lastname@vanderbilt.edu).