PconsC3: a faster, more accurate, and entirely open-source method for predicting contacts in proteins
If you use PconsC3, please cite:
If Julia, Python, and CD-HIT are in your search path, you are set to go. Otherwise, you need to either add them to the path or modify the scripts that call them:
For Julia: rungdca.py and runplm.py
For CD-HIT: alignmentstats.py
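If the tools are installed but not on your search path, you can prepend their directories before running the scripts. The install locations below are purely hypothetical; substitute wherever the tools live on your system:

```shell
# Example only: prepend hypothetical install directories for Julia and CD-HIT.
# Replace these paths with the actual locations on your system.
export PATH="$HOME/tools/julia/bin:$HOME/tools/cdhit:$PATH"

# Sanity check: each tool should now resolve (prints its full path if found).
command -v julia || echo "julia still not on PATH"
command -v cd-hit || echo "cd-hit still not on PATH"
```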
git clone https://github.com/mskwark/PconsC3.git
Before the first run, download and unpack the trained Random Forests into the same directory as the PconsC3 code. You should end up with six subdirectories named tforest0, tforest1, ..., tforest5. You can get them from Google Drive or a local mirror (378 MiB). If you just want to give PconsC3 a try, or for some other reason need a smaller archive, download the mini-version from Google Drive or a local mirror (38 MB); be advised that this version may not perform as well as the fully-fledged one (but will be roughly 10x faster!).
> tar -xJf pconsc3-forests.tar.xz
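A quick way to check that the archive unpacked correctly is to look for the six directories mentioned above:

```shell
# Report any of the six expected forest directories that are missing.
# Run this in the directory where you unpacked the archive.
for i in 0 1 2 3 4 5; do
  [ -d "tforest$i" ] || echo "missing tforest$i"
done
```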
You may want to put them on a fast filesystem (on a relatively recent Linux machine /dev/shm/ is a good choice), and by default PconsC3 will look for them there (i.e. it will check whether /dev/shm/tforest0 etc. exist and are sane). As a fallback it will look in the directory where ./predict.py is located. If you want to change this, modify the forestlocation variable at the top of ./predict.py.
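The lookup order described above can be sketched as follows. The authoritative setting is the forestlocation variable in ./predict.py; this helper is purely illustrative:

```python
import os

def find_forest_dir(code_dir=".", n_forests=6):
    """Return the first directory containing tforest0..tforest5.

    Mirrors the search order described above: /dev/shm/ first, then
    the directory holding predict.py. Illustrative sketch only.
    """
    for base in ("/dev/shm", code_dir):
        if all(os.path.isdir(os.path.join(base, "tforest%d" % i))
               for i in range(n_forests)):
            return base
    raise FileNotFoundError("trained forests not found; "
                            "download and unpack them first")
```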
To run PconsC3, you need to have at hand:
- a multiple sequence alignment of your protein (FASTA format)
- contact priors from an external predictor (RR format)
- a secondary structure prediction (PSIPRED .ss2 format)
- a relative solvent accessibility (RSA) prediction
You can name these files any way you want, but assuming your alignment is named myprotein.fas, your contact priors are named external.RR, your secondary structure prediction file is named psipred.ss2, and your RSA file is named netsurf.rsa, run the prediction as follows.
Infer evolutionary couplings with GaussDCA:
./rungdca.py myprotein.fas
It will produce a file named myprotein.gdca
Infer evolutionary couplings with plmDCA.jl:
./runplm.py myprotein.fas
It will produce a file named myprotein.0.02.plm20
Compute alignment statistics:
./alignmentstats.py myprotein.fas
It will produce a file named myprotein.stats
Run PconsC3:
./predict.py myprotein.gdca myprotein.0.02.plm20 external.RR netsurf.rsa psipred.ss2 myprotein.stats myprotein.fas outputfile
This will run for a while, but it will provide you with estimates of the running time. It generates a number of intermediate files, outputfile.l0, outputfile.l1, ..., outputfile.l5, plus an outputfile.RR containing the final predictions in RR format (by default only non-local predictions are output).
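The four steps above can be chained with a small driver script. Everything here is a sketch assembled from the commands shown above; the file-name conventions (.gdca, .0.02.plm20, .stats) follow the outputs documented for each step:

```python
import subprocess

def build_commands(fasta, priors_rr, rsa, ss2, outfile):
    """Assemble the four PconsC3 commands described above (sketch)."""
    stem = fasta.rsplit(".", 1)[0]
    return [
        ["./rungdca.py", fasta],            # -> stem.gdca
        ["./runplm.py", fasta],             # -> stem.0.02.plm20
        ["./alignmentstats.py", fasta],     # -> stem.stats
        ["./predict.py", stem + ".gdca", stem + ".0.02.plm20",
         priors_rr, rsa, ss2, stem + ".stats", fasta, outfile],
    ]

def run_pipeline(fasta, priors_rr, rsa, ss2, outfile):
    """Run the four steps in order, stopping on the first failure."""
    for cmd in build_commands(fasta, priors_rr, rsa, ss2, outfile):
        subprocess.check_call(cmd)
```

For the example file names used above, you would call run_pipeline("myprotein.fas", "external.RR", "netsurf.rsa", "psipred.ss2", "outputfile") from the PconsC3 directory.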
The parallel version with HDF5 support drastically reduces I/O and computation time without changing the output in any way. To set it up, make sure h5py and Cython are available in your Python environment. You can install both packages via pip:
pip install h5py
pip install Cython
Then you need to convert the forest data in your PconsC3 root directory into HDF5 files:
cd <PconsC3 root directory>
python convert_to_hdf5.py .
After successful conversion you can safely remove the folders containing the forest data:
find tlayer* ! -name '*.hdf5' -type d -exec rm -r {} +
And finally compile the Cython script:
python setup.py build_ext -i
After that you can run the fast version of PconsC3:
./predict-parallel-hdf5.py myprotein.gdca myprotein.0.02.plm20 external.RR netsurf.rsa psipred.ss2 myprotein.stats myprotein.fas outputfile [NumberThreads]
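Whichever version you run, the final predictions land in outputfile.RR. A minimal parser for the pair lines might look like the sketch below; it assumes the standard CASP RR column layout (residue i, residue j, distance bounds, score), so check it against your own output before relying on it:

```python
def parse_rr(path):
    """Parse contact pair lines from an RR-format file.

    Assumes each pair line reads: i j d_min d_max score
    (standard CASP RR convention); header and sequence lines,
    which have a different field count, are skipped.
    Returns (i, j, score) tuples sorted by descending score.
    """
    pairs = []
    with open(path) as fh:
        for line in fh:
            fields = line.split()
            if len(fields) != 5:
                continue  # header, sequence, or END line
            try:
                i, j = int(fields[0]), int(fields[1])
                score = float(fields[4])
            except ValueError:
                continue  # non-numeric line that happened to have 5 fields
            pairs.append((i, j, score))
    return sorted(pairs, key=lambda p: -p[2])
```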
There are a few parameters in ./predict.py that can be tweaked, notably:
maxtime -- the maximum time in seconds spent on a single prediction layer (total prediction time will be at most 6x maxtime plus time spent on I/O). For larger proteins, setting maxtime too low may result in sub-par performance.
treefraction -- the fraction of trees to be used. While we recommend leaving treefraction set to 1.0, benchmarks have demonstrated satisfactory performance at values as low as 0.3, and, for proteins with a lot of sequence information, as low as 0.1.
If you run into any problems with the software, or observe it performing poorer than expected, we would appreciate an email to Marcin J. Skwark (firstname@lastname.pl or firstname.middleinitial.lastname@vanderbilt.edu).