koszullab / instaGRAAL

Large genome reassembly based on Hi-C data, continuation of GRAAL
https://research.pasteur.fr/fr/software/graal-software-for-genome-assembly-from-chromosome-contact-frequencies/
GNU General Public License v3.0

instaGRAAL


Large genome reassembly based on Hi-C data (continuation and partial rewrite of GRAAL; Marie-Nelly et al., 2013; 2014) and post-scaffolding polishing libraries.

This work is under continuous development/improvement - see GRAAL for information about the basic principles.

You can now easily install instaGRAAL using the Docker container described below, or try it on Galaxy Europe.


Installation

Install from PyPI:

    sudo pip3 install -U instagraal

or, if you want to get the very latest version:

    sudo pip3 install -e git+https://github.com/koszullab/instagraal.git@master#egg=instagraal

This should automatically handle most dependencies.
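
To confirm that the install landed in the Python environment you intend to use, a minimal sketch along these lines can help. It assumes only that the distribution is named `instagraal`, as in the pip command above; the helper name `installed_version` is made up for illustration.

```python
from importlib import metadata


def installed_version(package="instagraal"):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    if version is None:
        print("instagraal is not installed in this environment")
    else:
        print(f"instagraal {version} is installed")
```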

Requirements

The scaffolder and polishing libraries are written in Python 3 and CUDA. As such, an NVIDIA graphics card is required for the scaffolder to run. The Python 2 version is available on the python2 branch of this repository, but be aware that development will mainly focus on the Python 3 version. The software has been tested on Ubuntu 17.04 and later, and most dependencies can be downloaded with its package manager (or Python's pip).

External libraries

You will need to download and install the NVIDIA CUDA toolkit. Manual installation is recommended - installing nvidia-cuda-toolkit from Ubuntu's package manager has been known to cause glitches. It is fairly straightforward on OS X thanks to the installation wizard. Here is how to quickly do it on Ubuntu 18.04:

    wget https://developer.nvidia.com/compute/cuda/10.0/Prod/local_installers/cuda_10.0.130_410.48_linux
    chmod +x cuda_10.0.130_410.48_linux
    sudo ./cuda_10.0.130_410.48_linux

Note to Ubuntu users: Be aware that the installation script will fail if it isn't run as root, or if a graphical instance (e.g. X) is running as well. You may need to temporarily shut it down, for instance by switching to tty1 and running the following (prior to the installation script):

    sudo service lightdm stop

(Replace lightdm with mdm, gdm or whichever login manager is present on your machine if that fails; if all else fails as well, you may have to run something like sudo pkill Xorg instead.)

Note to OS X users: There is currently no CUDA support on Mojave (10.14) and it is unclear when it is going to be added, if it is to be added at all. This means instaGRAAL (or indeed any CUDA-based application) will not work on Mojave. If you wish to run it on OS X, the only solution for now is to downgrade to High Sierra (10.13).

Recommended libraries

Because some Python dependencies (such as pyopengl or h5py) need to be built against specific system libraries, it is recommended that you install the following packages if you encounter errors.

OpenGL libraries
HDF5 serialization library
Boost libraries

Python dependencies

Python package requirements should be handled automatically by pip. Should you wish to install them manually, they are listed in the requirements file supplied in the repo and can be installed with:

    pip3 install -Ur requirements.txt

You will also need to build pycuda with OpenGL support and disable its use of custom Boost libraries. Installing it directly from PyPI will cause errors at runtime. Here is how to do it manually with Git on Ubuntu or OS X:

    git clone --recurse-submodules https://github.com/inducer/pycuda.git
    cd pycuda
    python3 configure.py --cuda-enable-gl --no-use-shipped-boost
    sudo python3 setup.py install

You may run (as root) instagraal-setup, an all-in-one script to handle all the above dependencies on Ubuntu 17+.

Container

There is experimental Docker support for instaGRAAL. You may fetch the corresponding image by running the following:

    docker pull koszullab/instagraal

And run it with

    docker run --gpus all koszullab/instagraal

Note: Running the container requires nvidia-docker2 to be installed.

Usage

Unlike GRAAL, this is meant to be run from the command line.

instagraal <hic_folder> <reference.fa> [<output_folder>]
           [--level=4] [--cycles=100] [--coverage-std=1]
           [--neighborhood=5] [--device=0] [--circular] [--bomb]
           [--save-matrix] [--pyramid-only] [--save-pickle] [--simple]
           [--quiet] [--debug]
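
When instaGRAAL is driven from a pipeline script, it can be convenient to assemble the command line programmatically. The sketch below is only an illustration: `build_instagraal_command` is a hypothetical helper, not part of the package, and it uses nothing beyond the arguments shown in the synopsis above.

```python
import shlex


def build_instagraal_command(hic_folder, reference, output_folder=None,
                             level=4, cycles=100, coverage_std=1,
                             neighborhood=5, device=0, circular=False):
    """Assemble an instagraal argument list from the documented options."""
    cmd = ["instagraal", str(hic_folder), str(reference)]
    if output_folder is not None:
        cmd.append(str(output_folder))
    cmd += [
        f"--level={level}",
        f"--cycles={cycles}",
        f"--coverage-std={coverage_std}",
        f"--neighborhood={neighborhood}",
        f"--device={device}",
    ]
    if circular:
        cmd.append("--circular")
    return cmd


cmd = build_instagraal_command("hic_folder", "reference.fa", "output", level=5)
print(shlex.join(cmd))  # printable form, handy for logging
```

The returned list can be handed to `subprocess.run(cmd, check=True)` on a machine where instaGRAAL and CUDA are installed.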

Options

-h, --help              Display this help message.
--version               Display the program's current version.
-l 4, --level 4         Level (resolution) of the contact map.
                        Increasing level by one means a threefold smaller
                        resolution but also a threefold faster computation
                        time. [default: 4]
-n 100, --cycles 100    Number of iterations to perform for each bin
                        (row/column of the contact map). A high number of
                        cycles has diminishing returns but there is a
                        necessary minimum for assembly convergence.
                        [default: 100]
-c 1, --coverage-std 1  Number of standard deviations below the mean
                        coverage, below which fragments should be filtered
                        out prior to binning. [default: 1]
-N 5, --neighborhood 5  Number of neighbors to sample for potential
                        mutations for each bin. [default: 5]
--device 0              If multiple graphic cards are available, select
                        a specific device (numbered from 0). [default: 0]
-C, --circular          Indicates genome is circular. [default: False]
-b, --bomb              Explode the genome prior to scaffolding.
                        [default: False]
--pyramid-only          Only build multi-resolution contact maps (pyramids)
                        and don't do any scaffolding. [default: False]
--save-pickle           Dump all info from the instaGRAAL run into a
                        pickle. Primarily for development purposes, but
                        also for advanced post hoc introspection.
                        [default: False]
--save-matrix           Saves a preview of the contact map after each
                        cycle, in csv format. [default: False]
--simple                Only perform operations at the edge of the contigs.
                        [default: False]
--quiet                 Only display warnings and errors as outputs.
                        [default: False]
--debug                 Display debug information. For development purposes
                        only. Mutually exclusive with --quiet, and will
                        override it. [default: False]
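
To make the --coverage-std description more concrete, here is a small conceptual sketch of filtering at "mean minus N standard deviations". It is not instaGRAAL's actual implementation: the function names, the use of the population standard deviation and the coverage values are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def coverage_threshold(coverages, n_std=1.0):
    """Return mean - n_std * standard deviation of the coverage values."""
    # Assumption: population standard deviation; the real tool may differ.
    return mean(coverages) - n_std * pstdev(coverages)


def filter_low_coverage(coverages, n_std=1.0):
    """Return indices of fragments whose coverage is at or above the threshold."""
    threshold = coverage_threshold(coverages, n_std)
    return [i for i, cov in enumerate(coverages) if cov >= threshold]


coverages = [10, 12, 11, 1, 13, 9]  # made-up per-fragment coverage values
kept = filter_low_coverage(coverages, n_std=1)
print(kept)  # the fragment at index 3 falls below the threshold
```

A larger value keeps more fragments, since the threshold moves further below the mean.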

Input datasets

Format specification

The above <hic_folder> passed as an argument to instaGRAAL needs three files:

All fields (including those in the files' headers) must be separated by tabs.
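
Since stray spaces in place of tabs are easy to miss by eye, a quick check before launching a run may save time. The sketch below makes two assumptions that are not stated in this readme: that the first non-empty line is a header, and that every file has at least two columns. `check_tab_separated` is a hypothetical helper.

```python
def check_tab_separated(lines):
    """Return (line_number, problem) tuples for lines that look malformed."""
    problems = []
    expected = None
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue  # ignore blank lines
        n_fields = len(line.split("\t"))
        if expected is None:
            expected = n_fields  # the header sets the expected column count
            if n_fields < 2:
                problems.append((number, "header contains no tab"))
            continue
        if n_fields != expected:
            problems.append(
                (number, f"expected {expected} fields, found {n_fields}")
            )
    return problems


example = ["a\tb\tc\n", "1\t2\t3\n", "4 5\t6\n"]  # made-up content
print(check_tab_separated(example))
```

An open file handle can be passed directly, e.g. `with open(path) as handle: check_tab_separated(handle)`.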

Minimal working templates are provided in the example folder.

Matrix generation

If you want to generate instaGRAAL-compatible matrices from scratch (i.e. from reads and a reference genome, as opposed to existing Hi-C data in one of the numerous existing formats), you may do so with hicstuff, which acts as both a Python library and a pipeline. Instructions, parameters and optional arguments are detailed in the repo's readme. We strongly recommend using hicstuff with the parameter -m iterative or -m cutsite to improve mapping.

Output

After the scaffolder is done running, whatever path you specified as output will contain a test_mcmc_X directory, where X is the level (resolution) at which scaffolding was performed. This directory, in turn, will contain the following:

Other files are mostly for developmental purposes and keep track of the evolution of various metrics and model parameters.
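
In scripts that post-process a run, the results directory can be derived from the output path and the level, following the naming described above. This is a minimal sketch; `mcmc_output_dir` is a hypothetical helper and the folder name "output" is a placeholder.

```python
from pathlib import Path


def mcmc_output_dir(output_folder, level=4):
    """Return the test_mcmc_<level> directory inside the output folder."""
    return Path(output_folder) / f"test_mcmc_{level}"


result_dir = mcmc_output_dir("output", level=5)
print(result_dir)
```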

Curation

This step is strongly recommended to improve the quality of your scaffolds, unless your input contigs have many misassemblies. Lingering artifacts found in output genomes can be corrected by editing the info_frags.txt file, either by hand or with a script. Look at options by running the following:

    instagraal-polish -h

The most common use case is to run all curation procedures at once:

    instagraal-polish -m polishing -i info_frags.txt -f contigs.fasta -o curated_assembly.fa

You can add gaps with the parameter -j (necessary for subsequent gap filling). For instance, to add gaps of 10 Ns:

    instagraal-polish -m polishing -i info_frags.txt -f contigs.fasta -o curated_assembly.fa -j NNNNNNNNNN
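
If the curation step is part of a scripted workflow, the polishing command can be built in the same spirit. The sketch below only reuses the flags shown in the two commands above; `build_polish_command` and its `gap_length` parameter are hypothetical names introduced for illustration.

```python
def build_polish_command(info_frags, fasta, output, gap_length=0):
    """Assemble an instagraal-polish argument list for the polishing mode."""
    cmd = [
        "instagraal-polish", "-m", "polishing",
        "-i", str(info_frags),
        "-f", str(fasta),
        "-o", str(output),
    ]
    if gap_length > 0:
        # -j takes the gap sequence itself, here a run of Ns
        cmd += ["-j", "N" * gap_length]
    return cmd


cmd = build_polish_command(
    "info_frags.txt", "contigs.fasta", "curated_assembly.fa", gap_length=10
)
print(" ".join(cmd))
```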

Troubleshooting

"I am not happy with the scaffolds"

If the output is not as you would expect:

Scaffolding is too slow

By default, the parameter --level is set to 4. For genomes larger than 500 Mb, increasing it to 5 is often better suited and improves runtime; for genomes larger than 3 Gb, use 6.
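
The rule of thumb above, combined with the threefold speed-up per level mentioned in the options, can be sketched as follows. Treat it as a starting point rather than a recommendation engine: both function names are hypothetical, and 1 Mb is taken to mean 10^6 bp.

```python
def suggest_level(genome_size_bp):
    """Suggest a --level value from the genome size in base pairs."""
    if genome_size_bp > 3_000_000_000:  # larger than 3 Gb
        return 6
    if genome_size_bp > 500_000_000:  # larger than 500 Mb
        return 5
    return 4  # the default


def relative_speedup(level, baseline=4):
    """Approximate speed-up relative to the baseline level (threefold per level)."""
    return 3 ** (level - baseline)


for size in (120_000_000, 800_000_000, 3_200_000_000):
    level = suggest_level(size)
    print(size, level, relative_speedup(level))
```

Keep in mind that each extra level also makes the contact map resolution threefold coarser.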

KeyError on contig names

This is due to spaces and special characters in contig names. Check that the contig names match the ones in the outputs from hicstuff, and rename your contigs if necessary.
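
One way to rename contigs is to rewrite the FASTA headers before generating the matrices, so that every downstream file sees the same names. The sketch below replaces anything outside a conservative character set with an underscore. That character set is an assumption, not a documented instaGRAAL requirement, and the helper names are made up.

```python
import re

# Assumption: letters, digits, underscore, dot and hyphen are safe.
UNSAFE = re.compile(r"[^A-Za-z0-9_.-]+")


def sanitize_contig_name(name):
    """Replace runs of unusual characters with a single underscore."""
    return UNSAFE.sub("_", name.strip()).strip("_")


def sanitize_fasta_lines(lines):
    """Yield FASTA lines with sanitized headers; sequence lines are untouched."""
    for line in lines:
        if line.startswith(">"):
            yield ">" + sanitize_contig_name(line[1:]) + "\n"
        else:
            yield line


example = [">contig 1|len=500\n", "ACGT\n"]  # made-up record
print("".join(sanitize_fasta_lines(example)), end="")
```

Two different names could collapse to the same sanitized name, so it is worth checking for duplicates afterwards.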

Loading CUDA libraries

If you encounter the following error, despite having installed the NVIDIA CUDA Toolkit:

    ImportError: libcurand.so.9.2: cannot open shared object file: No such file or directory

it probably means the CUDA-related libraries haven't been properly added to your $PATH and $LD_LIBRARY_PATH for some reason. A quick solution is to add the following at the end of your .bashrc or .bash_profile (replace the paths with wherever you installed the toolkit and change the version number accordingly):

    export PATH=/usr/local/cuda-9.2/bin${PATH:+:${PATH}}
    export LD_LIBRARY_PATH=/usr/local/cuda-9.2/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
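
To check whether the exports took effect, you can look for the library in the directories listed in LD_LIBRARY_PATH. The sketch below is a plain directory scan, not how the dynamic loader actually resolves libraries (it ignores the system cache, for instance); `find_shared_library` is a hypothetical helper.

```python
import os
from pathlib import Path


def find_shared_library(prefix, search_path):
    """Return files starting with `prefix` in os.pathsep-separated directories."""
    matches = []
    for directory in search_path.split(os.pathsep):
        if not directory:
            continue  # skip empty entries such as a trailing separator
        folder = Path(directory)
        if folder.is_dir():
            matches.extend(
                sorted(p for p in folder.iterdir() if p.name.startswith(prefix))
            )
    return matches


if __name__ == "__main__":
    found = find_shared_library(
        "libcurand.so", os.environ.get("LD_LIBRARY_PATH", "")
    )
    print(found or "libcurand not found on LD_LIBRARY_PATH")
```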

Remote running

If you encounter the following error:

    freeglut (instagraal.py): failed to open display ''

it most likely means you attempted to run an instaGRAAL instance remotely (e.g. over ssh) but didn't configure a proper $DISPLAY variable. In order to avoid this, simply run the following beforehand:

    export DISPLAY=:0

Note that this will disable the movie (it will play on the remote machine instead).

However, instaGRAAL is based on OpenGL, which means there has to be an X server of some kind running on your target machine no matter what. While this allows for pretty movies and visualizations, it may prove problematic in an environment you don't have total control over, e.g. a server cluster. Currently, your best bet is asking the system administrator of the target machine to set up an X instance (possibly virtual, such as Xvfb or xserver-xorg-video-dummy) if they haven't already.
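
A wrapper script can fail early with a clearer message when $DISPLAY is missing, rather than waiting for freeglut to complain. The sketch below only checks that the variable is set and non-empty; it does not verify that an X server is actually listening. `display_is_configured` is a hypothetical helper.

```python
import os


def display_is_configured(environ=None):
    """Return True if a non-empty DISPLAY variable is present."""
    environ = os.environ if environ is None else environ
    return bool(environ.get("DISPLAY", "").strip())


if __name__ == "__main__":
    if display_is_configured():
        print("DISPLAY is set to", os.environ["DISPLAY"])
    else:
        print("DISPLAY is not set; try 'export DISPLAY=:0' first")
```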

PyOpenGL/GLUT error

If you encounter the following:

    NullFunctionError: Attempt to call an undefined function glutInit, check for bool(glutInit) before calling

check whether you have installed freeglut3-dev. It seems that the pyopengl library does not include a GLUT implementation when installed from PyPI. Alternatively, just installing pyopengl with your package manager (e.g. python3-pyopengl on Ubuntu) seems to work as well.

Codepy toolchain

If you encounter an error like the following:

      File "/usr/local/lib/python3.6/dist-packages/codepy/toolchain.py", line 382, in _guess_toolchain_kwargs_from_python_config
        object_suffix = '.' + make_vars['MODOBJS'].split()[0].split('.')[1]
    IndexError: list index out of range

you may need to upgrade to a more recent version of codepy:

    sudo pip3 install --upgrade --no-cache-dir -e git+https://github.com/inducer/codepy.git@master#egg=codepy

No such error has been found as of commit 10a014f, so if you encounter regressions after this, you should stick to that version.

Depending on your system, you may also need to upgrade to gcc/g++ 8:

    sudo apt install gcc-8 g++-8

If for some reason your system does not automatically switch to gcc/g++-8, you should manually configure your system to do so, e.g. on Ubuntu:

    sudo update-alternatives --remove-all gcc 
    sudo update-alternatives --remove-all g++

    sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-8 10
    sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-8 10

General tips

Documentation

As a Python package, instaGRAAL provides scaffolding and polishing libraries as well as a convenient Hi-C matrix handling framework, and we've tried to expose much of the API behind these on readthedocs. If you wish to know more about how the scaffolder works, see the references, especially the supplementary method, which delves deeper into the details of the model.

References

Principle

Use cases

Contact