
CFT: Clonal Family Tree

A pipeline for producing clonal family trees and ancestral state reconstructions using partis output.

Output data can be run through cftweb for visualization and exploration.

Note

While this package is very useful to us, the documentation and usability are not really sufficient for widespread use. We are instead making this repository publicly available to help with openness and reproducibility for the papers in which we've used it. That said, if you'd be interested in running it and are having trouble getting it working, please submit an issue and we'd be happy to help.

Input data

For each sample you'd like to process, cft needs to know a few things, all of which appear in the example below: the locus, the partis parameter directory, a per-sequence metadata file, and the partition file(s), including any seeded partition runs.

CFT requires that you organize this information in a dataset file as follows:

# The dataset-id identifies the collection of samples in downstream organization
id: laura-mb-v14
samples:
  # each sample must be keyed by a per-dataset unique identifier
  Hs-LN-D-5RACE-IgG:
    locus: igh
    parameter-dir: /path/to/Hs-LN-D-5RACE-IgG/parameter-dir
    per-sequence-meta-file: /path/to/Hs-LN-D-5RACE-IgG/seqmeta.csv

    # Unseeded partitions go here
    partition-file: /path/to/Hs-LN-D-5RACE-IgG/partition.csv

    # seed partition runs should be organized under a `seeds` key as follows
    seeds:
      # seed sequence id
      BF520.1-igh:
        partition-file: /path/to/Hs-LN-D-5RACE-IgG/partition.csv
      # other seeds, as applicable...

  # another sample in our dataset...
  Hs-LN-D-5RACE-IgK:
    # etc.

  # etc.

Some notes about this:

A more fleshed-out example, as well as JSON examples and Python snippets, can be found on the wiki.

You may also wish to take a look at bin/dataset_utils.py, a small utility script for filtering and merging dataset files; you can get a comprehensive help menu by running bin/dataset_utils.py -h. You may also wish to directly use the script which does the initial extraction of data from partis, bin/process_partis.py.
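For example (the -h flag for bin/process_partis.py is an assumption, based on the same convention as bin/dataset_utils.py):

# comprehensive help for the dataset filtering/merging utility
bin/dataset_utils.py -h
# help for the partis extraction script (assuming it exposes the same -h convention)
bin/process_partis.py -h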

Note that in order for the data to process correctly, the following must be true of the naming scheme for sequences:

Final note: if your partition-file is a CSV file, you will also need to keep around the corresponding *-cluster-annotations.csv file generated by partis, make sure it's in the same directory as the partition-file, and name it to match (if partition-file: partition.csv, then the cluster annotation file should be at partition-cluster-annotations.csv).
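For instance, using the illustrative paths from the example above, the directory containing a CSV partition file would look something like this:

# illustrative layout, assuming partition-file: partition.csv
ls /path/to/Hs-LN-D-5RACE-IgG/
#   partition.csv
#   partition-cluster-annotations.csv   # partis cluster annotations, named to match the partition file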

Running the pipeline

Note: Before you run the pipeline, you must set up the build environment as described in the Setting up the environment section below.

CFT uses the scons build tool to execute the build process. Running scons from within the cft checkout directory loads the SConstruct file, which specifies how data is to be processed:

Running scons without modifying the SConstruct will run default tests on the partis output in tests/. To check that the output thereby produced matches the expected test output, run diff -ubr --exclude='*metadata.json' tests/test-output output
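Concretely, a test run looks like this, executed from the cft checkout directory:

# run the default test build (uses the partis output in tests/)
scons
# check the result against the expected test output, ignoring metadata.json files
diff -ubr --exclude='*metadata.json' tests/test-output output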

This particular SConstruct takes several command line parameters. Below are the most frequently used options; each must be passed with an =, in the form --option=value:

A separate "dataset" directory and corresponding metadata.json file will be created for each infile and placed within the output directory, organized by the id attribute of the dataset infile. For the most complete and up to date reference on these, look at the tail Local Options section of scons -h.

You may also wish to take note of the following basic scons build options:

In general, it's good to run with -k so that on a first pass you build as much of the data as you can. If there are errors, try rerunning to make sure the problem isn't just an errant memory issue on your cluster, then look back at the logs and see whether you can debug the issue. If only a few clusters fail to build properly and you don't want to hold off on getting the rest of the built data into cftweb, you can rerun the build with -i. This will take a little longer as it works through all of the failed build branches with missing files and the like, but it should successfully compile the final output metadata.json files necessary for passing along to cftweb.
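A sketch of that two-pass strategy (the infile name and job count are placeholders):

# first pass: keep going past errors (-k) so as much data as possible gets built
scons --infiles=info1.yaml -k -j 12
# after inspecting the logs and ruling out transient failures, force through the
# remaining broken branches (-i) so the final metadata.json files get compiled
scons --infiles=info1.yaml -i -j 12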

Typical example usage

# If you're using conda, as below, first activate the environment
source activate cft

# Build the data, running 12 jobs at a time (parallelism) and appending all stdout/stderr to a log file
scons --infiles=info1.yaml:info2.yaml -k -j 12 --debug explain &>> 2018-05-24.info1-build.log

# You can watch a live tail of the log file from another terminal window or tmux pane with
tail -f 2018-05-24.info1-build.log

# Once it's done running, you can take a look at the output
tree output
# or if you don't have tree
find output

Note that you can install tree with sudo apt-get install tree on Ubuntu for a nice ASCII-art file tree display of the output contents.

Setting up the environment

  1. Install conda.
  2. Run conda create -y -c bioconda -c conda-forge --name cft --file requirements.txt (steps 2-5 are collected in the sketch after this list).
  3. Activate the environment.
  4. Make sure you have cloned the git submodules (see below).
  5. Follow instructions below for submodules and slurm.
  6. Install the partis submodule.
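Collecting those steps, a minimal end-to-end setup might look like the following (paths are placeholders; the final step of installing partis itself is not shown, so see the partis documentation for that):

# steps 2-3: create and activate the conda environment
conda create -y -c bioconda -c conda-forge --name cft --file requirements.txt
source activate cft
# steps 4-5: check out the git submodules and point PARTIS at the submodule checkout
git submodule init
git submodule update
export PARTIS=/path/to/cft/partis
# step 6: build/install partis per its own documentation (not shown here)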

Git submodules

Finally, some Python code needed by the build script lives in a number of git submodules. In particular, this repository has a partis submodule which should be kept in sync to avoid build issues.

  1. Check out these submodules: execute git submodule init then git submodule update.
  2. Set the PARTIS env variable: run export PARTIS=/path/to/cft/partis using the path to your submodule install or another install of partis.
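If you want the PARTIS variable to persist across shells, you could add the export to your shell profile (a hedged suggestion; the path is a placeholder):

# append to ~/.bashrc (or your shell's equivalent) so PARTIS is set in new shells
echo 'export PARTIS=/path/to/cft/partis' >> ~/.bashrc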

More info on how to use git submodules can be found in the git documentation.

Using Slurm

The build pipeline is set up to use Slurm for job submission for a number of the more compute-heavy, long-running tasks. If you have a Slurm environment set up to submit to a cluster, and are able to write from Slurm nodes to a shared filesystem, you can potentially run with significantly higher parallelism than you could on a single computer.

If you are running on Fred Hutch's servers, this should all be set up for you, and you should be able to submit upwards of 50-70 jobs using the -j flag, as in the example above. If you're not at the Hutch, setting up such a cluster is way out of scope for this document, but if you're inspired, good luck figuring it out!
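On such a cluster, a higher-parallelism build might look like this (a sketch reusing the flags from the earlier example; the infile name, job count, and log file name are placeholders):

# run many Slurm-backed jobs at once, logging stdout/stderr as before
scons --infiles=info1.yaml -k -j 60 &>> info1-build.log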

Visualization

Once the data is built, you can consume the fruits of this labor by passing the data off to Olmsted.