phbradley / tcr-dist

Software tools for the analysis of epitope-specific T cell receptor (TCR) repertoires

TCRdist pipeline, version 0.0.2 beta

Thanks for your interest in TCRdist! This software is very much a work in progress, so we welcome any/all feedback and will work to add features and fix bugs as quickly as we can.

If you would like to contribute to the project on GitHub, feel free to contact us.

Phil Bradley (pbradley@fredhutch.org)
Jeremy Chase Crawford (Jeremy.Crawford@stjude.org)

See LICENSE.txt for details on the (MIT) license.


REQUIREMENTS

System:

Python version

Package dependencies

External command line tools

NCBI Blast


INSTALLATION

1) Go to the tcr-dist/ directory (main code directory)

2) Run the command:

     python setup.py

3) Cross your fingers.

There are some potentially useful comments at the top of setup.py


USAGE

For an overview of what the analysis is supposed to do, consult the publication listed below in the "CITING" section.

The basic workflow starting from a sequence file would be to run

     python run_basic_analysis.py --organism <organism> --pair_seqs_file <filename>

where <organism> is either mouse or human, and <filename> is the name of a .tsv (tab-separated values) file with the following fields:

id  epitope subject a_nucseq    b_nucseq    a_quals b_quals

id: Unique identifier for the TCR

epitope: The name of the epitope to which the TCR binds

subject: The individual from whom the TCR was sampled; necessary for identifying clones and for analyzing repertoire privacy.

a_nucseq, b_nucseq: The nucleotide sequences of the TCR alpha and beta chain reads

a_quals, b_quals: Read quality information, given as '.'-separated lists of quality scores (e.g., 25.32.35.36.45.45.36) for the corresponding nucleotide sequences (a_nucseq and b_nucseq). Used for read quality filtering.
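
As an illustration of the file layout described above, here is a minimal Python sketch that writes such a .tsv file. It is not part of the pipeline: the helper names (format_quals, parse_quals, write_pair_seqs) are hypothetical, and it assumes that the field names appear as a header row and that there is one quality score per nucleotide.

```python
import csv
import io

# Column order taken from the field list above.
FIELDS = ["id", "epitope", "subject", "a_nucseq", "b_nucseq", "a_quals", "b_quals"]


def format_quals(scores):
    """Join quality scores into the '.'-separated form, e.g. 25.32.35."""
    return ".".join(str(int(s)) for s in scores)


def parse_quals(text):
    """Inverse of format_quals: '25.32.35' -> [25, 32, 35]."""
    return [int(tok) for tok in text.split(".")] if text else []


def write_pair_seqs(handle, records):
    """Write records (dicts keyed by FIELDS) as tab-separated values."""
    writer = csv.DictWriter(
        handle, fieldnames=FIELDS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()  # assumption: the file starts with a header row
    for rec in records:
        for chain in ("a", "b"):
            seq = rec[chain + "_nucseq"]
            quals = parse_quals(rec[chain + "_quals"])
            # Assumption: one quality score per nucleotide.
            if len(quals) != len(seq):
                raise ValueError("quality/sequence length mismatch: %s" % rec["id"])
        writer.writerow(rec)


# Placeholder record; the sequences are made up and far shorter than real reads.
example = {
    "id": "tcr1",
    "epitope": "E1",
    "subject": "s1",
    "a_nucseq": "ACG",
    "b_nucseq": "TTGA",
    "a_quals": format_quals([25, 32, 35]),
    "b_quals": format_quals([36, 45, 45, 36]),
}
buffer = io.StringIO()
write_pair_seqs(buffer, [example])
```

To write to disk instead of an in-memory buffer, pass a file opened with open(path, "w", newline="") as the handle.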

The id, a_quals, and b_quals fields can be omitted; see the help message printed by running the command:

     python run_basic_analysis.py -h

The results of the analysis are summarized in an html file and an associated directory of images, both generated by the script. The default name (which can be changed with the --webdir option) will look something like <filename>_web/index.html
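
If you want to launch the analysis from another Python script, something along these lines may help. This is only a sketch: build_command and run_analysis are hypothetical helpers, not part of TCRdist, and the sketch assumes that --webdir takes a single directory name as its value and that "python" on your PATH is the interpreter the pipeline expects.

```python
import subprocess

VALID_ORGANISMS = ("mouse", "human")


def build_command(organism, pair_seqs_file, webdir=None, python="python"):
    """Return the argument list for run_basic_analysis.py."""
    if organism not in VALID_ORGANISMS:
        raise ValueError("organism must be one of %s" % (VALID_ORGANISMS,))
    cmd = [
        python,
        "run_basic_analysis.py",
        "--organism", organism,
        "--pair_seqs_file", pair_seqs_file,
    ]
    if webdir is not None:
        # Assumption: --webdir is followed by the output directory name.
        cmd += ["--webdir", webdir]
    return cmd


def run_analysis(organism, pair_seqs_file, webdir=None, code_dir="."):
    """Run the pipeline from code_dir (the tcr-dist/ directory).

    Raises subprocess.CalledProcessError if the script exits non-zero.
    """
    cmd = build_command(organism, pair_seqs_file, webdir=webdir)
    subprocess.run(cmd, cwd=code_dir, check=True)
```

For example, run_analysis("mouse", "my_tcrs.tsv") would be equivalent to typing the basic command shown earlier, with my_tcrs.tsv standing in for your own file.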

You can also run individual steps in the analysis from the command line. Running a script with the -h option will print a very basic help message listing the command line arguments. We are currently working hard to flesh out the help messages -- our apologies for the lack of clarity. Some scripts depend on the output of previous steps (for example compute_distances.py generates distance matrices which are used by downstream programs). The source code for run_basic_analysis.py gives an example of an appropriate order for calling the scripts.


TESTING

(After running setup.py)

In the tcr-dist/ directory you can run the command

     python create_and_run_testing_datasets.py

which will create a directory called testing/, make two small dataset files there, and run the analysis pipeline on them.

Running setup.py should have created a directory called testing_ref/ containing examples of the *_web/index.html output generated by the pipeline. If you compare the corresponding results files between

     testing/*_web/index.html

and

     testing_ref/*_web/index.html

they should look pretty similar. You can also search for the text "missing" in those html results to see whether any of the results files are missing from the testing/ version but present in the testing_ref/ version.
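
The search for "missing" can be scripted. The sketch below is one possible way to do it, not something shipped with the pipeline: the function names are hypothetical, and it assumes the *_web directories have the same names under testing/ and testing_ref/. It only counts occurrences of the literal text "missing"; it does not interpret the html.

```python
from pathlib import Path


def count_missing(html_text):
    """Count occurrences of the literal text 'missing'."""
    return html_text.count("missing")


def missing_report(results_dir):
    """Map each *_web directory name to its 'missing' count."""
    report = {}
    for index in sorted(Path(results_dir).glob("*_web/index.html")):
        text = index.read_text(encoding="utf-8", errors="replace")
        report[index.parent.name] = count_missing(text)
    return report


def compare(testing="testing", reference="testing_ref"):
    """Return entries with more 'missing' hits in testing than in reference.

    Values are (testing_count, reference_count) pairs.
    Assumption: both trees use the same *_web directory names.
    """
    new = missing_report(testing)
    ref = missing_report(reference)
    return {
        name: (count, ref.get(name, 0))
        for name, count in new.items()
        if count > ref.get(name, 0)
    }
```

Calling compare() from the tcr-dist/ directory after the test run returns an empty dict when no results page under testing/ mentions "missing" more often than its reference counterpart.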

You can also re-run the analysis in the original paper by typing:

     python rerun_paper_analysis.py

which may take a couple of hours. Add the --from_pair_seqs option to restart from nucleotide sequences, and/or the --multicore option to let the pipeline spawn multiple, independent processes.


CITING

Quantifiable predictive features define epitope-specific T cell receptor repertoires

Pradyot Dash, Andrew J. Fiore-Gartland, Tomer Hertz, George C. Wang, Shalini Sharma, Aisha Souquette, Jeremy Chase Crawford, E. Bridie Clemens, Thi H. O. Nguyen, Katherine Kedzierska, Nicole L. La Gruta, Philip Bradley & Paul G. Thomas

Nature (2017) doi:10.1038/nature22383


THANKS

Part of this analysis uses parameters derived from nextgen data in publicly released studies. We are grateful to the authors of those studies for making their data available. See the README_db.txt file in ./db/ (available after running setup.py).

The code uses the command line parsing toolkit blargs. See the license and info in external/blargs/

The tables in the .html results can be sorted thanks to tablesorter. See the license and info in external/tablesorter/

Sequence parsing relies on the BLAST suite. See the info in external/blast-2.2.16/


UPDATES

Version 0.0.2 (09/21/2017):