Thanks for your interest in TCRdist! This software is very much a work in progress, so we welcome any/all feedback and will work to add features and fix bugs as quickly as we can.
If you would like to contribute to the project on GitHub, feel free to contact us.
Phil Bradley (pbradley@fredhutch.org) Jeremy Chase Crawford (Jeremy.Crawford@stjude.org)
See LICENSE.txt for details on the (MIT) license.
I use anaconda to keep these updated on my system; I think pip will also work, although it sounds like mixing and matching anaconda and pip can be problematic.
scipy: https://www.scipy.org/install.html tested with version 0.16.0
scikit-learn: aka "sklearn", used for KernelPCA and adjusted_mutual_info_score http://scikit-learn.org/stable/install.html tested with version 0.17
matplotlib: tested with version 1.4.3
numpy: tested with version 1.10.1
convert (or rsvg-convert or inkscape): convert, from ImageMagick, is used to convert svg files to png files. If you have an alternative, you can modify the function "convert_svg_to_png" in basic.py.
wget (or curl): used for downloading the database and other files. If you have something else that works similarly on your system, feel free to modify setup.py or contact me to add that as an option.
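A quick way to see which of the dependencies listed above are visible on a system is sketched below. It is not part of TCRdist: the function name report_dependencies is hypothetical, the import names scipy, sklearn, matplotlib and numpy are assumed, and the sketch is written for Python 3 (it relies on shutil.which).

```python
import importlib
import shutil

# Import names assumed for the Python packages listed above.
PYTHON_MODULES = ["scipy", "sklearn", "matplotlib", "numpy"]

# Each group lists interchangeable command line tools; one per group is enough.
TOOL_GROUPS = {
    "svg to png conversion": ["convert", "rsvg-convert", "inkscape"],
    "file download": ["wget", "curl"],
}


def report_dependencies():
    """Return (modules, tools): installed versions and tools found on PATH."""
    modules = {}
    for name in PYTHON_MODULES:
        try:
            module = importlib.import_module(name)
            modules[name] = getattr(module, "__version__", "unknown")
        except ImportError:
            modules[name] = None  # package is not installed
    tools = {}
    for purpose, candidates in TOOL_GROUPS.items():
        # Keep only the candidates that resolve to an executable on PATH.
        tools[purpose] = [c for c in candidates if shutil.which(c)]
    return modules, tools


if __name__ == "__main__":
    modules, tools = report_dependencies()
    for name, version in modules.items():
        print(name, version if version else "NOT FOUND")
    for purpose, found in tools.items():
        print(purpose, ", ".join(found) if found else "NOT FOUND")
```

The versions printed can then be compared by eye with the "tested with" versions above; newer versions may or may not behave identically.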
1) Go to the tcr-dist/ directory (main code directory)
2) run the command:
python setup.py
3) cross your fingers.
There are some potentially useful comments at the top of setup.py
For an overview of what the analysis is supposed to do, consult the publication listed below in the "CITING" section.
The basic workflow starting from a sequence file would be to run
python run_basic_analysis.py --organism <organism> --pair_seqs_file <filename>
where <organism> is either mouse or human, and <filename> is the name of a .tsv (tab-separated values) file with the following fields:
id epitope subject a_nucseq b_nucseq a_quals b_quals
id: Unique identifier for the TCR
epitope: The name of the epitope to which the TCR binds
subject: The individual from whom the TCR was sampled; necessary for identifying clones and analyzing repertoire privacy
a_nucseq, b_nucseq: The nucleotide sequences of the TCR alpha and beta chain reads
a_quals, b_quals: Read quality information in the form of '.'-separated lists of the quality scores for the corresponding nucleotide sequences (a_nucseq and b_nucseq), used for read quality filtering (e.g., 25.32.35.36.45.45.36)
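The '.'-separated quality strings described above can be turned into lists of integers as sketched below. The helper names parse_quals and quals_match_sequence are hypothetical, and the check that there is one score per nucleotide is an assumption for illustration; neither is taken from the TCRdist source.

```python
def parse_quals(quals):
    """Convert a '.'-separated quality string such as '25.32.35' to a list of ints."""
    return [int(score) for score in quals.split(".")]


def quals_match_sequence(nucseq, quals):
    """Assumed sanity check: exactly one quality score per nucleotide."""
    return len(parse_quals(quals)) == len(nucseq)


# The example quality string from the field description above.
print(parse_quals("25.32.35.36.45.45.36"))
```

A check like quals_match_sequence could be run on each row of the .tsv file before starting the pipeline, to catch truncated quality strings early.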
The id, a_quals, and b_quals fields can be omitted; see the help message printed by running the command:
python run_basic_analysis.py -h
The results of the analysis are summarized in an html file and associated directory of images which will be generated by the script. The default name (which can be changed with the --webdir option) will look something like <filename>_web/index.html
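When the analysis is launched from another Python script, the command described above can be assembled as in the sketch below. The helper build_analysis_command and the file name my_tcrs.tsv are hypothetical, and the sketch assumes that --webdir takes the output directory name as its argument; check the -h output before relying on it.

```python
import subprocess

VALID_ORGANISMS = ("mouse", "human")


def build_analysis_command(organism, pair_seqs_file, webdir=None):
    """Build the argument list for run_basic_analysis.py without running it."""
    if organism not in VALID_ORGANISMS:
        raise ValueError("organism must be one of %s" % (VALID_ORGANISMS,))
    cmd = [
        "python", "run_basic_analysis.py",
        "--organism", organism,
        "--pair_seqs_file", pair_seqs_file,
    ]
    if webdir is not None:
        # Assumption: --webdir is followed by the desired output directory name.
        cmd += ["--webdir", webdir]
    return cmd


print(" ".join(build_analysis_command("mouse", "my_tcrs.tsv")))

# To actually run the pipeline, call this from inside the tcr-dist/ directory:
# subprocess.check_call(build_analysis_command("mouse", "my_tcrs.tsv"))
```

Passing the arguments as a list avoids shell quoting problems when the sequence file name contains spaces.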
You can also run individual steps in the analysis from the command line. Running a script with the -h option will print a very basic help message listing the command line arguments. We are currently working hard to flesh out the help messages -- our apologies for the lack of clarity. Some scripts depend on the output of previous steps (for example, compute_distances.py generates distance matrices which are used by downstream programs). The source code for run_basic_analysis.py gives an example of an appropriate order for calling the scripts.
(After running setup.py)
In the tcr-dist/ directory you can run the command
python create_and_run_testing_datasets.py
which will create a directory called testing/, make two small dataset files there, and run the analysis pipeline on them.
Running setup.py should have created a directory called testing_ref/ which should contain examples of the *_web/index.html output generated by the pipeline. So if you compare the results files that correspond between testing/*_web/index.html and testing_ref/*_web/index.html, they should look pretty similar. You can also search for the text "missing" in those html results to see if any of the results files are missing from the testing/ version but present in the testing_ref/ version.
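The search for the text "missing" described above can be automated with a short script such as the sketch below. The function name count_missing is hypothetical, and the sketch assumes the results follow the *_web/index.html pattern mentioned above and that a plain substring count is a good enough signal.

```python
import glob
import os


def count_missing(root):
    """Map each <root>/*_web/index.html path to its count of the text 'missing'."""
    counts = {}
    pattern = os.path.join(root, "*_web", "index.html")
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8", errors="replace") as handle:
            counts[path] = handle.read().count("missing")
    return counts


if __name__ == "__main__":
    # Run from the tcr-dist/ directory after both directories exist.
    for root in ("testing", "testing_ref"):
        for path, n_missing in count_missing(root).items():
            print(path, n_missing)
```

A higher count for a file under testing/ than for its counterpart under testing_ref/ suggests that some results files were not generated in the test run.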
You can also re-run the analysis in the original paper by typing:
python rerun_paper_analysis.py
which may take a couple of hours. Add the --from_pair_seqs option to restart from nucleotide sequences, and/or the --multicore option to let the pipeline spawn multiple, independent processes.
Quantifiable predictive features define epitope-specific T cell receptor repertoires
Pradyot Dash, Andrew J. Fiore-Gartland, Tomer Hertz, George C. Wang, Shalini Sharma, Aisha Souquette, Jeremy Chase Crawford, E. Bridie Clemens, Thi H. O. Nguyen, Katherine Kedzierska, Nicole L. La Gruta, Philip Bradley & Paul G. Thomas
Nature (2017) doi:10.1038/nature22383
Part of this analysis uses parameters derived from nextgen data in publicly released studies. We are grateful to the authors of those studies for making their data available. See the README_db.txt file in ./db/ (after running setup.py).
The code uses the command line parsing toolkit blargs. See the license and info in external/blargs/
The tables in the .html results can be sorted thanks to tablesorter. See the license and info in external/tablesorter/
Sequence parsing relies on the BLAST suite; see info in external/blast-2.2.16/
New sequence database system that makes it easier to work with alternate gene sets
Preliminary support for gamma-delta TCRs: edit the db_file field of the pipeline_params dictionary stored in basic.py. (This is a temporary hack; we will likely move to switching by command line flag sometime soon.)
New all_genes dictionary in all_genes.py, indexed by organism, that holds information on all the genes; it's read from the db_file pointed to by basic.py.
With new minor updates to the probability model and default sequence database, we're no longer trying to preserve exact numerical identity with the results from the paper. To get the classic results, you can check out the version_001 branch on GitHub.