pyVDJ

Note: this project is unmaintained. Feel free to fork it or re-use any part of it (license: GPLv3). Alternatives to this program are scirpy and Immunarch.

V(D)J sequencing data analysis

This package adds 10x Genomics V(D)J sequencing data to an AnnData object's .uns part, and also makes annotation columns in .obs. This enables plotting various V(D)J properties and handling mRNA (GEX) and V(D)J sequencing data together.

Install

pip install pyvdj

Install the latest version from GitHub:

pip install git+https://github.com/veghp/pyVDJ.git

Usage

import pyvdj
adata = pyvdj.load_vdj(samples, adata)
adata = pyvdj.add_obs(adata, obs=['is_clone'])

For a detailed description, see the tutorial.

Details

The package provides the functions described in the sections below.

Read metrics

The read10xsummary function requires a list of paths to metrics_summary.csv files, and optionally a dictionary of path:samplename. It returns a dataframe of the metrics.

Load V(D)J data

The load_vdj function loads 10x V(D)J sequencing data (filtered_contig_annotations.csv files) into an AnnData object's .uns['pyvdj'] slot, and returns the object. The adata.uns['pyvdj'] slot is a dictionary which has the following elements:

If an AnnData object is not supplied, the function returns the dictionary instead.

Arguments:

Add annotations

adata.uns['pyvdj']['df'] is a pandas DataFrame of the V(D)J data, with two additional columns that contain unique cell barcode and clonotype labels. These are generated from the user-supplied sample names: cellbarcode + '_' + samplename and clonotype + '_' + samplename.

These unique cell names are used to match the V(D)J cells to the AnnData .X cells, using adata.obs['vdj_obs']. The user has to prepare this column using the cell barcodes and the sample names.
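As a minimal sketch of how the vdj_obs labels can be prepared, the snippet below builds cellbarcode + '_' + samplename strings from two parallel lists. The helper name make_vdj_obs, the barcodes and the sample names are hypothetical and only serve as illustration; the barcodes are assumed to be in the same form as in the contig annotation files.

```python
def make_vdj_obs(barcodes, sample_names):
    """Return cellbarcode + '_' + samplename for each cell (hypothetical helper)."""
    if len(barcodes) != len(sample_names):
        raise ValueError("barcodes and sample_names must have the same length")
    return [f"{bc}_{sample}" for bc, sample in zip(barcodes, sample_names)]


# Hypothetical barcodes and sample names, for illustration only.
barcodes = ["AAACCTGAGAAACCAT-1", "AAACCTGAGAAACCAT-1", "AAACCTGCAGTCACTA-1"]
sample_names = ["sampleA", "sampleB", "sampleB"]

vdj_obs = make_vdj_obs(barcodes, sample_names)

# With an AnnData object, the list would then be stored in the column
# that pyVDJ uses for matching, e.g.:
# adata.obs['vdj_obs'] = vdj_obs
```

The same barcode can occur in several 10x samples, which is why the sample name is appended: the combined label stays unique across samples.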

The add_obs function can add the following annotations:

Definitions

The above definitions are understood in the context of the sequenced cells.

*As determined by Cell Ranger.

**Note that Cell Ranger v2 does not assign a clonotype id to clonotypes with only 1 clone, but uses ‘None’. Cell Ranger v3 does assign a clonotype id to all cells.

CDR3 specificity

We can retrieve CDR3 amino acid sequences for given clonotypes using

pyvdj.get_spec(adata, clonotypes=['clonotype1_sampleA', 'clonotype3_sampleB'])

which returns a dictionary. This can be used to find specificity in CDR3 databases, such as VDJdb or McPAS-TCR.

Clonotype statistics

We can generate and plot various statistics on clonotypes and diversity.

adata = pyvdj.stats(adata, meta)

This function adds a dictionary of statistics on the VDJ data (adata.uns['pyvdj']['stats'][meta]), grouped by categories in the adata.obs[meta] column. Keys:
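To illustrate the kind of grouped statistics meant here, the following sketch counts cells and clonotypes per category and computes a Shannon diversity index. It is not the implementation of pyvdj.stats; the function name clonotype_stats and the keys n_cells, n_clonotypes and shannon are assumptions made for this example and may differ from the keys the package stores.

```python
import math
from collections import Counter, defaultdict


def clonotype_stats(clonotypes, groups):
    """Per-group clonotype statistics (illustrative sketch, not pyvdj.stats)."""
    per_group = defaultdict(Counter)
    for clonotype, group in zip(clonotypes, groups):
        per_group[group][clonotype] += 1

    stats = {}
    for group, counts in per_group.items():
        n_cells = sum(counts.values())
        # Shannon entropy (natural log) of the clonotype frequency distribution.
        shannon = -sum(
            (c / n_cells) * math.log(c / n_cells) for c in counts.values()
        )
        stats[group] = {
            "n_cells": n_cells,
            "n_clonotypes": len(counts),
            "shannon": shannon,
        }
    return stats


# Hypothetical clonotype labels and annotation categories, one entry per cell.
clonotypes = ["clonotype1_A", "clonotype1_A", "clonotype2_A", "clonotype1_B"]
groups = ["A", "A", "A", "B"]

stats = clonotype_stats(clonotypes, groups)
```

In this sketch, groups plays the role of the adata.obs[meta] column, and the returned dictionary is keyed by its categories.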

Public and private CDR3 sequences

We can find TCR-specificity shared between samples, donors or any other annotation category.

adata = pyvdj.find_clones(adata, sample_dict)

This function returns the AnnData object with a clonotype annotation in which clonotypes shared between 10x samples from the same donor (organism, individual) are combined under a single clonotype ID. sample_dict is a dictionary of sample:donor that maps 10x samples (channels, as specified when the 10x V(D)J data was loaded) to donors.
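The sample:donor mapping can be sketched as follows, together with one possible way of separating public from private CDR3 sequences. This is not how pyvdj.find_clones is implemented; the helper split_public_private, the sample and donor names and the CDR3 sequences are hypothetical, and "public" is taken here to mean "observed in more than one donor".

```python
from collections import defaultdict


def split_public_private(cdr3_by_sample, sample_dict):
    """Split CDR3 sequences into public (>1 donor) and private (1 donor) sets."""
    donors_by_cdr3 = defaultdict(set)
    for sample, cdr3s in cdr3_by_sample.items():
        donor = sample_dict[sample]  # look up the donor of this 10x sample
        for cdr3 in cdr3s:
            donors_by_cdr3[cdr3].add(donor)

    public = {cdr3 for cdr3, donors in donors_by_cdr3.items() if len(donors) > 1}
    private = set(donors_by_cdr3) - public
    return public, private


# sample:donor dictionary, in the shape described for sample_dict.
sample_dict = {"sampleA": "donor1", "sampleB": "donor1", "sampleC": "donor2"}

# Hypothetical CDR3 amino acid sequences per sample.
cdr3_by_sample = {
    "sampleA": {"CASSLGTDTQYF", "CASSPGQGNEQFF"},
    "sampleB": {"CASSLGTDTQYF"},
    "sampleC": {"CASSPGQGNEQFF", "CAWSVGETQYF"},
}

public, private = split_public_private(cdr3_by_sample, sample_dict)
```

A sequence found in two samples of the same donor stays private in this sketch, because sharing is counted per donor and not per sample.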

CDR3-similarity graph

A set of prototype functions build CDR3-similarity graphs using Levenshtein distances. The nodes are the CDR3 sequences, and edges connect nodes with Levenshtein distance of 1.

cdr3_dict = pyvdj.get_cdr3(adata)  # get CDR3s for each sample
dist = pyvdj.get_dist(cdr3_dict, sample)  # calculate distances (adjacency matrix)
g = pyvdj.graph_cdr3(dist)  # returns an igraph graph object.

This requires the python-Levenshtein and python-igraph packages.
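For readers who want to see the idea without the extra dependencies, here is a dependency-free sketch of the graph rule: unique CDR3 sequences are nodes, and two nodes are connected when their Levenshtein distance is exactly 1. The functions levenshtein and distance_one_edges are written for this example and are not part of pyVDJ; the sequences are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def distance_one_edges(sequences):
    """Return sorted pairs of unique sequences whose edit distance is exactly 1."""
    nodes = sorted(set(sequences))
    return [
        (s, t)
        for i, s in enumerate(nodes)
        for t in nodes[i + 1:]
        if levenshtein(s, t) == 1
    ]


# Hypothetical CDR3 amino acid sequences (duplicates are collapsed to one node).
cdr3s = ["CASSLGTDTQYF", "CASSLGTDTQYF", "CASSLGSDTQYF", "CASSLGTDTQY", "CAWSVGETQYF"]
edges = distance_one_edges(cdr3s)
```

The pairwise comparison is quadratic in the number of unique sequences, which is acceptable for a sketch but slow for large repertoires.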

Versions

The pyVDJ project uses the semantic versioning scheme. The latest release is v0.1.2.

License

pyVDJ is free software, which means the users have the freedom to run, copy, distribute, study, change and improve the software.

For more on this, see the Free Software Foundation website.

Dependencies

The package was originally developed for data made with Cell Ranger v2.1.1 (Chemistry: Single Cell V(D)J; V(D)J reference: GRCh38-alts-ensembl) and has been tested to work with Cell Ranger v3.1.0 data, with the following Python (v3.6.9) package versions:

pandas 0.25.1
anndata 0.6.21
scanpy 1.4.3