heartnetkung / XT-neighbor

MIT License

Description

XTNeighbor is a fast, scalable method for nearest neighbor search of adaptive immune receptors (AIRs) on GPUs. In simple terms, the inputs are the CDR3 regions of AIRs, each represented as a string of amino acids, and the algorithm finds all pairs of AIRs whose Levenshtein distance is within a specified threshold. XTNeighbor is orders of magnitude faster than current methods thanks to a symmetric deletion algorithmic approach, GPU acceleration, and memory optimization. A detailed description of the method is provided in our arXiv preprint (arXiv:2403.09010).
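
To make the symmetric deletion idea concrete, here is a minimal CPU-only sketch in Python. It is not the XTNeighbor implementation and does not reflect its GPU kernels or memory layout. The function names (`deletion_variants`, `levenshtein`, `neighbor_pairs`) are hypothetical and chosen for illustration. The general technique: every sequence is indexed under all strings reachable by deleting up to `threshold` characters, sequences that share such a string become candidate pairs, and each candidate is then verified with an exact Levenshtein computation.

```python
from collections import defaultdict
from itertools import combinations


def deletion_variants(seq, max_deletions):
    """Return seq plus every string obtained by deleting up to max_deletions characters."""
    variants = {seq}
    frontier = {seq}
    for _ in range(max_deletions):
        # Delete one more character from every string produced in the previous round.
        frontier = {s[:i] + s[i + 1:] for s in frontier for i in range(len(s))}
        variants |= frontier
    return variants


def levenshtein(a, b):
    """Exact Levenshtein distance using a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def neighbor_pairs(seqs, threshold=1):
    """Return sorted index pairs (i, j), i < j, with Levenshtein distance <= threshold."""
    # Index step: bucket sequence indices by each of their deletion variants.
    buckets = defaultdict(set)
    for idx, seq in enumerate(seqs):
        for variant in deletion_variants(seq, threshold):
            buckets[variant].add(idx)

    # Candidate step: any two sequences sharing a bucket may be neighbors.
    candidates = set()
    for ids in buckets.values():
        candidates.update(combinations(sorted(ids), 2))

    # Verification step: sharing a variant is necessary but not sufficient.
    return sorted((i, j) for i, j in candidates
                  if levenshtein(seqs[i], seqs[j]) <= threshold)


if __name__ == "__main__":
    cdr3s = ["CASSLGQAYEQYF", "CASSLGQAYEQF", "CASSLGAAYEQYF", "CASRPGTEAFF"]
    print(neighbor_pairs(cdr3s, threshold=1))
```

The verification step matters because two sequences can share a deletion variant while being up to twice the threshold apart; the lookup only narrows the search to a small candidate set.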

Quick Usage

This Google Colab notebook lets users try the tool from a web browser without setting up a GPU environment.

Installation

XTNeighbor has been tested with the following environment:

Detailed installation instructions, examples, and testing code are provided via a Google Colab demo.

For an advanced guide to compiling XT-neighbor on bare-bones Linux, read this tutorial.

Usage

xt_neighbor: perform either nearest neighbor search for CDR3 sequences or immune repertoire overlap using GPU-based xt_neighbor algorithm.
    ====================
     Common Options
    ====================
     -d or --distance [number]: distance threshold defining the neighbor (default to 1)
     -o or --output-path [str]: path of the output file (default to no output)
     -m or --measurement [leven|hamming]: distance measurement (default to leven)
     -v or --version: print the version of the program then exit
     -h or --help: print the help text of the program then exit
     -V or --verbose: print extra detail as the program runs for debugging purpose
     -a or --airr: use AIRR format for input-path instead. Relevant fields are cdr3_aa and duplicate_count
    ====================
     Nearest Neighbor Options
    ====================
     -i or --input-path [str] (required): path of csv input file containing exactly 1 column: CDR3 amino acid sequences
     -n or --input-length [number] (required): number of rows given in the input file
    ====================
     Repertoire Overlap Options
    ====================
     -i or --input-path [str] (required): path of csv input file containing exactly 2 columns: CDR3 amino acid sequences and their frequency. Note that the sequences are assumed to be unique
     -n or --input-length [number] (required): number of sequences given in the input file
     -I or --info-path [str] (required): path of csv input file containing exactly 1 column: repertoire sizes. Note that the order of input sequence must be sorted according to this repertoire info
     -N or --info-length [number] (required): number of repertoires given in the info file
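
As an illustration of the nearest neighbor options above, the following Python sketch writes a one-column CSV and assembles a matching command line. Only the flags (`-i`, `-n`, `-d`, `-m`, `-o`) come from the help text. The helper names, the binary location `./xt_neighbor`, the file names, and the assumption that the CSV has no header row are hypothetical, so check them against the Colab demo before relying on them.

```python
import csv
import tempfile
from pathlib import Path


def write_sequences(path, sequences):
    """Write one CDR3 sequence per row and return the row count (the value for -n).

    Assumption: the input file has no header row.
    """
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        for seq in sequences:
            writer.writerow([seq])
    return len(sequences)


def nearest_neighbor_command(input_path, n_rows, output_path,
                             distance=1, measurement="leven",
                             binary="./xt_neighbor"):
    """Build the argument list for a nearest neighbor run.

    Assumption: the compiled binary lives at ./xt_neighbor.
    """
    if measurement not in ("leven", "hamming"):
        raise ValueError("measurement must be 'leven' or 'hamming'")
    return [binary,
            "-i", str(input_path),
            "-n", str(n_rows),
            "-d", str(distance),
            "-m", measurement,
            "-o", str(output_path)]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        input_path = Path(tmp) / "cdr3.csv"
        n_rows = write_sequences(input_path, ["CASSLGQAYEQYF", "CASSLGQAYEQF"])
        command = nearest_neighbor_command(input_path, n_rows, Path(tmp) / "pairs.csv")
        # Pass `command` to subprocess.run(command, check=True) once the binary is built.
        print(" ".join(command))
```

Deriving `-n` from the number of rows actually written keeps the declared input length in step with the file contents.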

Benchmarking and Reproducibility

Deduplication Warning

Documentation

Note on 1.0 and 2.0 Version of the Algorithm

FAQ

Citation

@misc{chotisorayuth2024lightningfast,
      title={Lightning-fast adaptive immune receptor similarity search by symmetric deletion lookup}, 
      author={Touchchai Chotisorayuth and Andreas Tiffeau-Mayer},
      year={2024},
      eprint={2403.09010},
      archivePrefix={arXiv},
      primaryClass={q-bio.QM}
}