immunomind / immunarch

🧬 Immunarch: an R Package for Fast and Painless Exploration of Single-cell and Bulk T-cell/Antibody Immune Repertoires
https://immunarch.com
Apache License 2.0
306 stars 65 forks source link

dbAnnotate for similar CDR3 (e.g. Levenshtein distance) #130

Open marcopretti opened 3 years ago

marcopretti commented 3 years ago

🚀 Feature

Hi guys, I'm new to github, so not sure how/where to post things. I believe a function that uses Levenshtein distance or others would be really useful for the package

Motivation

Some online databases offer the search for CDR3 sequences allowing one or more substitutions in the sequence. I like the usability of the dbAnnotate function, but I needed one that allowed substitutions/mismatches.

Alternatives

For that reason, I wrote this small piece of code below. I'm not an expert in R and sure the code can be improved. I'm sharing so maybe you could implement something similar.

dbAnnotate.dist <- function(query.db, db, dist=2, data.col="CDR3.aa", db.cdr3="cdr3", db.col="cdr3", method="lv", ...){
  require(stringdist, dplyr)
  print(paste("Using",method, "distance <=",dist))
  print("See ?stringdistmatrix for all available methods")

  db.filt <- lapply(query.db, function(x){
    query=x %>% pull(data.col) # get query CDR3 sequences
    db.seq=db %>% pull(db.cdr3) # get db CDR3 sequences

    adist <- stringdistmatrix(query, db.seq, method = method) # distance search
    ind=which(adist <= dist, arr.ind = T) # filter distance matrix

    df=data.table(query=query[ind[,1]], # get original CDR3
                  db=db.seq[ind[,2]], # get db CDR3
                  dist=adist[ind]) # get the distance
    colnames(df)[1:2] = c(data.col, db.cdr3)

    db.filt=db[(db %>% pull(db.cdr3)) %in% (df %>% pull(db.cdr3)),] # annotate the CDR3 sequences with the database
    db.filt=full_join(df, db.filt) # merge with the distance annotations
    return(db.filt)
  })
}

# you can test using
vdjdb.ann=dbAnnotate.dist(immdata_10x.p.filt$data, head(vdjdb,1000), dist = 3)
mcpas.ann=dbAnnotate.dist(immdata$data, head(mcpas,1000), dist = 3, db.cdr3 = "CDR3.beta.aa")
tbadb.ann=dbAnnotate.dist(immdata$data, head(tbadb,1000), dist = 3, db.cdr3 = "CDR3.beta.aa")
vadimnazarov commented 3 years ago

Note to myself: probably related to #131

vadimnazarov commented 3 years ago

Hello @Marco0107 and thank you so much for signing up on GitHub to open an issue, I truly appreciate it!

I'm deeply sorry for the very late response. We've been extremely busy raising investments and hiring the team after finishing UC Berkeley startup accelerator. We are finally set up for now: we grew the team from 4 to 15 amazing professionals, and raised a large investment round from great investors. Thank you so much for your patience!

We are thinking about CDR3-based clustering and fuzzy search, but we need more use cases and applications for it. Would you like to search using amino acid only, or do you consider something like BLOSUM-based alignment?

Improving upon Immunarch, we will soon release our online analytics platform for fast, effortless and code-less reproduction of impactul single-cell transcriptomics and immune repertoire publications. It's designed for researchers who spend a lot of time or energy reproducing sophisticated pipelines or analysing datasets from publications. Feel free to send us a link to publications of your interest so we can reproduce them on our platform and give you an access.

We are more than happy to grant you an early access as our gratitude for you being a longstanding immunarch user.

We have a list of datasets from around 30 publications that we are going to make available for your exploration and analysis. As an example, we recently reproduced on a platform a paper focused on scRNAseq-based estimation of efficacy of CAR-T immunotherapy.

Please send links to support@immunomind.io. The direct link to the publication datasets would be also greatly appreciated.

marcopretti commented 3 years ago

Don't worry @vadimnazarov and thank you for your response.

I consider search for amino acid only since really similar sequences could be recognizing the same epitope. BLOSUM-based alignment wouldn't be of interest (mine at least). I believe more than 3-4 divergent amino acids from the query sequence would be too much (to be discutted, depends on the position). However, CDR3-based clustering sounds really interesting..

I'm happy to see you are improving the tool and I can't wait to see it ;)