immunomind / immunarch

🧬 Immunarch: an R Package for Fast and Painless Exploration of Single-cell and Bulk T-cell/Antibody Immune Repertoires
https://immunarch.com
Apache License 2.0
297 stars 65 forks source link

Bcell lineage analysis #150

Open fabio-t opened 3 years ago

fabio-t commented 3 years ago

🚀 Feature

Allow plotting of "clonal trees" for bcell data, rooted on the germline.

Motivation

B-cell repertoires are interesting because of SHM - you can group together different clonotypes into a "clonal tree", showing distance from the germline and also intermediate nodes. Most tools for clonal identification will not properly do this, which is a pity. Immcantation has lineage analysis built-in, while MIXCR/MIGEC don't.

Additional context

https://alakazam.readthedocs.io/en/stable/ https://www.frontiersin.org/articles/10.3389/fimmu.2018.02149/full https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007977

vadimnazarov commented 3 years ago

Hello Fabio,

Thank you for this amazing suggestion! Could you please guide me to the details on the issue that "most tools for clonal identification will not properly do this"? Is there a list of issues and/or a description how to properly do that? Our team is looking into BCR at the moment, and your help would be greatly appreciated.

fabio-t commented 3 years ago

Hi,

There's not too much to say I'm afraid :) but I'll try to clarify my position.

One of the public pipelines actually doing lineage analysis is Immcantation, but I can't make it work with my data and, in general, would prefer to keep using the tool stack I've been using for years (MIGEC+MIXCR). VDJTools, developed by the same group of those two, doesn't support lineage analysis either.

I find plenty of papers but no unified, clear pipeline able to produce nice clonal trees from the output of mixcr. Currently I'm trying to write one on my own, but this kind of effort is better invested into public tools like immunarch rather than hacked-together in-house tools.

In very simple terms, a B-cell lineage tool should do the following:

1) take the clonal assignment output (ie, the mixcr output file) and extract V and J gene hits, CDR3 nucleotide sequence, and the V and J alignment string

2) the germline has to be put as root of the tree, and the clonal alignments need to be used to generate a distance matrix (between each other, and with the germline)

3) a proper lineage model identifies both leaves and internal nodes based on CDR3 length and similarity (distance between clones) as well as distance from the germline (the higher the distance, the further away from the tree)

4) the plotting should be fairly flexible, allowing a few options: from the simplest, classic thin-branch phylogenetic tree plot to a more complex circular plot with colour and metadata (for example, one could want to visualise the aaCDR3 sequence in the tree, for visual inspection of how that b-cell evolution affected the actual receptor; also forests, not just trees, should be supported in plotting because we usually look at whole repertoires, not single clones).

The devil is in the details of course - which mutation model to use, what threshold of CDR3 similarity, whether to allow insertion and deletions.. it's quite a complex problem to solve, but a basic implementation could be improved upon over time.

I've found immunarch extremely pleasant to work with for T-cell analysis (to the point that I'm slowly converting all my downstream scripts/plots to immunarch). So it would be great if I could leverage immunarch flexibility for this.

vadimnazarov commented 3 years ago

Hi Fabio,

Thank you so much for a such detailed response, I truly appreciate it. Do you have good publications in mind which could serve as examples of visualisations (especially the "forest" ones) and models / thresholds / indel allowance / etc.? We would like to go with basic implementations like you said, but it would be great to develop a basic implementation that is usable in the most cases if possible.

Again, thank you for your support. We have several people working on this topic, and any help would greatly accelerate the development. We would be glad to send you a pre-release version much earlier than package release to clear up things and see whether the approach we choose works as intended, and make the changes accordingly.

fabio-t commented 3 years ago

Thanks a lot for your answer - and I'll be very glad to beta-test it on our data.

For forests, I think it would be awesome to have a network diagram like this: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4839220/figure/msw015-F2/?report=objectonly This gives a visual, impacting view of the Bcell repertoire(s) being studied, and can also make the concept of "diversity" more human-readable than classic metrics.

For single lineages, something like this would be preferable: https://link.springer.com/article/10.1007/s10875-006-9056-9/figures/5 with the possibility of specifying the text to print for each node. It's usually a good idea to differentiate visually between leaf nodes ("final" clone in a lineage at time of sequencing), observed internal nodes and inferred internal nodes. But it's not absolutely necessary.

Further readings in addition to those included above:

https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-020-06936-w https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-015-0243-2

These should give you enough information to be able to produce a baseline implementation.

fabio-t commented 3 years ago

One first step I'm currently doing, is move from "clones" as defined, for example, by a mixcr pipeline, to "clonal groups" (from which you could then further build lineage trees). I'm putting here some very simple code to illustrate it, feel free to use/modify it if you want.

This groups the original "clones" by V and J identity, then by aaCDR3 similarity (85% in the example, but can be further tweaked). It then creates a consensus sequence of the ntCDR3 and aaCDR3 for each clonal group, and the output tibble could probably be just plugged into the rest of immunarch's procedures to do repertoire analsysis.

It's just a very simplified example, that maybe could be of some use to you.

library(Biostrings)
library(ape)
library(stringr)

consensus_nt <- function(x){consensusString(DNAStringSet(x))}
consensus_aa <- function(x){consensusString(AAStringSet(x))}

# very simple approach - hamming distance between amino-acid sequences of CDR3 and
# UPGMA clustering followed by cutting the tree at a 15% dissimilarity (default)
cloneclust <- function(x, h=0.15) {
  if (length(x) == 1){return(1)}

  aa_seqs <- as.matrix(as.AAbin(AAStringSet(x)))
  seq_dist <- dist.aa(aa_seqs, scaled=T)
  cl <- hclust(seq_dist, method="average")
  cl$height = round(cl$height, digits=8)
  cutree(cl, h=h)
}

d <- immdata$data$Sample1

d %>%
  mutate(len=str_length(CDR3.nt)) %>%
  group_by(V.name, J.name, len) %>%
  mutate(cluster=cloneclust(CDR3.aa, 0.85)) %>%
  summarise(n=n(), Clones=sum(Clones), Proportion=sum(Proportion), CDR3.nt=consensus_nt(CDR3.nt), CDR3.aa=consensus_aa(CDR3.aa)) %>%
  arrange(desc(Clones)) %>%
  as.data.frame()

For lineage trees, one would instead need the sequence from FR1 (start) to FR3 (end), so the V region just before the CDR3 (or in some cases, including the first ~6 bps of the CDR3). This seems to be enough to build decent trees, when rooted with the corresponding germline of the V-hit.

Here is an example from my data, rooted on the germline. You can see both the alignments and the final tree here: http://www.phylogeny.fr/simple_phylogeny.cgi?workflow_id=b02e40313c3ca8c09979a40198a3abd8&tab_index=6

tsvvas commented 2 years ago

Hi Fabio,

My name is Vasily, I am one of the core team members maintaining immunarch. Currently, we are actively researching and implementing clonal lineage analysis of BCR receptors in immunarch.

As you are an active user of immunarch working with BCR analysis, we would like to have a call with you to discuss your inquiry in more details. Could you write an email to contact@immunomind.io to schedule a call with our research team?

This will help us greatly to implement requested features and the tutorial that will guide users of immunarch who analyse BCR receptor sequencing data.

Best, Vasily

fabio-t commented 2 years ago

For sure, thank you @tsvvas.