rnabioco / djvdj

An R package to analyze single-cell V(D)J data
https://rnabioco.github.io/djvdj

Clone similarity between samples #100

Closed potterae closed 1 year ago

potterae commented 1 year ago

Hi, thank you for the program, it is very useful. We would like to compare TCR/BCR sequences between samples to see which CDR3 sequences are shared. Is there a way to generate a list of CDR3 sequences shared across multiple samples in a series, and perhaps the percentage that are shared as well? Using calc_similarity() with abdiv::jaccard on two samples that, by eye, appeared to have many overlapping TCR CDR3 sequences, I got a Jaccard value of 0.9712919 (which indicates mostly dissimilar?). Thank you.

Andrew

sheridar commented 1 year ago

Hi @potterae, sorry about the delayed response. Currently, there is no function to provide a list of shared clonotypes (or CDR3 sequences), but this is something I've thought about and may include when the package is released.
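In the meantime, shared sequences can be listed with base R once the per-cell metadata is in a data frame. The sketch below is not part of djvdj. The toy data frame `meta` and its columns `sample` and `cdr3` are assumptions standing in for the real metadata, and "percent shared" is defined here, as one possible choice, as the fraction of unique CDR3 sequences found in every sample.

```r
# Toy per-cell metadata (hypothetical); in practice this would be pulled
# from the object's metadata, one row per cell
meta <- data.frame(
  sample = c("s1", "s1", "s1", "s2", "s2", "s3", "s3", "s3"),
  cdr3   = c("CASSL", "CASSQ", "CASSL", "CASSL", "CASRT", "CASSL", "CASSQ", "CAWSV"),
  stringsAsFactors = FALSE
)

# Drop cells without V(D)J data
meta <- meta[!is.na(meta$cdr3), ]

# Unique CDR3 sequences observed in each sample
cdr3_by_sample <- lapply(split(meta$cdr3, meta$sample), unique)

# Sequences present in every sample
shared_all <- Reduce(intersect, cdr3_by_sample)

# Percent of all unique sequences that are found in every sample
pct_shared <- 100 * length(shared_all) / length(unique(meta$cdr3))

# Sequences shared between each pair of samples
pairs <- combn(names(cdr3_by_sample), 2, simplify = FALSE)
shared_pairs <- lapply(pairs, function(p) {
  intersect(cdr3_by_sample[[p[1]]], cdr3_by_sample[[p[2]]])
})
names(shared_pairs) <- vapply(pairs, paste, character(1), collapse = "_vs_")
```

With this toy input, `shared_all` contains only "CASSL" and `pct_shared` is 25, since one of the four unique sequences appears in all three samples.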

How are you running calc_similarity()? By default this function calculates overlap using the 'clonotype_id' column. This means some cells may share a CDR3 sequence for one of the chains but still belong to different clonotypes, since the assigned clonotype IDs are based on both chains.

I have rewritten some parts of calc_similarity() to allow the user to specify a chain, which would let you easily quantify the overlap between CDR3 sequences (not just clonotypes). However, I am still testing some of these updates and have not merged the changes yet (they will be included in the near future).

A workaround to calculate CDR3 overlap (instead of clonotype overlap) between samples is to first filter the chains to only include the alpha or beta chain and then calculate overlap:

```r
# to calculate similarity based on TRB CDR3 sequence, first filter to only keep TRB chains
# as a precaution remove data for any cells that have multiple TRB chains
# next run calc_similarity, but for the clonotype_col argument specify the column containing CDR3 sequences
obj %>%
  filter_vdj(chains == "TRB") %>%
  filter_vdj(length(chains) == 1) %>%
  calc_similarity(
    cluster_col = "sample",
    clonotype_col = "cdr3",
    return_mat = TRUE
  )
```

potterae commented 1 year ago

```r
plot <- calc_similarity(
  input = Patient4.sub,
  cluster_col = "Biopsy",
  method = abdiv::jaccard,
  clonotype_col = "TCR_cdr3",
  return_mat = TRUE
)
```

Hi Ryan, thank you for your response! I've actually been running it on an integrated dataset with clonotype_col set to TCR_cdr3 (the column was made with a prefix). It looks like clonotype_id corresponds to the cellranger output for each sample, so different clones in our dataset can carry the same label, for example "clonotype1". For that reason I've used the actual CDR3 sequences in the TCR_cdr3 column to calculate similarity. As this is a pre-filtered dataset (not containing all barcodes), I also plan to merge in the unfiltered data so that all clones corresponding to a cell barcode are included. It would be great to have a breakdown by chain in addition to the entire clonotype, so thank you for providing the workaround. Having a list of overlapping clones would also be awesome. I took a closer look at the Jaccard dissimilarity statistic and the clonotypes in our dataset, and the output looks good. Thank you!
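For reference, the arithmetic behind the statistic can be reproduced by hand. The sketch below assumes that `abdiv::jaccard` returns a presence/absence distance (1 minus the shared fraction of the union), which matches how it behaved here but is worth confirming against the abdiv documentation. The two CDR3 vectors are made up for illustration.

```r
# Hypothetical unique CDR3 sequences from two samples
a <- c("CASSL", "CASSQ", "CASRT", "CAWSV")
b <- c("CASSL", "CASSE", "CASGT")

# Jaccard index: shared sequences divided by all sequences seen in either sample
jaccard_index <- length(intersect(a, b)) / length(union(a, b))

# Jaccard distance (dissimilarity): values near 1 mean little overlap
jaccard_dist <- 1 - jaccard_index
```

Here one of six distinct sequences is shared, so the distance is 5/6. Read the same way, a distance of 0.9712919 would mean that about 2.9% of the sequences found in either sample are present in both.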

sheridar commented 1 year ago

Another way to deal with multiple cellranger runs is to use the 'define_clonotypes' argument for import_vdj(). This will set new clonotype IDs based on the combined CDR3 sequences and/or VDJ segments for each cell.
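The idea behind redefining clonotype IDs can be illustrated with a toy example in base R. This is only a sketch of the concept, not how import_vdj() implements it. The data frame `cells`, its columns, and the "clone" ID prefix are all hypothetical.

```r
# Toy per-cell table from two cellranger runs (hypothetical data)
cells <- data.frame(
  run          = c("run1", "run1", "run2", "run2"),
  clonotype_id = c("clonotype1", "clonotype2", "clonotype1", "clonotype2"),
  cdr3         = c("CAVRD;CASSL", "CAASG;CASSQ", "CALSE;CASRT", "CAVRD;CASSL"),
  stringsAsFactors = FALSE
)

# The per-run labels collide: "clonotype1" covers two different receptors
same_label <- cells$cdr3[cells$clonotype_id == "clonotype1"]

# Assign one ID per unique combined CDR3 string, consistent across runs
cells$new_id <- paste0("clone", match(cells$cdr3, unique(cells$cdr3)))
```

After this step the first cell of run1 and the second cell of run2 receive the same ID because their combined CDR3 strings match, even though cellranger labeled them "clonotype1" and "clonotype2".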

I'm going to close this since it sounds like your issue is resolved. If you run into any other issues or have any other questions/suggestions, please feel free to open a new issue.