immunomind / immunarch

🧬 Immunarch: an R Package for Fast and Painless Exploration of Single-cell and Bulk T-cell/Antibody Immune Repertoires
https://immunarch.com
Apache License 2.0
312 stars 65 forks source link

circos plot or matrix reporting combinatorial usage of V and J chains (D?) #103

Open splaisan opened 4 years ago

splaisan commented 4 years ago

🚀 Feature

a function producing matrices or circos plot for a given sample with the different V-J combinations as edges and number of clones/reads as weight as done in vdjtools. (+ option for filtering low-representation combinations in the process)

Or a function reporting a matrix of V x J with frequencies/counts which would be enough as the circos plotting is apparently already there.

Motivation

provide this missing plot to complete the toolbox when compared to vdjtools

splaisan commented 4 years ago

I have create a function to do it based on what I found in circlize doc and the vis_circos buitl-in (not very documented!)

I am not able to submit this code as a pull request but if someone could do or add this in the tutorial, you are welcome to take my code (when it is correct of course)

create input from one single sample (eg A2-i129) data from a immunarch dataset (immdata

data(immdata)
mydat <- immdata$data$`A2-i129`

creates a circos plot of V and J usage from the data

circos.clear()
circos.par("canvas.xlim" = c(-1.25, 1.25), "canvas.ylim" = c(-1.25, 1.25))
vis_genus(mydat, title="A2-i129_sample")

221c9854-65cf-4f73-a9ff-bddd0b8d76ff

function used above

vis_genus <- function(x, title="", ...) {
# require("stringr", "circlize")
V <- sort(unique(unlist(strsplit(paste(x$V.name, collapse=","), ","))))
J <- sort(unique(unlist(strsplit(paste(x$J.name, collapse=","), ","))))

# create and fill matrix of pairwise with sum(Clones)
mat <- matrix(nrow = length(J), ncol = length(V))
colnames(mat) <- V
rownames(mat) <- J
for (j in seq(1:length(J))) {
  for (v in seq(1:length(V))) {
  sub <- x[(str_detect(x$V.name, V[[v]], negate = FALSE) & 
               str_detect(x$J.name, J[[j]], negate = FALSE)),]
  mat[j,v] <- sum(sub$Clones)
  }
}

# convert to proportions / fraction of total
mat <- mat/sum(mat)

# reorder by V and J decreasing order (left to right)
vmax <- colSums(mat)
jmax <- rowSums(mat) 
mat2 <- mat[order(jmax, decreasing = FALSE), order(vmax, decreasing = TRUE)]

# use circlize
chordDiagram(mat2, annotationTrack = "grid", ...)

# add legends
circos.track(track.index = 1, panel.fun = function(x, y) {
  circos.text(CELL_META$xcenter, 
              CELL_META$ylim[1], 
              CELL_META$sector.index, 
              facing = "clockwise", 
              niceFacing = TRUE,
              adj = c(-0.25, 0.5))
}, bg.border = NA)

# add title
title(title)
}
J-Noonan commented 4 years ago

Hi @splaisan - Hope you're well!

Thanks for making this, its a fantastic way to represent TCR data. I'm currently working out also how to modify this for data with paired alpha-beta sequences. If I'm successful Ill make sure to share.

I'm a Bioinformatics/R newbie and I was hoping you might further explain your function to me. My interpretation is that to appropriately represent the data, at some point you would need to multiply each V-J pairing (or row of data) by the "Clones" column in the sample data. I note that there is a calculation involving sub$Clones, but being unfamiliar with R I'm not 100% sure if this is doing what I think.

The reason I ask is that when I apply this to my own data, it seems to over represent rare clones and under represent common clones and I haven't been able to work out why. Looking at the graph produced on the A2-i129_sample - You would expect a striking feature to be the TRBV4-1 TRBJ2-2 and TRBV4-1 TRBJ2-3 pairings, which have the most clones for this sample. Whilst you can see a clear link between these, the circos suggests a higher representation of the TRBV2 and TRBJ2-5 pairing. But, unless im mistaken, it appears only 103 clones have this pairing. Any light you could shed on this would be great.

Thanks!

J