ncborcherding / scRepertoire

A toolkit for single-cell immune profiling
https://www.borch.dev/uploads/screpertoire/
MIT License
301 stars 49 forks source link

Statistics and plots based on TRA OR TRB chain #71

Closed rcarrr closed 3 years ago

rcarrr commented 3 years ago

Hi, I successfully performed TCR repertoire analysis on my single cell dataset. Now I would like to perform some statistics and visualize them based only on one of the two chains (for example calculate cloneType frequency considering only TRA chain and visualize it through DimPlot seurat function). There are some cases in which a same TRA rearrangement could be coupled with different TRB chains and if we look at the pair we lose this information on TRA chain. Is there a way to perform it?

Thanks a lot!

ncborcherding commented 3 years ago

Hey rcarr,

Apologies no built in functions for single-chain as of now.

However, all the chain information is stored in the meta data after combinedExpression(), you will just need to split the chain/nt/aa string using something like:

nt <- stringr::str_split(seurat.object[[]]$CTnt, "_", simplify =T)[,1] #Column 1 will get you the TCRA chain

You can then calculate clonotype frequency by just using the below code and eventually appending your meta data.

freq <- data.frame(table(nt))

Let me know if you have any more questions, more than happy to work through getting 1 chain analyses set up for you.

Thanks, Nick

GBeattie commented 3 years ago

Hey, just a quick double check on this for my own piece of mind.

I used your approach above to split the TRA + TRB chains for use in my seurat objects metadata. I do some work comparing bulk TCR with scTCR so what I used to do is grep for the TRB found in bulk within the CTaa metadata, however I've now noticed that occasionally a TRB sequence is found in the TRA slot of CTaa. I'm guessing this can occur naturally, and it's not a mixup of which side of the "_" each chain goes?

Quick image to show a search for a bulk TRB chain inside CTaa, most are on the right (TRB) side, but a handful are on the left (TRA) side. image

Thanks in advance for any assistance!

ncborcherding commented 3 years ago

Hey Gordan,

Thanks for following up on this and trying it out!

Do you have the TCR gene information for the possibly flipped sequences? Is the sequence in the TCRA column contain TCRB genes in the CTnt variable? That will help tremendously in terms of narrowing down the issue.

Thanks, Nick

GBeattie commented 3 years ago

No problem, I've done a little digging and it looks like the chains have been flipped, so using a str_split to extract TRA + TRB won't work, here's how I checked.

I searched for the cell above in the 10x VDJ output that had the switched chains and that highlighted cdr3 is a TRB, not a TRA View(vdj.list[["LTX019"]][, c("barcode", "chain", "cdr3")]) image

I checked the CTgene for that cell also, seems to be in the right order (TRA_TRB), but no TRA associated genes found

seu$CTgene["LTX019_AGCAGCCAGAATAGGG-1"]
       LTX019_AGCAGCCAGAATAGGG-1 
"NA_TRBV20-1.TRBJ1-1.None.TRBC1" 

Here's the CTnt variable for that cell also

seu$CTnt["LTX019_AGCAGCCAGAATAGGG-1"]
                                                        LTX019_AGCAGCCAGAATAGGG-1 
"TGTGCCAGCAGCTTCCTCTCAGGAACAGATACGCAGTATTTT_TGCAGTGCTAGAGATCTGGCATCCGGAGCTTTCTTT" 
ncborcherding commented 3 years ago

Hey Gordan,

This makes me wonder - do all 4 chains (ignoring the multi) that you are showing below are listed as productive? I ask because I can see an instance where if there are only 2 chains that are productive both TCRB (or TCRA) that they can be incorporated in the opposite position.

image

The internal parseTCR() function that is used in the combineTCR() assumes if you have two chains, that one is TCRA and the other is TCRB, but this might need to be changed as newer versions of Cell Ranger are reporting more contigs for each barcode.

Thanks for helping me troubleshoot this! I really appreciate it.

Nick

GBeattie commented 3 years ago

Yes all the chains for that cell are productive, see below;

View(vdj.list[["LTX019"]][, c("barcode", "productive", "chain", "cdr3")]) image

No problem, great package, can't be easy with CellRanger changing their outputs (a third party ran CellRanger, otherwise I'd tell you the version used, can find out if important)

ncborcherding commented 3 years ago

Hey Gordon,

Sorry for the confusion, I was asking if the productive header had all 4 contigs listed as TRUE, if not, they are filtered out in the combine TCR problem. Which is what I suspect, because right now, scRepertoire only handles 3 chains per barcode. If there were 4 or more, the resulting output would be NA for that barcode.

So you problem could be solved I think if you the filterMulti parameter to find the top TCRB and TCRB chains (removes the lowly expressed chains):

combineTCR(...filermulti = TRUE)

I also updated the parseTCR internal function from assuming that if 2 chains are fed into the combineTCR() function, there is no longer the assumption that one is TCRA and TCRB, there is now a conditional verification so this does not happen to other analyses.

Old Code:

Screen Shot 2021-05-09 at 7 39 29 AM

New Code:

Screen Shot 2021-05-09 at 7 40 01 AM

I am going to close this comment, but if things persist, more than happy to reopen and keep troubleshooting.

Thanks again for pointing out the issue.

Nick

GBeattie commented 3 years ago

Thanks for taking a closer look, I was using filtermulti so not sure how there was still an issue. We're now thinking for this analysis to ignore TRA so this should be less of an issue for us now (I'll just remove TRAs from the cellranger output directly), I will make sure to keep an eye on this though and I'll update here if I notice anything.

GBeattie commented 3 years ago

Hey, I updated my version of scRepertoire anyway, and also removed TRBs as I mentioned I would. Seeing lots of strange behaviour in terms of the TRA_TRB ordering (despite there being no TRA)

The example cell has two TRB chains detected but no cdr3b for one of them (why Cellranger called this one as productive I'm not sure), the detected cdr3 ends up in the TRA side of CTaa. There are also several cells with both TRA and TRB slots filled, despite them both being TRBs. image

image

ncborcherding commented 3 years ago

Dear Gordon,

Well shoot, thought I had it fixed, but thanks for the follow up.

Things I adjusted this morning

  1. scRepertoire now filters out sequences reported as "None" in the CDR3 sequence - this must be one of the subtle changes in Cell Ranger because I cannot find an instance of it in any of my TCR or BCR sequences I have.
  2. Added an additional conditional to the parseTCR() function to ensure all sequences are now evaluated in the context of either being TCRA or TCRB.
  3. Added a small change to the filterMulti parameter to ensure the highest expressing TCRA and highest expressing TCRB are now filtered for.

Please reload the scRepetoire github package and see if this fixes the results, I will keep this issue open until I get a confirmation.

Thanks for bearing with me, Nick

GBeattie commented 3 years ago

No problem, glad to help!

I reinstalled and ran with TRA filtered out, still getting some TRB on the TRA side. Just double checking that I'm installing correctly so I get the version with your changes, I used devtools::install_github("ncborcherding/scRepertoire") to install and the version is 1.0.3.

ncborcherding commented 3 years ago

Hey Gordon,

Yes that is the correct version - v1.0.3.

Would you mind sending me a reproducible example via email (either a zip or some cloud link)? I am unable to reproduce the issue with my data and would be really interested in getting to the source of the issue.

Thanks, Nick

GBeattie commented 3 years ago

Problem solved! I was making a repex and noticed everything was working fine now, turns out I needed to restart R after uninstalling/reinstalling, my bad. So your fixes yesterday must have done the trick! Thanks very much for your help with this, very useful to have a clear TRA/TRB distinction in the metadata.

ncborcherding commented 3 years ago

Hey Gordon,

Thank you for the follow up, glad things worked out in the end. Please don't hesitate if you have any future issues or suggestions!

Nick

GBeattie commented 3 years ago

Hey! Sorry to open this again but found some strange behaviour in relation to the CTaa TRA_TRB distinction. I read in some data where I'm only interested in TRBs, so I filtered the cellranger output for TRB only. It looks like in cases where there are two TRBs for one cell one of the TRBs get's put in the TRA side of CTaa. I've used filterMulti = T, also the order doesn't seem to depend on the umi count. Example below (second image is excerpt from original input).

image

image

GBeattie commented 3 years ago

I just double checked this cell after running it without removing the TRA (in case this was causing the mixup), however the issue is still there, i.e. even though there's two of each chain for this cell, the CTaa format is TRB_TRB.

Would it be possible to create explicit columns in the metadata for TRBs and TRAs during combineExpression? The order doesn't matter too much for clonality/expansion metrics using both chains, but if I need to use one chain, or do clustering of particular groups of cdr3bs, the occasional mixing up of the order in CTaa is worrisome.

image

image

ncborcherding commented 3 years ago

Hey Gordon,

Sorry on mobile -

you are completely correct. I did not anticipate a user filtering for a single chain before the use of combineTCR() thus the issue. I will change this in the new version of the package - my thought would be to offer the filtering in the combineTCR() function? Or do you think I should make combineTCR() accessible to user-filtered data?

I’m on vacation this week with some spotty cell phone coverage. I will get in touch when I’m back and can troubleshoot the problem more (and commit some changes to the dev branch).

sorry for the delay and thanks as always for pointing out issues (it is much appreciated!!).

Nick

GBeattie commented 3 years ago

Hey Nick,

Thanks for the response and no rush! Although my main concern is that occasionally the CTaa format is TRB_TRB even when TRA are supplied in the case where there's more than one TRB for a cell (at least in a few of the example cells I spotted).

The only reason I would say the later, i.e. make combineTCR work whether only TRB (or only TRA) are supplied, is that I was recently processing some publicly available data that only provided the TRB, I formatted the data into a cellranger vdj-like input and ran it through scRepertoire. For that particular instance an inbuilt filtering system wouldn't have been useful, but a very special case so I wouldn't mind either way!

ncborcherding commented 3 years ago

Hey Gordon,

I was struggling with trying to replicate the problem - to me, the using filterMulti should have solved the problem as it is taking the single top expressing chain. However a closer look, I found that it is only doing this for cells with > 2 contigs and not >= 2 clonotypes, which means sporadically, if cells had either just 2 TRA or 2 TRB, it was not undergoing the filterMulti step. I have updated line 87 below - would you mind reinstalling the dev version & restarting R to see if this solves the problem?

Screen Shot 2021-06-13 at 11 03 53 AM

Thanks, Nick

GBeattie commented 3 years ago

Hey Nick,

Thanks for that, I'm on holiday most of this week but I'll test this out and get back to you when I have the chance!

ncborcherding commented 3 years ago

Enjoy the holiday - I will leave this issue open until I get a confirmation from you. No rush whatsoever.

GBeattie commented 3 years ago

Hey Nick,

Re-installed the dev version and tried on some new data, seems to be working fine, I'll remain vigilant though, feel free to close this. Thanks again!

ncborcherding commented 3 years ago

Hey Gordon,

Thanks so much for the follow up and vigilance. I will also keep an eye out to see if there are any issues.

Nick