using Ibex when dataset was not processed with scRepertoire

cstrlln commented 1 year ago

Hello, this is a great idea. I've been looking to try something like this, and it looked like the T cell people were further along.

Have a couple of questions:

1. I already have an SCE object with integrated VDJ data: I have the aa sequences, V gene identity, and of course the barcodes, and I have subset it to heavy chain only. Is there a way to use this without starting again with scRepertoire? From getBCR, it looks like I could rename the columns in my colData for the aa sequences and V genes to match the ones used by scRepertoire. Would this work? What else would I need to change or check to go directly to runIbex with my SCE?
2. Which dataset was used for the training? Sorry, I might have missed it, and I'm not very familiar with machine learning yet.

Finally, a suggestion: pulling V genes from the GEX data with grep alone is not ideal, as there are a lot of pseudogenes. I would suggest using chromosomal location or the biotype assigned by cellranger, something like: ig_list <- c("IG_C_gene", "IG_C_pseudogene", "IG_D_gene", "IG_D_pseudogene", "IG_J_gene", "IG_LV_gene", "IG_pseudogene", "IG_V_gene", "IG_V_pseudogene")

You can then query biomart for genes that have one of those biotypes and are also in your dataset.
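To make the suggestion concrete, here is a minimal sketch in base R of the biotype filter. The annotation table and the dataset_genes vector are toy placeholders, not real query output: in practice the table would come from a biomart query or the cellranger GTF, and the column names gene_name and gene_biotype are assumptions. The helper ig_in_dataset() is hypothetical as well.

```r
# Biotypes from the list above.
ig_list <- c("IG_C_gene", "IG_C_pseudogene", "IG_D_gene", "IG_D_pseudogene",
             "IG_J_gene", "IG_LV_gene", "IG_pseudogene", "IG_V_gene",
             "IG_V_pseudogene")

# Toy annotation table (placeholder values, assumed column names).
annotation <- data.frame(
  gene_name    = c("IGHV3-23", "IGHV3-22", "IGKC", "CD79A", "IGHMBP2"),
  gene_biotype = c("IG_V_gene", "IG_V_pseudogene", "IG_C_gene",
                   "protein_coding", "protein_coding"),
  stringsAsFactors = FALSE
)

# Toy stand-in for the gene names present in the expression matrix.
dataset_genes <- c("IGHV3-23", "IGKC", "CD79A", "IGHMBP2")

# Hypothetical helper: keep genes whose biotype is in `biotypes`
# and that are also present in the dataset.
ig_in_dataset <- function(annotation, dataset_genes, biotypes) {
  keep <- annotation$gene_biotype %in% biotypes &
    annotation$gene_name %in% dataset_genes
  annotation$gene_name[keep]
}

ig_genes <- ig_in_dataset(annotation, dataset_genes, ig_list)
print(ig_genes)
```

In this toy example a grep for names starting with "IGH" would also pick up IGHMBP2, which is not an immunoglobulin gene, whereas the biotype filter leaves it out. Dropping the pseudogene entries from ig_list would restrict the result to functional segments.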

Carlos

ncborcherding commented 1 year ago

Hey Carlos,

Thanks for reaching out - you're right that the single-cell RNA/BCR space is a little sparse - but you should check out Benisse too.

In terms of questions:

1. Yes, you should be able to modify the metadata by matching the column names to those used by the internal getBCR() function. Overall, the function relies on two columns: 1) CTaa, which is the CDR3 amino acid sequence of the BCR, formatted as "Heavy_Light", and 2) CTgene, which is the gene segments used for both chains, formatted like "HV.HD.HJ.HC_LV.LJ.LC". This should be enough to get the pipeline working.
2. Great question - this is being clarified in the resubmission of the paper, which includes a comprehensive list of cohorts. The models were trained on all public single-cell BCR sequences deposited in the Gene Expression Omnibus that were available before November 2022. I am also updating the models with additional sources for future versions.
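For the column formats described in the first answer (CTaa and CTgene), a minimal sketch in base R of building both columns from heavy-chain-only metadata may help. The input column names (cdr3_aa, v_gene, d_gene, j_gene, c_gene), the toy values, and the helper format_for_ibex() are all hypothetical. Filling the missing light chain with the string "NA" is an assumption about how an absent chain should be encoded, so it is worth checking against what getBCR() expects.

```r
# Toy per-cell metadata standing in for colData(sce); column names are assumed.
meta <- data.frame(
  barcode = c("AAACCTGAGAAACCAT-1", "AAACCTGAGAAACCGC-1"),
  cdr3_aa = c("CARDYW", "CAKGGW"),
  v_gene  = c("IGHV3-23", "IGHV1-69"),
  d_gene  = c("IGHD3-10", NA),
  j_gene  = c("IGHJ4", "IGHJ6"),
  c_gene  = c("IGHM", "IGHG1"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: add CTaa ("Heavy_Light") and CTgene
# ("HV.HD.HJ.HC_LV.LJ.LC") columns for heavy-chain-only data.
format_for_ibex <- function(meta) {
  # Assumption: the absent light chain is written as the literal string "NA".
  meta$CTaa <- paste(meta$cdr3_aa, "NA", sep = "_")
  # paste() turns a missing segment (e.g. an unassigned D gene) into "NA".
  heavy <- paste(meta$v_gene, meta$d_gene, meta$j_gene, meta$c_gene, sep = ".")
  meta$CTgene <- paste(heavy, "NA", sep = "_")
  meta
}

meta <- format_for_ibex(meta)
print(meta[, c("barcode", "CTaa", "CTgene")])
```

With a SingleCellExperiment, the two new vectors could be assigned back with colData(sce)$CTaa and colData(sce)$CTgene before calling runIbex, provided the rows are in the same order as the cell barcodes.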

Great suggestion - I will add that to the list of changes to implement for the new release!!!

Hope that answers your questions and please let me know if you have any other questions/suggestions as you start using the package.

Nick

cstrlln commented 1 year ago

Thanks Nick, I'll try getting my data to work with Ibex.

Another related question: what is the role of the V genes here, how are they used? Are they just kept for referencing? I gather the calculations are based on the aa properties.

Carlos

ncborcherding / Ibex

using Ibex when dataset was not processed with scRepertoire #2