Dealing with alleles and multiple gene calls

grst commented 1 month ago

I want to report some kind of bug/issue that I recognized during my work in Scirpy and has to do with scirpy.tl.define_clonotype cluster (but likely also with scirpy.tl.define_clonotype, although this was not tested). Scirpy considers any string inside v_call (same problem with j_call) as a unique V-gene assignment, and this is perfectly fine for working with Cell Ranger annotation. However, if we are working with re-annotated data, which is done with IgBlastn or IMGT/Highv-quest this is not true any more. First, the annotation contains alleles, which are depicted like this IGHV3-33*001. The problem is that if another cell would have IGHV3-33*002 these two cells would always be separated by Scirpy (if v_gene = True), because Scirpy thinks that those are totally different genes, although they only differ in their alleles, even if everything else even the junction sequence can be quite similar. Another issue with using IgBlastn or IMGT/Highv-quest re-annotation is that they often leave multiple possible gene assignments into the v_call column if they are all similar likely. What I did so far with my datasets was that I manually manipulated the v_call and j_call column prior to loading the datatset into an AnnData object so that they only contain the first v/j_call and without the allele information.

My idea here would be to adapt the clonotype cluster function so that it automatically ignores multiple v_call's/j_call's i.e. only considers the first one and also ignores the allele information for clustering, but doesn't manipulate the call itself. Immcantation has a own parameter on how to work with multiple calls for a gene (see "parameter first= FALSE": https://scoper.readthedocs.io/en/stable/topics/hierarchicalClones/). Actually I encountered this problem already some time ago and discussed it with @felixpetschko but eventually we both forgot about it until now. Either way, I think it's good if @grst can also have a look on this problem and help with a solution, because if I remeber correctly it's not that trivial to "fix" this. Maybe there is some elegant workaround available?

Originally posted by @MKanetscheider in https://github.com/scverse/scirpy/issues/542#issuecomment-2309825187

grst commented 1 month ago

@MKanetscheider, I made this a separate issue.

grst commented 1 month ago

@zktuong, would be curious if and how you have dealt with this before?

From a technical perspective I could see two solutions: 1) Simple workaround: Provide a "preprocessing" function that maniuplates the v/j_call string to strip away the allele information and only keep the first gene (what @MKanetscheider has been doing manually so far) 2) The define_clonotype_cluster function already uses a distance matrix to deterime distance between genes. For now, that's a boring identity matrix. This matrix could be replaced with something like

IGHV3-33*002	IGHV3-33*001	IGHV27	...
IGHV3-33*002	1	1	0	0
IGHV3-33*001	1	1	0	0
IGHV27	0	0	1	0
...	0	0	0	1

zktuong commented 1 month ago

yea i just only do:

https://github.com/zktuong/dandelion/blob/7a86c211da0a3eab6863921d2dfbbd562feeeaa9/dandelion/tools/_tools.py#L1660C13-L1660C79 V = [re.sub(STRIPALLELENUM, "", str(v)) for v in input_vdj[VCALL]]

but i don't select the first one:

https://github.com/zktuong/dandelion/blob/7a86c211da0a3eab6863921d2dfbbd562feeeaa9/dandelion/tools/_tools.py#L1669C4-L1669C55 V = [",".join(list(set(v.split(",")))) for v in V]

can be easily changed to just v.split(",")[0] in my case

grst commented 1 month ago

I think the matrix solution would be quite elegant, as it would also allow us to deal with the case of multiple genes, e.g. to ensure IGHx,IGHy matches both IGHx and IGHy we could have

	IGHx	IGHy	IGHx,IGHy
IGHx	1	0	1
IGHy	0	1	1
IGHx,IGHy	1	1	1

Basically all we'd need is a function that builds such a matrix from the gene names in plug it into the clonotype definition module.

@felixpetschko, do you think you could look into that?

CC @FFinotello

felixpetschko commented 3 weeks ago

Hi @grst, I just took a look at my improved implementation of the v/j-gene matching and I think the matrix solution you suggested should work fine. If this implementation is not that urgent I would actually tackle this problem in a few weeks when I completed my thesis if that's fine.

grst commented 3 weeks ago

sounds good, thanks!

scverse / scirpy

Dealing with alleles and multiple gene calls #561