scverse / scirpy

A scanpy extension to analyse single-cell TCR and BCR data.
https://scirpy.scverse.org/en/latest/
BSD 3-Clause "New" or "Revised" License
206 stars 32 forks source link

IEDB database cdr3_aa stored as junction_aa #469

Open racng opened 7 months ago

racng commented 7 months ago

Describe the bug IEDB database provides cdr3_aa sequences instead of junction_aa sequences. scirpy database import code puts this in the slot for junction_aa. The sequence would then be missing the flanking amino acids. This makes it inaccurate to do identity string matching with ir_dist/ir_query.

To Reproduce https://github.com/scverse/scirpy/blob/d862cf35740a79e91f95c49ce6da1fbe280f8c1c/src/scirpy/datasets/__init__.py#L338C1-L371C10

Expected behaviour Is there a way to format cdr3 into junction sequence? like add the missing "C" at the beginning based on V call? Do you have advice on how to do this?

grst commented 7 months ago

Hi Rachel,

thanks for bringing this up. This was meant as a workaround to make it work with ir_dist (which is currently hardcoded to work on the junction sequence).

A proper solution is probably to

For now, as a workaround, I would suggest to store CDR3 sequences in junction_aa in both query and database object. You can trim the string with awkward array functions:

import awkward as ak
mdata["airr"].obsm["airr"]["junction_aa"] = ak.str.slice(mdata["airr"].obsm["airr"]["junction_aa"], 1, -1)
grst commented 7 months ago

@zktuong, just wanted to ask you to be sure

zktuong commented 7 months ago

junction_aa -> cdr3 should be trivial, right? Just clip off the first and last amino acid. Any caveats?

yes should be correct.

cdr3 -> junction_aa: Is it correct that the first AA is always C and the last is always W/F? Can this be robustly inferred from the V call?

yes first is always C and the last should always be either F/W. Should be able to infer the F/W from the second codon from start of the J call (TTT/TTC for F or TGG for W).

grst commented 6 months ago

Hi @racng,

I believe I fixed this in #476, by extracting the junction_aa sequence from the full protein sequence based on the start/end coordinates provieded in IEDB. Would be great if you could give this a try!

You can install that version using

pip install git+https://github.com/scverse/scirpy@issue-469

Please make sure to remove the cached version of iedb before rerunning ir.datasets.iedb by deleting the data directory that was created.

racng commented 6 months ago

@grst Thank you for working on a fix for this issue! I have copied the new function into a script and tested it. For some reason I am getting a lot of NaN values for both junction_aa and cdr3_aa.

adata = iedb(cached=True, cache_path='test/iedb.h5ad')

ir.get.airr(adata, 'junction_aa')['VDJ_1_junction_aa'].isna().sum()
# 30080

ir.get.airr(adata, 'cdr3_aa')['VDJ_1_cdr3_aa'].isna().sum()
# 30110

# Old copy of iedb reference
ir.get.airr(adata_old, 'junction_aa')['VDJ_1_junction_aa'].isna().sum()
# 118
grst commented 6 months ago

Bad news! I took another look and it seems that indeed the Start/End position and and Protein sequence are only available for ~5000 receptors:

>>> iedb_df["Chain 1 Protein Sequence"].dropna().size
5083

For the rest, there is "CDR3 Curated", but not "CDR3 Calculated" available. Taking a closer look at "CDR3 Curated", it seems that some (but not all) sequences there are actually junction sequences, including the C and W/F.

image

So I'm afraid this would take quite some cleanup to get it right!


On the scirpy side, I could consider adding the option to use cdr3_aa sequences, instead of junction_aa, for sequence comparisons. Then the CDR3 sequences in the database could remain as they are.

racng commented 6 months ago

I think I found a solution. We can use the J-motif sequences in J-region-motifs.tsv generated by IMGTgeneDL run in stitchr mode to determine the terminal residue of the J gene. It "contains automatically inferred CDR3 junction ending motifs and residues (using the process established in the autoDCR TCR assignation tool)".

I am attaching it here: J-region-motifs.txt

grst commented 6 months ago

@zktuong, what do you say about this one? Is that wrong or the always existing exception to the rule in Biology? image

(L and V are "not confident" but C is) (this is from above J-junction-motifs.txt)

zktuong commented 6 months ago

Looks like they are correct and are treated as exceptions, but probably non-functional

For TRAJ35*01:

(6) noncanonical J-MOTIF: Cys-Gly-X-Gly instead of Phe-Gly-X-Gly. The functionality of TRAJ35*01, initially considered as ORF owing to the presence of Cys (C118) instead of Phe (F118) in the J-MOTIF, has been changed to F (functional) as TRAJ35*01 has been found rearranged and expressed in several specific cDNA clones (comments added by G. Folch and M.-P. Lefranc on the 23/01/2019, following the identification of an additional clone and a question by C. Joos (Germany)).

https://www.imgt.org/IMGTrepertoire/index.php?section=LocusGenes&repertoire=genetable&species=human&group=TRAJ Genecard says it's non-functional https://www.genecards.org/cgi-bin/carddisp.pl?gene=TRAJ35

TRBJ2-2P*01 also looks like an exception and non-functional open reading frame

(1) This sequence has diverged considerably from that of other TRBJ2-2: it has no canonical J-NONAMER [2], J-PHE 118 is replaced by Gly (G) and G 119 by R in J-MOTIF.

https://www.imgt.org/IMGTrepertoire/index.php?section=LocusGenes&repertoire=genetable&species=human&group=TRBJ https://www.genecards.org/cgi-bin/carddisp.pl?gene=TRBJ2-2P

TRBJ2-7*02

(2) J-PHE replaced 118 replaced by Val in J-MOTIF.
zktuong commented 6 months ago

i just checked the IgBLAST auxiliary files that indicate where the CDR3 end is and crossed it to the fasta files and it looks like the codons are correct for those 3 genes as well. so should be alright to use the J-junction-motifs.txt