Open racng opened 7 months ago
Hi Rachel,
thanks for bringing this up. This was meant as a workaround to make it work with ir_dist
(which is currently hardcoded to work on the junction sequence).
A proper solution is probably to
For now, as a workaround, I would suggest to store CDR3 sequences in junction_aa
in both query and database object.
You can trim the string with awkward array functions:
import awkward as ak
mdata["airr"].obsm["airr"]["junction_aa"] = ak.str.slice(mdata["airr"].obsm["airr"]["junction_aa"], 1, -1)
@zktuong, just wanted to ask you to be sure
junction_aa
-> cdr3
should be trivial, right? Just clip off the first and last amino acid. Any caveats?cdr3
-> junction_aa
: Is it correct that the first AA is always C
and the last is always W/F
? Can this be robustly inferred from the V call? junction_aa -> cdr3 should be trivial, right? Just clip off the first and last amino acid. Any caveats?
yes should be correct.
cdr3 -> junction_aa: Is it correct that the first AA is always C and the last is always W/F? Can this be robustly inferred from the V call?
yes first is always C and the last should always be either F/W. Should be able to infer the F/W from the second codon from start of the J call (TTT/TTC for F or TGG for W).
Hi @racng,
I believe I fixed this in #476, by extracting the junction_aa
sequence from the full protein sequence based on the start/end coordinates provieded in IEDB. Would be great if you could give this a try!
You can install that version using
pip install git+https://github.com/scverse/scirpy@issue-469
Please make sure to remove the cached version of iedb before rerunning ir.datasets.iedb
by deleting the data
directory that was created.
@grst Thank you for working on a fix for this issue! I have copied the new function into a script and tested it. For some reason I am getting a lot of NaN values for both junction_aa and cdr3_aa.
adata = iedb(cached=True, cache_path='test/iedb.h5ad')
ir.get.airr(adata, 'junction_aa')['VDJ_1_junction_aa'].isna().sum()
# 30080
ir.get.airr(adata, 'cdr3_aa')['VDJ_1_cdr3_aa'].isna().sum()
# 30110
# Old copy of iedb reference
ir.get.airr(adata_old, 'junction_aa')['VDJ_1_junction_aa'].isna().sum()
# 118
Bad news! I took another look and it seems that indeed the Start/End position and and Protein sequence are only available for ~5000 receptors:
>>> iedb_df["Chain 1 Protein Sequence"].dropna().size
5083
For the rest, there is "CDR3 Curated", but not "CDR3 Calculated" available. Taking a closer look at "CDR3 Curated", it seems that some (but not all) sequences there are actually junction sequences, including the C
and W/F
.
So I'm afraid this would take quite some cleanup to get it right!
On the scirpy side, I could consider adding the option to use cdr3_aa
sequences, instead of junction_aa
, for sequence comparisons. Then the CDR3 sequences in the database could remain as they are.
I think I found a solution. We can use the J-motif sequences in J-region-motifs.tsv
generated by IMGTgeneDL run in stitchr mode to determine the terminal residue of the J gene. It "contains automatically inferred CDR3 junction ending motifs and residues (using the process established in the autoDCR TCR assignation tool)".
I am attaching it here: J-region-motifs.txt
@zktuong, what do you say about this one? Is that wrong or the always existing exception to the rule in Biology?
(L and V are "not confident" but C is)
(this is from above J-junction-motifs.txt
)
Looks like they are correct and are treated as exceptions, but probably non-functional
For TRAJ35*01:
(6) noncanonical J-MOTIF: Cys-Gly-X-Gly instead of Phe-Gly-X-Gly. The functionality of TRAJ35*01, initially considered as ORF owing to the presence of Cys (C118) instead of Phe (F118) in the J-MOTIF, has been changed to F (functional) as TRAJ35*01 has been found rearranged and expressed in several specific cDNA clones (comments added by G. Folch and M.-P. Lefranc on the 23/01/2019, following the identification of an additional clone and a question by C. Joos (Germany)).
https://www.imgt.org/IMGTrepertoire/index.php?section=LocusGenes&repertoire=genetable&species=human&group=TRAJ Genecard says it's non-functional https://www.genecards.org/cgi-bin/carddisp.pl?gene=TRAJ35
TRBJ2-2P*01 also looks like an exception and non-functional open reading frame
(1) This sequence has diverged considerably from that of other TRBJ2-2: it has no canonical J-NONAMER [2], J-PHE 118 is replaced by Gly (G) and G 119 by R in J-MOTIF.
https://www.imgt.org/IMGTrepertoire/index.php?section=LocusGenes&repertoire=genetable&species=human&group=TRBJ https://www.genecards.org/cgi-bin/carddisp.pl?gene=TRBJ2-2P
TRBJ2-7*02
(2) J-PHE replaced 118 replaced by Val in J-MOTIF.
i just checked the IgBLAST auxiliary files that indicate where the CDR3 end is and crossed it to the fasta files and it looks like the codons are correct for those 3 genes as well. so should be alright to use the J-junction-motifs.txt
Describe the bug IEDB database provides cdr3_aa sequences instead of junction_aa sequences. scirpy database import code puts this in the slot for junction_aa. The sequence would then be missing the flanking amino acids. This makes it inaccurate to do identity string matching with ir_dist/ir_query.
To Reproduce https://github.com/scverse/scirpy/blob/d862cf35740a79e91f95c49ce6da1fbe280f8c1c/src/scirpy/datasets/__init__.py#L338C1-L371C10
Expected behaviour Is there a way to format cdr3 into junction sequence? like add the missing "C" at the beginning based on V call? Do you have advice on how to do this?