I recently ran into a UserWarning when trying to run tcrdist on several gene names in my dataset:
tcrdist/repertoire.py:504: UserWarning: TRAV16D-DV11*01 gene was not recognized in reference db no cdr seq could be inferred
This happens because tcrdist expects a / delimiter in certain gene names (e.g. TRAV13-4/DV7*01), whereas my dataset uses a - character in that position (e.g. TRAV13-4-DV7*01).
Is there an easy way to modify tcrdist to support both forms? For now, I work around the issue by mapping all my gene IDs to the tcrdist (/) version as follows:
from pathlib import Path

import pandas as pd

# Load the reference db file that the tcrdist `TCRep` constructor uses
tcr_db_path = Path("~/path/to/tcrdist").expanduser() / "db" / "alphabeta_gammadelta_db.tsv"
tcr_db = pd.read_table(tcr_db_path)
# Build a lookup from the all-dashes form of each reference ID back to the original `tcrdist` ID
# (replacing `/` with `-` yields the naming used in my dataset)
original_id_list = tcr_db["id"]
gene_id_list = original_id_list.str.replace("/", "-", regex=False)
gene_id_dict = dict(zip(gene_id_list, original_id_list))
# Map dataset gene IDs to the `tcrdist` names; IDs without a match are left unchanged
dff["v_a_gene"] = dff["v_a_gene"].apply(lambda x: gene_id_dict.get(x, x))
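For anyone who wants to try the idea without the tcrdist db file at hand, here is a minimal, self-contained sketch of the same mapping on a small in-memory list. The helper names `build_dash_to_slash_map` and `to_reference_name` are hypothetical (they are not part of tcrdist), and the reference IDs are illustrative stand-ins for the `id` column of the db file. The sketch also checks for collisions, in case two reference IDs collapse to the same all-dashes name, which the workaround above would silently overwrite.

```python
def build_dash_to_slash_map(reference_ids):
    """Map the all-dashes form of each reference ID back to the original ID.

    Only IDs that contain a "/" need an entry. Raises ValueError if two
    different reference IDs collapse to the same all-dashes form or if a
    dashed form is itself a distinct reference ID, since the mapping would
    then be ambiguous.
    """
    reference_set = set(reference_ids)
    mapping = {}
    for ref_id in reference_ids:
        if "/" not in ref_id:
            continue
        dashed = ref_id.replace("/", "-")
        if dashed in mapping and mapping[dashed] != ref_id:
            raise ValueError(
                f"{mapping[dashed]!r} and {ref_id!r} both map to {dashed!r}"
            )
        if dashed in reference_set:
            raise ValueError(f"{dashed!r} is already a reference ID")
        mapping[dashed] = ref_id
    return mapping


def to_reference_name(gene_name, mapping):
    """Return the reference spelling of gene_name, or gene_name unchanged."""
    return mapping.get(gene_name, gene_name)


# Illustrative stand-ins for the `id` column of the reference db (assumption).
reference_ids = ["TRAV13-4/DV7*01", "TRAV16D/DV11*01", "TRAV1*01"]
mapping = build_dash_to_slash_map(reference_ids)

dataset_genes = ["TRAV13-4-DV7*01", "TRAV16D-DV11*01", "TRAV1*01", "UNKNOWN*01"]
normalized = [to_reference_name(g, mapping) for g in dataset_genes]
print(normalized)
```

With a DataFrame, the same lookup could be applied per column, e.g. `dff["v_a_gene"].map(lambda g: to_reference_name(g, mapping))`, and repeated for the other gene columns in the dataset.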