kmayerb / tcrdist3

flexible CDR based distance metrics
MIT License

Gene names should allow alternate delimiters #106

Open kamurani opened 3 months ago

kamurani commented 3 months ago

I recently ran into a UserWarning when trying to run tcrdist on several gene names in my dataset:

tcrdist/repertoire.py:504: UserWarning: TRAV16D-DV11*01 gene was not recognized in reference db no cdr seq could be inferred

This happens because tcrdist expects a / delimiter in certain gene names (e.g. TRAV13-4/DV7*01), whereas my dataset uses a - in the same position (e.g. TRAV13-4-DV7*01).

Is there an easy way to modify tcrdist to support either one? At the moment, I work around the issue by mapping all my gene IDs to the /-delimited names that tcrdist expects, as follows:

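from pathlib import Path

import pandas as pd

# Load the db file that tcrdist uses in the `TCRep` constructor
# (expanduser() is needed because Path does not expand `~` on its own)
tcr_db_path = Path("~/path/to/tcrdist").expanduser() / "db" / "alphabeta_gammadelta_db.tsv"
tcr_db = pd.read_table(tcr_db_path)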

# Build a lookup from the all-dash spelling of each reference ID to its original spelling.
# Replacing `/` with `-` is trivial; the reverse lookup then recovers the name `tcrdist` expects.
original_id_list = tcr_db.id
gene_id_list = original_id_list.apply(lambda x: x.replace("/", "-"))
gene_id_dict = dict(zip(gene_id_list, original_id_list))

# Use dict to replace gene IDs in dataset 
dff["v_a_gene"] = dff["v_a_gene"].apply(lambda x: gene_id_dict.get(x, x))
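
The same workaround can be wrapped in a small helper that also reports which names still fail to match after remapping. This is only a sketch and not part of tcrdist: `build_delimiter_map` and `normalize_gene_names` are hypothetical names, and the reference IDs below are a toy stand-in for the `id` column of the db file. It assumes the only difference between the two naming schemes is `/` versus `-`.

```python
import pandas as pd


def build_delimiter_map(reference_ids):
    """Map the all-dash spelling of each reference ID to its original spelling."""
    mapping = {}
    for ref_id in reference_ids:
        dashed = ref_id.replace("/", "-")
        # Refuse to guess if two different reference IDs collapse to one dashed name
        if dashed in mapping and mapping[dashed] != ref_id:
            raise ValueError(f"Ambiguous dashed name: {dashed}")
        mapping[dashed] = ref_id
    return mapping


def normalize_gene_names(names, reference_ids):
    """Return (remapped names, sorted list of names still absent from the reference)."""
    mapping = build_delimiter_map(reference_ids)
    # Names without an entry in the mapping are passed through unchanged
    normalized = names.map(lambda name: mapping.get(name, name))
    unmatched = sorted(set(normalized.dropna()) - set(reference_ids))
    return normalized, unmatched


# Toy reference IDs (illustrative only; in practice use the db file's `id` column)
reference_ids = ["TRAV13-4/DV7*01", "TRAV16D/DV11*01", "TRAV1*01"]
genes = pd.Series(["TRAV13-4-DV7*01", "TRAV16D-DV11*01", "TRAV1*01", "TRAVX*01"])

normalized, unmatched = normalize_gene_names(genes, reference_ids)
print(normalized.tolist())
print(unmatched)
```

Anything left in `unmatched` differs from the reference by more than the delimiter, so it would still trigger the UserWarning and needs a closer look.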