kmayerb / tcrdist3

flexible CDR based distance metrics
MIT License

Several (more?) missing TCR chains #81

Open KJMaroney opened 1 year ago

KJMaroney commented 1 year ago

Hello, I feel like this should be a "simple" fix. When I read in my TCR α/β chains, formatted as tcrdist expects, several easily interpretable warnings are raised in this format:

f0 = lambda v : self._map_gene_to_reference_seq2(gene = v,
\AppData\Local\Programs\Python\PYTHON~1\lib\site-packages\tcrdist\repertoire.py:500: UserWarning: TRBV5-2*01 gene was not recognized in reference db no cdr seq could be inferred

I have replaced all of the names like "TRAVDV14-etc." with "TRAV/DV14-etc.", as per your db document. However, TRBV5-2*01 and several other families are not included. Here are the ones in my dataset that are not in your database (possibly because these are HLA-E-restricted TCRs, so maybe interesting anyway?):

TRAV28*01 TRBV22-1*01 TRBV5-2*01
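For anyone who wants to run the same check on their own data, here is a minimal sketch of how the renaming and the "not in the database" comparison could be done before loading anything into tcrdist. It is not part of the tcrdist3 API: `normalize_gene_name`, `find_unrecognized`, and the tiny `reference_ids` set are hypothetical names and stand-in data made up for illustration, and the real list of reference gene IDs would have to come from the package's db file.

```python
import re

def normalize_gene_name(name: str) -> str:
    """Insert the missing '/' before 'DV' in dual TRAV/DV gene names."""
    name = name.strip()
    # Only touch TRAV names, so delta genes such as TRDV1 stay unchanged.
    if name.startswith("TRAV"):
        # Add '/' before 'DV' unless one is already there.
        name = re.sub(r"(?<!/)DV", "/DV", name, count=1)
    return name

def find_unrecognized(genes, reference_ids):
    """Return the sorted, normalized gene names that are absent from reference_ids."""
    return sorted({normalize_gene_name(g) for g in genes} - set(reference_ids))

# Hypothetical stand-in for the gene IDs listed in the reference database.
reference_ids = {"TRAV14/DV4*01", "TRBV5-1*01"}

# Example V-gene calls as they might appear in an input table.
genes = ["TRAV14DV4*01", "TRBV5-1*01", "TRBV5-2*01"]

missing = find_unrecognized(genes, reference_ids)
print(missing)
```

Running a check like this up front gives the full list of unmatched names in one pass, rather than collecting them from individual warnings.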

I'm not sure what the strategy would be to add them: whether it's as simple as adding the family and CDR3, pulling them from somewhere, etc. I would appreciate it if there were a way for you to add these, because when generating the gene_pairing graph, one of the top pairs is "other" (attached), and I imagine this will affect the "correct" determination of the most common shared metaclonotypes, etc., in my non-10X dataset. Thank you!

E01_gene_usage_plot