kmayerb / tcrdist3

flexible CDR based distance metrics
MIT License

Wrong/redundant entries in db file? #83

Open nh3 opened 1 year ago

nh3 commented 1 year ago

Hello,

There seem to be wrong/redundant entries in alphabeta_gammadelta_db.tsv that place TRAV genes under the "B" chain, e.g. https://github.com/kmayerb/tcrdist3/blob/master/tcrdist/db/alphabeta_gammadelta_db.tsv#L1053. Is this expected?

andreas-wilm commented 4 days ago

That is wrong. For human, it's luckily limited to duplicated alpha chains wrongly classified as beta, so at least it's easy to tell which entries are wrong.

In [1]: import pandas as pd
In [2]: df = pd.read_csv("miniforge3/envs/tcrdist3-0.2.2/lib/python3.10/site-packages/tcrdist/db/alphabeta_gammadelta_db.tsv", sep="\t")
In [3]: df = df[df['organism'] == 'human']
In [4]: m = df['id'].duplicated(keep=False)
In [5]: sum(m)
Out[5]: 206
In [6]: df[m]['chain'].value_counts()
Out[6]:
chain
A    103
B    103
Name: count, dtype: int64
In [7]: sum(df[m]['id'].str.startswith('TRAV'))
Out[7]: 206
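Given that pattern, one possible workaround is to drop the rows where a TRA gene is labelled as chain "B". The following is only a minimal sketch on a made-up toy table: the function name `drop_alpha_labelled_beta` is hypothetical and not part of tcrdist3, and the rule covers only the alpha-as-beta case seen above, not other possible mismatches.

```python
import pandas as pd


def drop_alpha_labelled_beta(df: pd.DataFrame) -> pd.DataFrame:
    """Remove rows whose id is a TRA gene but whose chain is 'B'.

    Hypothetical helper, not a tcrdist3 function. It assumes the table
    has the 'id' and 'chain' columns used in the session above.
    """
    bad = df["id"].str.startswith("TRA") & (df["chain"] == "B")
    return df[~bad].reset_index(drop=True)


# Toy stand-in for the db file (made-up rows, same column names).
toy = pd.DataFrame(
    {
        "organism": ["human", "human", "human", "human"],
        "id": ["TRAV1-1*01", "TRAV1-1*01", "TRBV5-1*01", "TRAV2*01"],
        "chain": ["A", "B", "B", "A"],
    }
)

cleaned = drop_alpha_labelled_beta(toy)
print(cleaned)
```

On the real file, the same function could be applied to the `df` loaded with `pd.read_csv(..., sep="\t")`, after which `df['id'].duplicated()` should no longer flag the TRAV entries if the duplicates are all of this kind.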

kmayerb commented 3 days ago

With the latest version, we've changed the reference DB file.

The default is now combo_xcr_2024-03-05.tsv. You can check which file is in use like this:

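from tcrdist.repertoire import TCRrep
import pandas as pd

# Minimal single-clone input, just enough to construct a TCRrep
data = pd.DataFrame({'v_b_gene': ['TRBV5-1*01'], 'cdr3_b_aa': ['CASSSSSF']})
tr = TCRrep(cell_df=data, organism="human", chains=['beta'])

print(tr.db_file)                    # reference DB file in use
print(tr.all_genes.keys())           # top-level keys of the loaded gene reference
print(tr.all_genes['human'].keys())  # keys under 'human'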