kmayerb / tcrdist3

flexible CDR based distance metrics
MIT License

Tcrdist working with sample data but not with my data #94

Closed rutha32 closed 8 months ago

rutha32 commented 8 months ago

tcrdist_alpha_sample.pdf

Hi, tcrdist works fine when I use the sample data (dash.csv), but when I try it with other datasets, I'm getting errors.

These are my columns: Index(['subject', 'epitope', 'count', 'v_a_gene', 'd_call', 'j_a_gene', 'cdr3_a_aa', 'cdr3_a_nucseq', 'junction', 'decombinator_id', 'rev_comp', 'productive', 'sequence_aa', 'cdr1_aa', 'cdr2_aa', 'chain', 'clone_id', 'time'], dtype='object')

This is the error I get: ValueError: zero-size array to reduction operation maximum which has no identity

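My code:

import pandas as pd

# Define the file path
file_path = r'C:\Users\pythonProject\ResearchProject\alpha_TCR_all_sample_100.csv'

# Read the CSV file into a DataFrame
df = pd.read_csv(file_path)

# Display the first few rows of the DataFrame
df.head()

from tcrdist.repertoire import TCRrep

# Assuming you've already loaded your data into the 'df' DataFrame
tr = TCRrep(cell_df=df, organism='human', chains=['alpha'], db_file='alphabeta_gammadelta_db.tsv')

# Calculate pairwise distances for the alpha chain
pw_alpha = tr.pw_alpha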

Thanks

kmayerb commented 8 months ago

The most likely issue is that your V-gene names are not recognized. Do they have allele-level resolution? If not, you can append "*01" for an approximate result. V-gene names must match one of the values in the id column of:

https://github.com/kmayerb/tcrdist3/blob/master/tcrdist/db/alphabeta_gammadelta_db.tsv
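A minimal sketch of that check, using only the Python standard library. It appends "*01" to gene names that lack an allele suffix and lists names that are still absent from the reference ids. The helper names, the two reference ids, and the sample gene names are illustrative assumptions, not part of tcrdist3; `load_reference_ids` assumes the db file is tab-separated with a column named `id`.

```python
import csv

DEFAULT_ALLELE = "*01"  # approximation suggested above, not a true allele call


def add_default_allele(gene):
    """Append '*01' to a gene name that has no allele suffix."""
    gene = gene.strip()
    return gene if "*" in gene else gene + DEFAULT_ALLELE


def load_reference_ids(db_path):
    """Read the 'id' column of a tab-separated reference file (assumed layout)."""
    with open(db_path, newline="") as handle:
        return {row["id"] for row in csv.DictReader(handle, delimiter="\t")}


def unrecognized_genes(genes, reference_ids):
    """Return the sorted, unique gene names that are not in reference_ids."""
    return sorted(set(genes) - set(reference_ids))


# Illustrative subset only; in practice, load the ids from the db file.
reference_ids = {"TRAV1-1*01", "TRAV12-1*01"}

# Hypothetical input values; 'TRAV99' is deliberately not a real gene.
raw_genes = ["TRAV1-1", "TRAV12-1*01", "TRAV99"]

fixed_genes = [add_default_allele(g) for g in raw_genes]
missing = unrecognized_genes(fixed_genes, reference_ids)

print(fixed_genes)
print(missing)
```

With a pandas DataFrame, the same helper could be applied as `df['v_a_gene'] = df['v_a_gene'].map(add_default_allele)`, assuming the column holds no missing values. Any names left in `missing` would need correcting before building the TCRrep.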

Alternatively, you can define cdr1_a_aa, cdr2_a_aa, and pmhc_a_aa yourself instead of relying on tcrdist initialization to infer them; see infer_cdrs = False:

https://github.com/kmayerb/tcrdist3/blob/55d906b19e4c5038f5fdde843eb2edf8293efd88/tcrdist/repertoire.py#L14-L69

Can you provide 10 lines of your input data?


rutha32 commented 8 months ago

Hi, thanks for the reply. I got it working when I added the "*01". I removed some of the columns and kept only the core columns: count, v_a_gene, j_a_gene and cdr3_a_aa.