kmayerb / tcrdist3

flexible CDR based distance metrics
MIT License

ValueError: zero-size array to reduction operation maximum which has no identity #57

Closed Lihua1990 closed 3 years ago

Lihua1990 commented 3 years ago

Hi, I am using tcrdist3 and encountered the error message in the title of this issue.

Here's my dataframe:

df.head()

| subject | count | v_d_gene | d_d_gene | j_d_gene | cdr3_d_aa | cdr3_d_nucseq |
| --- | --- | --- | --- | --- | --- | --- |
| CB2**afs2 | 369 | TRDV2 | . | TRDJ1 | CACDTGGYTDKLIF | TGTGCCTGTGACACTGGGGGATACACCGATAAACTCATCTTT |
| CB2**afs2 | 335 | TRDV2 | . | TRDJ1 | CACDTGGYTDKLIF | TGTGCCTGTGACACCGGGGGATACACCGATAAACTCATCTTT |
| CB2**afs2 | 214 | TRDV2 | . | TRDJ3 | CACDWGSSWDTRQMFF | TGTGCCTGTGACTGGGGGAGCTCCTGGGACACCCGACAGATGTTTTTC |
| CB2**afs2 | 214 | TRDV2 | TRDD3 | TRDJ1 | CACDILGDTDKLIF | TGTGCCTGTGACATACTGGGGGACACCGATAAACTCATCTTT |
| CB2**afs2 | 200 | TRDV2 | . | TRDJ3 | CACDTWGSSWDTRQMFF | TGTGCCTGTGACACCTGGGGGAGCTCCTGGGACACCCGACAGATGT... |

Running the following raises an error:

import pandas as pd
from tcrdist.repertoire import TCRrep

tr = TCRrep(cell_df = df, 
            organism = 'human', 
            chains = ['delta'], 
            db_file = 'alphabeta_gammadelta_db.tsv')

ValueError: zero-size array to reduction operation maximum which has no identity.

What might be the problem and what should I check?

Thank you in advance!

Lihua

kmayerb commented 3 years ago

Lihua,

This error is due to the input not being correctly formatted, so there are no distances to compute. V and J gene names must include an allele number (for example, TRDV2*01) so that we can infer CDR1, CDR2, and CDR2.5 (aka pMHC). This can easily be fixed in your input df:

df['v_d_gene'] = df['v_d_gene'].apply(lambda x: f"{x}*01")
df['j_d_gene'] = df['j_d_gene'].apply(lambda x: f"{x}*01")
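
If some gene names might already carry an allele, appending `*01` unconditionally would produce names like `TRDV2*01*01`. Below is a minimal sketch that adds the suffix only when it is missing. The helper name `add_default_allele` and the toy `example_df` are hypothetical and not part of tcrdist3, and using `*01` as the default allele is an assumption carried over from the two lines above.

```python
import pandas as pd

def add_default_allele(gene, allele="01"):
    """Append '*<allele>' to a gene name unless it already has an allele.

    Hypothetical helper for illustration; not a tcrdist3 function.
    """
    # Leave missing values untouched so they do not become 'nan*01'.
    if pd.isna(gene):
        return gene
    gene = str(gene)
    return gene if "*" in gene else f"{gene}*{allele}"

# Toy data for illustration only.
example_df = pd.DataFrame({
    "v_d_gene": ["TRDV2", "TRDV2*01"],
    "j_d_gene": ["TRDJ1", "TRDJ3"],
})
for col in ["v_d_gene", "j_d_gene"]:
    example_df[col] = example_df[col].apply(add_default_allele)
```

Because the helper is idempotent, rerunning a notebook cell does not stack extra suffixes onto the gene names.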

Also make sure to pass in only the relevant columns, because an NA in any column of the cell df will cause you to lose that row.

tr = TCRrep(cell_df = df[['subject','cdr3_d_aa','v_d_gene','j_d_gene','count']], 
            organism = 'human', 
            chains = ['delta'], 
            db_file = 'alphabeta_gammadelta_db.tsv')
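
For context, the message itself comes from NumPy: it is raised whenever a maximum is taken over an empty array, which fits the "no distances to compute" situation described above. The sketch below reproduces the message with plain NumPy. It is not tcrdist3's internal code, and where exactly tcrdist3 takes that maximum is not shown in this thread.

```python
import numpy as np

# An empty array stands in for "no valid rows left after input checks".
empty = np.array([])

try:
    empty.max()
    message = None
except ValueError as err:
    # NumPy cannot pick a maximum from zero elements.
    message = str(err)

print(message)
```

Seeing this message right after constructing a `TCRrep` is therefore a hint to check whether any rows survived the input formatting requirements.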

Lihua1990 commented 3 years ago

Hi,

Thank you so much for the reply. I have one more question. You mentioned that an NA value in any column of the cell df will cause that row to be lost. In my dataframe, 10% of the rows have NA in the 'd_d_gene' column; the other 90% have a defined value such as 'TRDD3', 'TRDD2', or 'TRDD1'. How do you suggest I handle the 'd_d_gene' column? Should I convert the defined values with df['d_d_gene'] = df['d_d_gene'].apply(lambda x : f"{x}*01")? And is there a way to also keep the rows that have NA in the 'd_d_gene' column?

Thank you so much!

Best, Lihua

kmayerb commented 3 years ago

D genes are not used by tcrdist3, so you should not include that column when you initialize the TCRrep instance.

You can specify only the columns you need here:

tr = TCRrep(cell_df = df[['subject','cdr3_d_aa','v_d_gene','j_d_gene','count']], 
            organism = 'human', 
            chains = ['delta'], 
            db_file = 'alphabeta_gammadelta_db.tsv',)
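
To see how many rows the column choice can cost, here is a small sketch that uses pandas `dropna` as a stand-in for the row-dropping behavior described above. It is not tcrdist3's internal code, and the toy `example_df` (including the subject label `s1`) is invented for illustration.

```python
import pandas as pd

# Toy data: one row has a missing D gene.
example_df = pd.DataFrame({
    "subject": ["s1", "s1", "s1"],
    "count": [3, 2, 1],
    "v_d_gene": ["TRDV2*01"] * 3,
    "d_d_gene": ["TRDD3", None, "TRDD2"],
    "j_d_gene": ["TRDJ1*01", "TRDJ1*01", "TRDJ3*01"],
    "cdr3_d_aa": ["CACDILGDTDKLIF", "CACDTGGYTDKLIF", "CACDWGSSWDTRQMFF"],
})

# Columns passed to TCRrep in the call above.
needed = ["subject", "cdr3_d_aa", "v_d_gene", "j_d_gene", "count"]

# Rows that survive if every column, including d_d_gene, must be non-NA.
rows_all_columns = len(example_df.dropna())
# Rows that survive when only the needed columns are considered.
rows_needed_only = len(example_df[needed].dropna())

print(rows_all_columns, rows_needed_only)
```

Running the same two counts on the real dataframe before constructing `TCRrep` shows how many rows an NA in an unused column would remove.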

Lihua1990 commented 3 years ago

OK, clear now, thanks a lot!