Some distances silently not computed (for identical entries)

andreas-wilm commented 2 years ago

Hi there,

I noticed that under certain cirumstances some distances are silently not computed. I assume this happens for identical entries (what exactly is identical here? sum(df.duplicated()) returns 0 in the example below). While it makes sense to save compute time, there should be at least a warning. It's only logical to assume that the output matrix has the same number of rows as the input data frame.

The following is taken straight from the docs:

import pandas as pd
from tcrdist.repertoire import TCRrep

df = pd.read_csv("/Users/andreas/Downloads/dash.csv")
tr = TCRrep(cell_df = df, 
            organism = 'mouse', 
            chains = ['alpha','beta'], 
            db_file = 'alphabeta_gammadelta_db.tsv')

And a

assert tr.pw_alpha.shape[0] == df.shape[0]

fails (1920 != 1924).

Best, Andreas

kmayerb commented 2 years ago

Andreas, thanks for this feedback. I agree we can add more warnings around this point.

However, note the correspondence is between the TCRrep.clone_df not the input passed to the cell_df

assert tr.pw_alpha.shape[0] == tr.clone_df.shape[0]

This occurs, because there are often fewer clones in the clone_df than the cell_df if any of data fields are missing or there are duplicates. If you want to retain the all the rows add a column to the input 'rank' corresponding to the indices. Note the order of the TCRrep.clone_df will not match the input unless deduplicate == False.

Before proceeding, it is also helpful to understand that each TCRrep instance contains two Pandas DataFrames: (i) the cell_df, which is provided by the user at initialization, and (ii) the clone_df, which is generated by the program immediately thereafter. The cell_df contains the data specified by the user, which is then augmented with columns containing IMGT aligned CDR1, CDR2, and CDR2.5 inferred from the V-gene name. The clone_df is a derivative Pandas DataFrame generated by deduplicating identical rows in the cell_df. That is, the rows of the cell_df with identical values are grouped together and the count column is updated to reflect the aggregation of multiple rows. Also, it is helpful to know that the order of the rows in the clone_df will not match the order in cell_df. (Although not recommended for new users of tcrdist3, users who pre-check their data to ensure no missing values and no unrecognized V-gene names, may use the deduplicate = False option which will allow the cell_df row order to be directly transferred to the clone_df without any row removal.)

cell_df contains the input TCR data. Only the relevant columns should be passed in the DataFrame to the cell_df argument. This is critical because a NaN (missing value) in any column will result in the corresponding row being removed from the analysis.
If the user wishes to retain clones identical at the amino acid level but with distinct CDR3 nucleotide junctions, the nucleotide sequence or another unique-valued column should be provided in the DataFrame passed to the cell_df argument.
Finally, remember that any row of cell_df with an unrecognized V gene name will be removed from the final clone_df.

_It is possible to see those lines of cell_df not integrated into clone_df by calling tr.show_incomplete() after initialization.** (Note: Advanced users who wish to add new genes not currently in the tcrdist3 library can do so by modifying the content of the ‘alphabeta_gammadelta_db.tsv ‘ file in the package source code (python3.8/site-packages/tcrdist/db/alphabeta_gammadeltadb.tsv))

andreas-wilm commented 2 years ago

Excellent and exhaustive answer. Thanks so much!

One add-on question: is there a way to relate the entries of clone_df and cell_df, i.e. to know which cell belongs to which clone?

Many thanks!

kmayerb commented 2 years ago

Thanks for the questions. The best way to relate the clone_df to cell_df one-to-one is to include a 'rank' column providing a unique number to each clone in the input. This way the deduplicate step will not agglomerate any rows as each will be unique. (Some rows may be dropped if no valid v-gene is provided). Then one can you an inner join on the 'rank' column.

You can also re sort the clone_df by the rank column before running tr.compute_distances()

Everything is based on the row order in the tr.clone_df at the time you compute distances. You can compute after you sort.

df = pd.read_csv("/Users/andreas/Downloads/dash.csv")
df['rank'] = df.index.to_list()
tr = TCRrep(cell_df = df, 
            organism = 'mouse', 
            chains = ['alpha','beta'], 
            db_file = 'alphabeta_gammadelta_db.tsv', 
            compute_distances = False)
tr.clone_df.sort_values('rank')
tr.compute_distances()

Also, you can always use deduplicate = False is you've pre-checked your input so that there are no NAs or invalid v-genes, but given that tcrdist3 still is missing CDRs for some pseudogenes this will fail unless you are careful.

df = pd.read_csv("/Users/andreas/Downloads/dash.csv")
df['rank'] = df.index.to_list()
tr = TCRrep(cell_df = df, 
            organism = 'mouse', 
            chains = ['alpha','beta'], 
            db_file = 'alphabeta_gammadelta_db.tsv', 
            compute_distances = False, 
            deduplicate = False)
tr.compute_distances()

andreas-wilm commented 2 years ago

Awesome. Thank you!

kmayerb / tcrdist3

Some distances silently not computed (for identical entries) #59