Unexpected distance values for partially duplicated entries

andreas-wilm commented 2 years ago

Hello again,

I see unexpected distance values in my own data, which I was able to reproduce using the standard TCRdist3 example with partially duplicated entries:

df = pd.read_csv("dash.csv")
df = df.loc[0:3]# reduce to a few example entries
tr = TCRrep(cell_df = df, 
            organism = 'mouse', 
            chains = ['alpha','beta'], 
            db_file = 'alphabeta_gammadelta_db.tsv')
assert tr.pw_alpha.shape[0] == df.shape[0]

Distances are as follows:

In: tr.pw_alpha, tr.pw_beta

Out: (array([[  0, 108, 158],
        [108,   0, 158],
        [158, 158,   0]], dtype=int16),
 array([[  0,  69, 183],
        [ 69,   0, 159],
        [183, 159,   0]], dtype=int16))

So far so good. Now I take one entry, duplicate and change it slightly:

# append the last entry and modify
df = df.append(df.iloc[df.shape[0]-1])
df = df.reset_index(drop=True)
df.loc[df.shape[0]-1, 'cdr3_a_aa'] = df.loc[df.shape[0]-1, 'cdr3_a_aa'][:-1] + "A"

The last two entries now have identical cdr3_b_aa, but slightly different cdr3_a_aa entries.

In: df.iloc[2]['cdr3_b_aa'], df.iloc[3]['cdr3_b_aa']
Out: ('CASTGGGAPLF', 'CASTGGGAPLF')
In: df.iloc[2]['cdr3_a_aa'], df.iloc[3]['cdr3_a_aa']
Out: ('CALGSNTGYQNFYF', 'CALGSNTGYQNFYA')

Now let's recompute the distances of this changed df:

tr = TCRrep(cell_df = df, 
            organism = 'mouse', 
            chains = ['alpha','beta'], 
            db_file = 'alphabeta_gammadelta_db.tsv')
assert tr.pw_alpha.shape[0] == df.shape[0]

And now the distances - or here their index - is off:

In: tr.pw_alpha, tr.pw_beta

Out: (array([[  0,   0, 108, 158],
        [  0,   0, 108, 158],
        [108, 108,   0, 158],
        [158, 158, 158,   0]], dtype=int16),
 array([[  0,   0,  69, 183],
        [  0,   0,  69, 183],
        [ 69,  69,   0, 159],
        [183, 183, 159,   0]], dtype=int16))

Do I misunderstand this?

Many thanks, Andreas

kmayerb commented 2 years ago

Sorry this has caused confusion.

See more details about clone_df vs cell_df in your last issue: https://github.com/kmayerb/tcrdist3/issues/59

Note that distance matrices correspond to TCRrep.clone_df not the raw input.

If your data is perfectly clean with no invalid CDR3 or V-gene names, the option at initialization should help:

deduplicate = False

e.g.,

tr = TCRrep(cell_df = df, 
            organism = 'mouse', 
            chains = ['alpha','beta'], 
            db_file = 'alphabeta_gammadelta_db.tsv', 
            deduplicate = False)

With deduplicate = False:

assert tr.pw_alpha.shape[0] == df.shape[0]
assert tr.pw_alpha.shape[0] == tr.clone_df.shape[0]

The reason we don't make deduplicate = False the default is that if you have any missing values or invalid V-genes in your input, distance computation will fail.

MORE INFO, WHICH I WILL ADD SHORTLY TO THE READ THE DOCS PAGE:

https://tcrdist3.readthedocs.io/en/latest/inputs.html#critical-information

Tcrdist3 only requires 3 input columns for single chain analysis (i.e., for beta chain cdr3_b_aa and v_b_gene, j_b_gene) and 6 columns for paired chain analysis (i.e., cdr3_b_aa, v_b_gene, j_b_gene, cdr3_a_aa, v_a_gene, and j_a_gene). More columns can be included depending on the application.
An optional count column will track the abundance of each clone. If no count column is provided all clones are assigned a count of 1. Additional columns can be included if the user intends to distinguish input rows with donor information or epitope specificity annotations. The names of these columns are not pre-defined (e.g., subject, cell_type, visit, epitope)
By default, tcrdist3 uses all supplied columns provided in the DataFrame passed to the cell_df argument to look for potentially duplicated rows based on the default setting (deduplicate = True). The initialization of a TCRrep instance automatically aggregates counts over duplicated rows.
This has practical consequences. For instance, if no subject column is included identical clones from two or more individuals will be combined into a single row.
If any columns have missing values, the corresponding row containing the missing value is excluded. Thus, do not include columns that have missing values. If you wish to retain every clonotype, adding an index column or the nucleotide sequence will prevent rows with identical amino acid sequences from being merged.

Once the data is properly formatted, the next step is to connect the data to an instance of the TCRrep class. The header of almost all scripts working with tcrdist3 includes the import statement from tcrdist.repertoire import TCRrep. When a TCRrep instance is initialized, the user must specify some key information along with the input data:

organism specifies the appropriate organism. Either the character string 'human' or 'mouse' must be specified.
chains specifies whether the TCRrep instance will evaluate a single chain or paired chain data. Provide ['alpha'] or ['beta'] to the chains argument for single-chain analysis. For paired chain analysis, supply ['alpha', 'beta']. Tcrist3 supports ['gamma'],[ 'delta'], or ['gamma', 'delta'] as available options as well. The organism and chains arguments ensure the correct lookup when appending CDR1, CDR2, and CDR2.5 sequences to the input cell_df DataFrame. To append these germline-encoded CDR sequences, tcrdist3 must recognize the user-supplied V gene names. The package uses IMGT nomenclature and a library of allele-specific reference genes.
cell_df contains the input TCR data. Only the relevant columns should be passed in the DataFrame to the cell_df argument. This is critical because a NaN (missing value) in any column will result in the corresponding row being removed from the analysis.
If the user wishes to retain clones identical at the amino acid level but with distinct CDR3 nucleotide junctions, the nucleotide sequence or another unique-valued column should be provided in the DataFrame passed to the cell_df argument.
Finally, remember that any row of cell_df with an unrecognized V gene name will be removed from the final clone_df. It is possible to see those lines of cell_df not integrated into clone_df by calling tr.show_incomplete() after initialization. (Note: Advanced users who wish to add new genes not currently in the tcrdist3 library can do so by modifying the content of the ‘alphabeta_gammadelta_db.tsv ‘ file in the package source code (python3.8/site-packages/tcrdist/db/alphabeta_gammadelta_db.tsv))

Before proceeding, it is also helpful to understand that each TCRrep instance contains two Pandas DataFrames: (i) the cell_df, which is provided by the user at initialization, and (ii) the clone_df, which is generated by the program immediately thereafter. The cell_df contains the data specified by the user, which is then augmented with columns containing IMGT aligned CDR1, CDR2, and CDR2.5 inferred from the V-gene name. The clone_df is a derivative Pandas DataFrame generated by deduplicating identical rows in the cell_df. That is, the rows of the cell_df with identical values are grouped together and the count column is updated to reflect the aggregation of multiple rows. Also, it is helpful to know that the order of the rows in the clone_df will not match the order in cell_df. (Although not recommended for new users of tcrdist3, users who pre-check their data to ensure no missing values and no unrecognized V-gene names, may use the deduplicate = False option which will allow the cell_df row order to be directly transferred to the clone_df without any row removal.)

andreas-wilm commented 2 years ago

Wow. Again, excellent answer. I was not aware of these details. Thanks so much!

So the aggregation into clone_df is not based on a clonotype definition as such, but on duplicated rows. In other words, if I have lots of annotation in the input df, then two cells with different annotation values, but otherwise identical values will not be aggregated, correct?

In practice I often have a large cell df as input with lots of annotations per cell. Duplicates might be possible depending on the annotations, e.g. two identical TCRs from two different samples, with sample as annotation. What's considered best practice, if I would like to compute their distances and keep those annotations? Would I...:

first aggregate the cell df into a clonotype df using my own clonotype definition, keeping counts and e.g. add contributing samples and cells as list to a merged clonotype entry
then use that as input for TCRrep and for good measure use deduplicate = False (since duplicates were merged already as clonotypes)
and after distance computation use tr.clone_dfand tr.pw_[alpha|beta] (and discard the input clonotype df) to make sure distance and clonotype indices match?

Many thanks again

kmayerb commented 2 years ago

_Re: So the aggregation into clonedf is not based on a clonotype definition as such, but on duplicated rows. In other words, if I have lots of annotation in the input df, then two cells with different annotation values, but otherwise identical values will not be aggregated, correct?

clonotype aggregation will combine anything with all identical values across all columns, sequence feature or metadata. Any unique metadata such as subject-name will prevent agglomeration. This was based on the legacy idea of collapsing clones identical at AA level, but it now creates considerable confusion.

Re: In practice I often have a large cell df as input with lots of annotations per cell.

Briefly, you can keep everything like this:

df = pd.read_csv("/Users/andreas/Downloads/dash.csv")
df['rank'] = df.index.to_list()
tr = TCRrep(cell_df = df, 
            organism = 'mouse', 
            chains = ['alpha','beta'], 
            db_file = 'alphabeta_gammadelta_db.tsv', 
            compute_distances = False)
tr.clone_df = tr.clone_df.sort_values('rank')
tr.compute_distances()

Feel free to keep your desired meta-data in the df as long as there are no missing values in any of the columns. There won't be any deduplication because the 'rank' column is unique to each row, but the groupby command is still run so the order will be upset, so resort the clone_df before you compute the distances.

By this method, the only things dropped will be the clones with an invalid V-gene.

You can see what's missing, with something like this

missing = sorted(list(set(df['rank')) - set(tr.clone_df['rank'])))
df.iloc[missing,]

Also with large data, you probably don't want to store all distances, but only those less than or equal to x. And you can spread this out over N cpus

from tcrdist.breadth get_safe_chunk 
tr.cpus = N 
safe_chunk = get_safe_chunk(tr.clone_df.shape[0],tr.clone_df.shape[0] )
tr.compute_sparse_rect_distances(radius = x, chunk_size = safe_chunk)

https://tcrdist3.readthedocs.io/en/latest/sparsity.html?highlight=sparse#sparse-representation

andreas-wilm commented 2 years ago

Awesome. Thank you!

andreas-wilm commented 2 years ago

Just to double check: I assume you meant tr.clone_df = tr.clone_df.sort_values('rank') instead of just tr.clone_df.sort_values('rank')

kmayerb commented 2 years ago

Yes absolutely. thanks. I updated the snippet to reflect this.

tr.clone_df = tr.clone_df.sort_values('rank')

is correct

kmayerb / tcrdist3

Unexpected distance values for partially duplicated entries #60

MORE INFO, WHICH I WILL ADD SHORTLY TO THE READ THE DOCS PAGE: