kmayerb / tcrdist3

flexible CDR based distance metrics
MIT License
53 stars 17 forks source link

Unexpected distance values for partially duplicated entries #60

Closed andreas-wilm closed 2 years ago

andreas-wilm commented 2 years ago

Hello again,

I see unexpected distance values in my own data, which I was able to reproduce using the standard TCRdist3 example with partially duplicated entries:

df = pd.read_csv("dash.csv")
df = df.loc[0:3]# reduce to a few example entries
tr = TCRrep(cell_df = df, 
            organism = 'mouse', 
            chains = ['alpha','beta'], 
            db_file = 'alphabeta_gammadelta_db.tsv')
assert tr.pw_alpha.shape[0] == df.shape[0]

Distances are as follows:

In: tr.pw_alpha, tr.pw_beta

Out: (array([[  0, 108, 158],
        [108,   0, 158],
        [158, 158,   0]], dtype=int16),
 array([[  0,  69, 183],
        [ 69,   0, 159],
        [183, 159,   0]], dtype=int16))

So far so good. Now I take one entry, duplicate and change it slightly:

# append the last entry and modify
df = df.append(df.iloc[df.shape[0]-1])
df = df.reset_index(drop=True)
df.loc[df.shape[0]-1, 'cdr3_a_aa'] = df.loc[df.shape[0]-1, 'cdr3_a_aa'][:-1] + "A"

The last two entries now have identical cdr3_b_aa, but slightly different cdr3_a_aa entries.

In: df.iloc[2]['cdr3_b_aa'], df.iloc[3]['cdr3_b_aa']
Out: ('CASTGGGAPLF', 'CASTGGGAPLF')
In: df.iloc[2]['cdr3_a_aa'], df.iloc[3]['cdr3_a_aa']
Out: ('CALGSNTGYQNFYF', 'CALGSNTGYQNFYA')

Now let's recompute the distances of this changed df:

tr = TCRrep(cell_df = df, 
            organism = 'mouse', 
            chains = ['alpha','beta'], 
            db_file = 'alphabeta_gammadelta_db.tsv')
assert tr.pw_alpha.shape[0] == df.shape[0]

And now the distances - or here their index - is off:

In: tr.pw_alpha, tr.pw_beta

Out: (array([[  0,   0, 108, 158],
        [  0,   0, 108, 158],
        [108, 108,   0, 158],
        [158, 158, 158,   0]], dtype=int16),
 array([[  0,   0,  69, 183],
        [  0,   0,  69, 183],
        [ 69,  69,   0, 159],
        [183, 183, 159,   0]], dtype=int16))

Do I misunderstand this?

Many thanks, Andreas

kmayerb commented 2 years ago

Sorry this has caused confusion.

See more details about clone_df vs cell_df in your last issue: https://github.com/kmayerb/tcrdist3/issues/59

Note that distance matrices correspond to TCRrep.clone_df not the raw input.

If your data is perfectly clean with no invalid CDR3 or V-gene names, the option at initialization should help:

deduplicate = False

e.g.,

tr = TCRrep(cell_df = df, 
            organism = 'mouse', 
            chains = ['alpha','beta'], 
            db_file = 'alphabeta_gammadelta_db.tsv', 
            deduplicate = False)

With deduplicate = False:

assert tr.pw_alpha.shape[0] == df.shape[0]
assert tr.pw_alpha.shape[0] == tr.clone_df.shape[0]

The reason we don't make deduplicate = False the default is that if you have any missing values or invalid V-genes in your input, distance computation will fail.

MORE INFO, WHICH I WILL ADD SHORTLY TO THE READ THE DOCS PAGE:

https://tcrdist3.readthedocs.io/en/latest/inputs.html#critical-information

Once the data is properly formatted, the next step is to connect the data to an instance of the TCRrep class. The header of almost all scripts working with tcrdist3 includes the import statement from tcrdist.repertoire import TCRrep. When a TCRrep instance is initialized, the user must specify some key information along with the input data:

Before proceeding, it is also helpful to understand that each TCRrep instance contains two Pandas DataFrames: (i) the cell_df, which is provided by the user at initialization, and (ii) the clone_df, which is generated by the program immediately thereafter. The cell_df contains the data specified by the user, which is then augmented with columns containing IMGT aligned CDR1, CDR2, and CDR2.5 inferred from the V-gene name. The clone_df is a derivative Pandas DataFrame generated by deduplicating identical rows in the cell_df. That is, the rows of the cell_df with identical values are grouped together and the count column is updated to reflect the aggregation of multiple rows. Also, it is helpful to know that the order of the rows in the clone_df will not match the order in cell_df. (Although not recommended for new users of tcrdist3, users who pre-check their data to ensure no missing values and no unrecognized V-gene names, may use the deduplicate = False option which will allow the cell_df row order to be directly transferred to the clone_df without any row removal.)

andreas-wilm commented 2 years ago

Wow. Again, excellent answer. I was not aware of these details. Thanks so much!

So the aggregation into clone_df is not based on a clonotype definition as such, but on duplicated rows. In other words, if I have lots of annotation in the input df, then two cells with different annotation values, but otherwise identical values will not be aggregated, correct?

In practice I often have a large cell df as input with lots of annotations per cell. Duplicates might be possible depending on the annotations, e.g. two identical TCRs from two different samples, with sample as annotation. What's considered best practice, if I would like to compute their distances and keep those annotations? Would I...:

Many thanks again

kmayerb commented 2 years ago

_Re: So the aggregation into clonedf is not based on a clonotype definition as such, but on duplicated rows. In other words, if I have lots of annotation in the input df, then two cells with different annotation values, but otherwise identical values will not be aggregated, correct?

Re: In practice I often have a large cell df as input with lots of annotations per cell.

Briefly, you can keep everything like this:

df = pd.read_csv("/Users/andreas/Downloads/dash.csv")
df['rank'] = df.index.to_list()
tr = TCRrep(cell_df = df, 
            organism = 'mouse', 
            chains = ['alpha','beta'], 
            db_file = 'alphabeta_gammadelta_db.tsv', 
            compute_distances = False)
tr.clone_df = tr.clone_df.sort_values('rank')
tr.compute_distances()

Feel free to keep your desired meta-data in the df as long as there are no missing values in any of the columns. There won't be any deduplication because the 'rank' column is unique to each row, but the groupby command is still run so the order will be upset, so resort the clone_df before you compute the distances.

By this method, the only things dropped will be the clones with an invalid V-gene.

You can see what's missing, with something like this

missing = sorted(list(set(df['rank')) - set(tr.clone_df['rank'])))
df.iloc[missing,]

Also with large data, you probably don't want to store all distances, but only those less than or equal to x. And you can spread this out over N cpus

from tcrdist.breadth get_safe_chunk 
tr.cpus = N 
safe_chunk = get_safe_chunk(tr.clone_df.shape[0],tr.clone_df.shape[0] )
tr.compute_sparse_rect_distances(radius = x, chunk_size = safe_chunk)

https://tcrdist3.readthedocs.io/en/latest/sparsity.html?highlight=sparse#sparse-representation

andreas-wilm commented 2 years ago

Awesome. Thank you!

andreas-wilm commented 2 years ago

Just to double check: I assume you meant tr.clone_df = tr.clone_df.sort_values('rank') instead of just tr.clone_df.sort_values('rank')

kmayerb commented 2 years ago

Yes absolutely. thanks. I updated the snippet to reflect this.

tr.clone_df = tr.clone_df.sort_values('rank')

is correct