Closed andreas-wilm closed 2 years ago
Sorry this has caused confusion.
See more details about clone_df vs cell_df in your last issue: https://github.com/kmayerb/tcrdist3/issues/59
Note that distance matrices correspond to TCRrep.clone_df not the raw input.
If your data is perfectly clean with no invalid CDR3 or V-gene names, the option at initialization should help:
deduplicate = False
e.g.,
tr = TCRrep(cell_df = df,
organism = 'mouse',
chains = ['alpha','beta'],
db_file = 'alphabeta_gammadelta_db.tsv',
deduplicate = False)
With deduplicate = False:
assert tr.pw_alpha.shape[0] == df.shape[0]
assert tr.pw_alpha.shape[0] == tr.clone_df.shape[0]
The reason we don't make deduplicate = False
the default is that if you have any missing values or invalid V-genes in your input, distance computation will fail.
https://tcrdist3.readthedocs.io/en/latest/inputs.html#critical-information
Once the data is properly formatted, the next step is to connect the data to an instance of the TCRrep class. The header of almost all scripts working with tcrdist3 includes the import statement from tcrdist.repertoire import TCRrep. When a TCRrep instance is initialized, the user must specify some key information along with the input data:
Before proceeding, it is also helpful to understand that each TCRrep instance contains two Pandas DataFrames: (i) the cell_df, which is provided by the user at initialization, and (ii) the clone_df, which is generated by the program immediately thereafter. The cell_df contains the data specified by the user, which is then augmented with columns containing IMGT aligned CDR1, CDR2, and CDR2.5 inferred from the V-gene name. The clone_df is a derivative Pandas DataFrame generated by deduplicating identical rows in the cell_df. That is, the rows of the cell_df with identical values are grouped together and the count column is updated to reflect the aggregation of multiple rows. Also, it is helpful to know that the order of the rows in the clone_df will not match the order in cell_df. (Although not recommended for new users of tcrdist3, users who pre-check their data to ensure no missing values and no unrecognized V-gene names, may use the deduplicate = False option which will allow the cell_df row order to be directly transferred to the clone_df without any row removal.)
Wow. Again, excellent answer. I was not aware of these details. Thanks so much!
So the aggregation into clone_df
is not based on a clonotype definition as such, but on duplicated rows. In other words, if I have lots of annotation in the input df, then two cells with different annotation values, but otherwise identical values will not be aggregated, correct?
In practice I often have a large cell df as input with lots of annotations per cell. Duplicates might be possible depending on the annotations, e.g. two identical TCRs from two different samples, with sample as annotation. What's considered best practice, if I would like to compute their distances and keep those annotations? Would I...:
deduplicate = False
(since duplicates were merged already as clonotypes)tr.clone_df
and tr.pw_[alpha|beta]
(and discard the input clonotype df) to make sure distance and clonotype indices match?Many thanks again
_Re: So the aggregation into clonedf is not based on a clonotype definition as such, but on duplicated rows. In other words, if I have lots of annotation in the input df, then two cells with different annotation values, but otherwise identical values will not be aggregated, correct?
Re: In practice I often have a large cell df as input with lots of annotations per cell.
Briefly, you can keep everything like this:
df = pd.read_csv("/Users/andreas/Downloads/dash.csv")
df['rank'] = df.index.to_list()
tr = TCRrep(cell_df = df,
organism = 'mouse',
chains = ['alpha','beta'],
db_file = 'alphabeta_gammadelta_db.tsv',
compute_distances = False)
tr.clone_df = tr.clone_df.sort_values('rank')
tr.compute_distances()
Feel free to keep your desired meta-data in the df as long as there are no missing values in any of the columns. There won't be any deduplication because the 'rank' column is unique to each row, but the groupby command is still run so the order will be upset, so resort the clone_df before you compute the distances.
By this method, the only things dropped will be the clones with an invalid V-gene.
You can see what's missing, with something like this
missing = sorted(list(set(df['rank')) - set(tr.clone_df['rank'])))
df.iloc[missing,]
Also with large data, you probably don't want to store all distances, but only those less than or equal to x. And you can spread this out over N cpus
from tcrdist.breadth get_safe_chunk
tr.cpus = N
safe_chunk = get_safe_chunk(tr.clone_df.shape[0],tr.clone_df.shape[0] )
tr.compute_sparse_rect_distances(radius = x, chunk_size = safe_chunk)
https://tcrdist3.readthedocs.io/en/latest/sparsity.html?highlight=sparse#sparse-representation
Awesome. Thank you!
Just to double check: I assume you meant
tr.clone_df = tr.clone_df.sort_values('rank')
instead of just
tr.clone_df.sort_values('rank')
Yes absolutely. thanks. I updated the snippet to reflect this.
tr.clone_df = tr.clone_df.sort_values('rank')
is correct
Hello again,
I see unexpected distance values in my own data, which I was able to reproduce using the standard TCRdist3 example with partially duplicated entries:
Distances are as follows:
So far so good. Now I take one entry, duplicate and change it slightly:
The last two entries now have identical
cdr3_b_aa
, but slightly differentcdr3_a_aa
entries.Now let's recompute the distances of this changed df:
And now the distances - or here their index - is off:
Do I misunderstand this?
Many thanks, Andreas