Closed SheahLinLee closed 1 year ago
Note the difference between the input cell_df and the clone_df (whose row order matches that of the output arrays).
Once the data is properly formatted, the next step is to connect the data to an instance of the TCRrep class. The header of almost all scripts working with tcrdist3 includes the import statement from tcrdist.repertoire import TCRrep. When a TCRrep instance is initialized, the user must specify some key information along with the input data:
organism specifies the appropriate organism. Either the character string 'human' or 'mouse' must be specified.
chains specifies whether the TCRrep instance will evaluate single-chain or paired-chain data. Provide ['alpha'] or ['beta'] to the chains argument for single-chain analysis. For paired-chain analysis, supply ['alpha', 'beta']. tcrdist3 also supports ['gamma'], ['delta'], or ['gamma', 'delta'] as available options.
The organism and chains arguments ensure the correct lookup when appending CDR1, CDR2, and CDR2.5 sequences to the input cell_df DataFrame. To append these germline-encoded CDR sequences, tcrdist3 must recognize the user-supplied V gene names. The package uses IMGT nomenclature and a library of allele-specific reference genes.
cell_df contains TCR data. At the risk of repeating ourselves, it cannot be stressed enough that only the relevant columns should be passed in the DataFrame to the cell_df argument. This is critical because a NaN (missing value) in any column will result in the corresponding row being removed from the analysis. Likewise, any row of cell_df with an unrecognized V gene name will be removed from the final clone_df. It is possible to see the rows of cell_df that were not integrated into clone_df by calling tr.show_incomplete() after initialization.
Before proceeding, it is also helpful to understand that each TCRrep instance contains two pandas DataFrames: (i) the cell_df, which is provided by the user at initialization, and (ii) the clone_df, which is generated by the program immediately thereafter. The cell_df contains the data specified by the user, which is then augmented with columns containing IMGT-aligned CDR1, CDR2, and CDR2.5 inferred from the V-gene name. The clone_df is a derivative pandas DataFrame generated by deduplicating identical rows in the cell_df. That is, the rows of the cell_df with identical values are grouped together and the count column is updated to reflect the aggregation of multiple rows.
It is also helpful to know that the order of the rows in the clone_df will not match the order in cell_df. (Although not recommended for new users of tcrdist3, users who pre-check their data to ensure there are no missing values and no unrecognized V-gene names may use the option deduplicate = False, which allows the cell_df row order to be transferred directly to the clone_df without any row removal.)
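As a rough illustration of why clone_df can end up with fewer rows than cell_df, here is a minimal pandas sketch. It is not tcrdist3's internal code: the sequences and gene names are made-up toy values, and the dropna/groupby steps are only an assumed approximation of the row removal and deduplication described above.

```python
import pandas as pd

# Toy stand-in for cell_df: one row per sequenced cell (values are made up).
cell_df = pd.DataFrame({
    "cdr3_b_aa": ["CASSLGQAYEQYF", "CASSLGQAYEQYF", "CASSPDRGEQYF", None],
    "v_b_gene": ["TRBV13-1*01", "TRBV13-1*01", "TRBV19*01", "TRBV19*01"],
    "count": [1, 2, 1, 1],
})

# Step 1 (assumed): a missing value in any column removes the whole row.
complete = cell_df.dropna()

# Step 2 (assumed): rows identical in every column except count are merged
# and their counts are summed. groupby sorts the keys, so the row order
# of the result need not match the input order.
key_cols = [c for c in complete.columns if c != "count"]
clone_df = complete.groupby(key_cols)["count"].sum().reset_index()

print(len(cell_df), "input rows ->", len(clone_df), "clones")
print(clone_df)
```

In this toy case four input rows collapse to two clones: one row is lost to the missing CDR3 and two identical rows are merged with their counts added. With real data, tr.show_incomplete() is the way to see which rows were actually dropped.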
Hi Koshlan,
Thank you very much for your detailed reply.
I have a few more questions, if that's OK!
Q1 I was able to compute TCR distances and radius on my own sample using the online tutorial. For example, in tr.clone_df, "KKSETS" was listed under pmhc_a_aa. What does "KKSETS" stand for, and how was it derived?
Q2 You mention above that rows of the cell_df with identical values are grouped together and the count column is updated. However, when I looked at my tr.clone_df, identical values were not aggregated (see attached) and were instead listed as neighbours. How can I fix this?
Q3a For my analysis, I am using tumour samples (antigen enriched, but with no knowledge of the epitopes) and trying to identify similar CDR3s that could recognize the same antigen in the same tumour. I tried to use quasi-public clones but I get:
ValueError: UNFORTUNATELY NO QUASI PUBLIC CLONES WERE FOUND, CONSIDER YOUR QUERY STRINGENCY
I assume that's because all my subjects are the same. Is there another function I can use?
Q3b Related to the above, CDR3 matters most for my analysis, more than the V and J genes. I tried to calculate the distances for CDR3 alone using:
dmat = _pw(metric = my_own_metric,
           seqs1 = df['cdr3_b_aa'].values,
           ncpus = 2,
           uniqify = True,
           use_numba = False)
Instead of my own metric, can I just use the default metric (and if so, what should I pass for metric)? In addition, can I use the output with the other functions, including radius, metaclonotypes, and CDR3 motifs?
Thank you. Sorry for so many questions! If it's easier, I can email you instead.
Sheah Lin
No need to use a custom metric.
Try setting the weights to zero on the other CDRs:
"""
If you want 'tcrdistances' AND you want control over EVERY parameter.
"""
import pwseqdist as pw
import pandas as pd
from tcrdist.repertoire import TCRrep
df = pd.read_csv("dash.csv")
# Defer the distance computation until the weights have been set.
tr = TCRrep(cell_df = df,
            organism = 'mouse',
            chains = ['alpha','beta'],
            compute_distances = False,
            db_file = 'alphabeta_gammadelta_db.tsv')
# Weight only CDR3; zero weights remove CDR1, CDR2, and CDR2.5 (pmhc) from the distance.
weights_a = {
    "cdr3_a_aa" : 3,
    "pmhc_a_aa" : 0,
    "cdr2_a_aa" : 0,
    "cdr1_a_aa" : 0}
weights_b = {
    "cdr3_b_aa" : 3,
    "pmhc_b_aa" : 0,
    "cdr2_b_aa" : 0,
    "cdr1_b_aa" : 0}
tr.weights_a = weights_a
tr.weights_b = weights_b
# Compute the pairwise distances using the weights above.
tr.compute_distances()
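To make the effect of zero weights concrete, here is a small pure-Python sketch of a weighted sum of per-region distances. It is only an illustration under stated assumptions: toy_distance is a hypothetical stand-in (a mismatch count), not the tcrdist metric, and the sequences and the non-zero weights are made up.

```python
def toy_distance(a, b):
    """Hypothetical per-region distance: mismatches plus length difference.
    A stand-in for illustration only, NOT the tcrdist metric."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))

def weighted_distance(tcr1, tcr2, weights):
    """Sum weight * per-region distance over the regions named in weights."""
    return sum(w * toy_distance(tcr1[region], tcr2[region])
               for region, w in weights.items())

# Two made-up beta chains that differ in CDR1, CDR2, and CDR3.
tcr_x = {"cdr1_b_aa": "AAAA", "cdr2_b_aa": "CCCC",
         "pmhc_b_aa": "DDDD", "cdr3_b_aa": "CASSLGF"}
tcr_y = {"cdr1_b_aa": "AAAT", "cdr2_b_aa": "CCGG",
         "pmhc_b_aa": "DDDD", "cdr3_b_aa": "CASSPGF"}

# Illustrative weights: every region contributes.
all_regions = {"cdr3_b_aa": 3, "pmhc_b_aa": 1, "cdr2_b_aa": 1, "cdr1_b_aa": 1}
# Same shape as weights_b above: only CDR3 contributes.
cdr3_only = {"cdr3_b_aa": 3, "pmhc_b_aa": 0, "cdr2_b_aa": 0, "cdr1_b_aa": 0}

d_all = weighted_distance(tcr_x, tcr_y, all_regions)
d_cdr3 = weighted_distance(tcr_x, tcr_y, cdr3_only)
print(d_all, d_cdr3)
```

With the other weights at zero, differences in CDR1, CDR2, and CDR2.5 no longer move the total, so two TCRs with the same CDR3 but different V genes would come out at distance zero under this scheme.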
Hi kmayer,
Thank you for making this tool open source. I have just started using it and am going through each step in https://tcrdist3.readthedocs.io/en/latest/index.html, trying to understand what each step does and what its output means.
I have a few questions. I am currently using the 'dash.csv' file.
The df has 1924 rows, representing 1924 different TCR chains. After calculating the distances, the output 'tr.pw_alpha' is an array of 1920 x 1920. I assumed the array would show the distances between TCR1...TCR1924 and TCR1...TCR1924, yet it is 1920 x 1920.
My questions are:
1) Why is the 'tr.pw_alpha' array not 1924 x 1924? What happened to the 4 TCRs?
2) What do the additional columns in tr.cell_df and tr.clone_df mean (e.g., cdr1_a_aa, cdr2_a_aa)? Are they amino acid sequences? How were they derived?
Sorry about the questions. It's the first time I am using it and I don't want to misinterpret the results.
Thank you.