kmayerb / tcrdist3

flexible CDR based distance metrics
MIT License

some basic questions on tcrdist3 #72

Closed SheahLinLee closed 1 year ago

SheahLinLee commented 2 years ago

Hi kmayer,

Thank you for making this tool open source. I have just started using it and am going through each step in https://tcrdist3.readthedocs.io/en/latest/index.html, trying to understand what each step does and what the output of each step means.

I have a few questions. Currently using the 'dash.csv' file.

import pandas as pd
from tcrdist.repertoire import TCRrep

df = pd.read_csv("dash.csv", sep=',')
tr = TCRrep(cell_df = df,
            organism = 'mouse',
            chains = ['alpha','beta'],
            db_file = 'alphabeta_gammadelta_db.tsv')

The df has 1924 rows, representing 1924 different TCR chains. After calculating the distances, the output 'tr.pw_alpha' is an array of 1920 x 1920. I assume the array holds the distances between TCR1...TCR1924 and TCR1...TCR1924, so I expected it to be 1924 x 1924. However, the array is 1920 x 1920.

My questions are:

1) Why is the 'tr.pw_alpha' array not 1924 x 1924? What happened to the 4 TCRs?
2) What do the additional columns in tr.cell_df and tr.clone_df mean, e.g. cdr1_a_aa, cdr2_a_aa etc.? Are they amino acid sequences? How were they derived?

Sorry about the questions. It's the first time I am using it and I don't want to misinterpret the results.

Thank you.

kmayerb commented 2 years ago

Note the difference between the input cell_df and clone_df (that matches the order of the output arrays).

Loading Data into a TCRrep Instance

Once the data is properly formatted, the next step is to connect the data to an instance of the TCRrep class. The header of almost all scripts working with tcrdist3 includes the import statement from tcrdist.repertoire import TCRrep. When a TCRrep instance is initialized, the user must specify some key information along with the input data: the organism, the chains, and the db_file reference database.

The organism and chains arguments ensure the correct lookup when appending CDR1, CDR2, and CDR2.5 sequences to the input cell_df DataFrame. To append these germline-encoded CDR sequences, tcrdist3 must recognize the user-supplied V gene names. The package uses IMGT nomenclature and a library of allele-specific reference genes.

Before proceeding, it is also helpful to understand that each TCRrep instance contains two pandas DataFrames: (i) the cell_df, which is provided by the user at initialization, and (ii) the clone_df, which is generated by the program immediately thereafter. The cell_df contains the data specified by the user, which is then augmented with columns containing IMGT-aligned CDR1, CDR2, and CDR2.5 sequences inferred from the V-gene name. The clone_df is a derivative pandas DataFrame generated by deduplicating identical rows in the cell_df. That is, rows of the cell_df with identical values are grouped together and the count column is updated to reflect the aggregation of multiple rows. Note that the order of the rows in the clone_df will not match the order in the cell_df. (Although not recommended for new users of tcrdist3, users who pre-check their data to ensure there are no missing values and no unrecognized V-gene names may use the option deduplicate = False, which transfers the cell_df row order directly to the clone_df without any row removal.)
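To make the cell_df to clone_df step concrete, here is a minimal sketch in plain Python. It is not tcrdist3's actual code. It only illustrates the behaviour described above, on the assumption (suggested by the deduplicate = False note) that rows with missing values or unrecognized V-gene names are removed before identical rows are collapsed. The toy sequences, the "TRBV99*01" gene name, the KNOWN_V_GENES set and the to_clones helper are all hypothetical.

```python
from collections import Counter

# Hypothetical toy input: one tuple per sequenced cell (cdr3_b_aa, v_b_gene).
cells = [
    ("CASSLGQAYEQYF", "TRBV13-1*01"),
    ("CASSLGQAYEQYF", "TRBV13-1*01"),   # exact duplicate of the row above
    ("CASSDRGNTLYF", "TRBV19*01"),
    (None, "TRBV19*01"),                # missing CDR3
    ("CASSPGTGGYEQYF", "TRBV99*01"),    # made-up V gene, not in the reference set
]

# Hypothetical stand-in for the V genes found in the reference database.
KNOWN_V_GENES = {"TRBV13-1*01", "TRBV19*01"}


def to_clones(rows, known_v_genes):
    """Drop unusable rows, then collapse identical rows into (row, count) pairs."""
    usable = [r for r in rows if None not in r and r[1] in known_v_genes]
    counts = Counter(usable)
    # Sorting shows that the clone order need not follow the input order.
    return sorted(counts.items())


clones = to_clones(cells, KNOWN_V_GENES)
print(len(cells), "cells ->", len(clones), "clones")
for row, count in clones:
    print(row, "count =", count)
```

In this toy case five input rows become two clones, so a pairwise matrix built from the clones would be 2 x 2 rather than 5 x 5. Comparing len(tr.cell_df) with len(tr.clone_df) on real data is a quick way to see how many rows were dropped or merged.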

SheahLinLee commented 2 years ago

Hi Koshlan,

Thank you very much for your detailed reply.

I have a few more questions if that's OK!

Q1 I was able to run TCR distances and radius on my own sample using the online tutorial. For example, in tr.clone_df, "KKSETS" was listed under pmhc_a_aa. What does "KKSETS" stand for and how was it derived?

Q2 You mention above that rows of the cell_df with identical values are grouped together and the count column is updated. However, when I looked at my tr.clone_df, identical values were not aggregated (see attached) and were instead listed as neighbours. How can I fix it?

Q3a For my analysis, I am using tumour samples (antigen enriched, but with no knowledge of the epitopes) and trying to identify similar CDR3s that could recognize the same antigen in the same tumour. I tried to use quasi-public clones, but I get ValueError: UNFORTUNATELY NO QUASI PUBLIC CLONES WERE FOUND, CONSIDER YOUR QUERY STRINGENCY. I assume that's because all my subjects are the same. Is there another function I can use?

Q3b Related to the above, CDR3 matters most for my analysis, more than the V and J genes. I tried to calculate the distances for CDR3 only using:

dmat = _pw( metric = my_own_metric,
            seqs1 = df['cdr3_b_aa'].values,
            ncpus=2,
            uniqify=True,
            use_numba=False)

But instead of my own metric, can I just use the default metric? (If yes, what should I put for metric?) In addition, can I use the output from this for the other functions, including radius, metaclonotypes and CDR3 motifs?

Thank you. Sorry for so many questions! If it's easier, I can email you instead.

Sheah Lin

kmayerb commented 2 years ago

No need to use a custom metric.

Try setting the weights on the other CDRs to zero:

"""
If you want 'tcrdistances' AND you want control over EVERY parameter.
"""
import pwseqdist as pw
import pandas as pd
from tcrdist.repertoire import TCRrep

df = pd.read_csv("dash.csv")
tr = TCRrep(cell_df = df, 
            organism = 'mouse', 
            chains = ['alpha','beta'], 
            compute_distances = False,
            db_file = 'alphabeta_gammadelta_db.tsv')

weights_a= { 
    "cdr3_a_aa" : 3,
    "pmhc_a_aa" : 0,
    "cdr2_a_aa" : 0,
    "cdr1_a_aa" : 0}

weights_b = { 
    "cdr3_b_aa" : 3,
    "pmhc_b_aa" : 0,
    "cdr2_b_aa" : 0,
    "cdr1_b_aa" : 0}

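# Replace the default weights so that only CDR3 contributes to the distance
tr.weights_a = weights_a
tr.weights_b = weights_b

# Distances were skipped at initialization (compute_distances = False), so compute them now
tr.compute_distances()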
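
To see why zero weights isolate CDR3, here is a small sketch in plain Python. It assumes the per-chain distance is a weighted sum of per-region distance matrices. That is my reading of the weights dictionaries above, not something checked against the tcrdist3 source. The weighted_total helper and the toy matrices for two TCRs are hypothetical.

```python
def weighted_total(components, weights):
    """Sum per-region distance matrices, each scaled by its weight."""
    n = len(next(iter(components.values())))
    total = [[0] * n for _ in range(n)]
    for region, matrix in components.items():
        w = weights[region]
        for i in range(n):
            for j in range(n):
                total[i][j] += w * matrix[i][j]
    return total


# Hypothetical per-region distances between two alpha chains.
components_a = {
    "cdr3_a_aa": [[0, 4], [4, 0]],
    "pmhc_a_aa": [[0, 2], [2, 0]],
    "cdr2_a_aa": [[0, 5], [5, 0]],
    "cdr1_a_aa": [[0, 1], [1, 0]],
}

# Same weights as in the snippet above: only CDR3 contributes.
weights_a = {"cdr3_a_aa": 3, "pmhc_a_aa": 0, "cdr2_a_aa": 0, "cdr1_a_aa": 0}

cdr3_only = weighted_total(components_a, weights_a)
print(cdr3_only)
```

With these weights the result is simply three times the CDR3 matrix, and differences in CDR1, CDR2 and CDR2.5 no longer affect the distance.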