kmayerb / tcrdist3

flexible CDR based distance metrics
MIT License

Information about metaclonotype centers #65

Closed aoki0623mriid2 closed 2 years ago

aoki0623mriid2 commented 2 years ago

Dear kmayerb,

I have some questions about the table of metaclonotype centers, which is generated as "centers_df" by the "bkgd_cntl_nn2" function.

In that table, there are columns named "TR", "BR_weighted", "RR_weighted", "OR_weighted" and "chi2dist". What do the numbers in these columns mean? I assume they are the counts of TCRs captured by the metaclonotype in the enriched and background repertoires, plus the odds ratio and chi-squared value from the cross tabulation. Is my interpretation correct?

Best regards, Hiroyasu

kmayerb commented 2 years ago

The columns are generated in this part of the function bkgd_cntl_nn2: https://github.com/kmayerb/tcrdist3/blob/6bd7e58eed91b317245b4e909c8debc71db92fab/tcrdist/neighbors.py#L275-L293

These are internally computed variables for ranking metaclonotypes, favoring those most likely to capture other target antigen-associated TCRs while spanning relatively few "background" TCRs.

"weighted" refers to the weighting adjustment that accounts for non-uniform sampling of CDR3s from a particular set of V-J gene combinations.

TR is the target rate:

It is the number of target TCRs from the antigen-enriched set that fall within the radius, divided by the total number in that set, with a pseudocount added to avoid zero. For example, if you had 100 tetramer-positive TCRs and 8 of them fall within the radius, then TR would be (8+1)/(8+92+1) ≈ 0.09.
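
As a minimal sketch of the arithmetic above: the helper name `rate_with_pseudocount` is hypothetical and only mirrors the (8+1)/(8+92+1) formula from this comment; it is not copied from the tcrdist3 source.

```python
def rate_with_pseudocount(pos, neg, ps=1):
    """Rate of hits with a pseudocount: (pos + ps) / (pos + neg + ps).

    Hypothetical helper for illustration; it follows the formula given
    in the comment above, not necessarily the library's exact code.
    """
    return (pos + ps) / (pos + neg + ps)


# 100 tetramer-positive TCRs, 8 of which fall within the radius.
tr = rate_with_pseudocount(pos=8, neg=92)
print(round(tr, 3))  # 0.089

# The pseudocount keeps the rate above zero even with no hits.
print(rate_with_pseudocount(pos=0, neg=100) > 0)  # True
```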

BR_weighted is the background rate: the number of background TCRs within the radius divided by the total number of clones in the background, but this must be weighted:

    centers_df['BR_weighted'] = [compute_rate(pos=r['bkgd_hits_weighted'], 
                                    neg=n2-r['bkgd_hits_weighted']) for i,r in centers_df.iterrows()]

RR: relative rate

    centers_df['RR_weighted'] = centers_df['TR']/centers_df['BR_weighted']

OR: odds ratio

    centers_df['OR_weighted'] =[compute_odds_ratio(pos=r['target_hits'], 
                                       neg=n1-r['target_hits'], 
                                       bpos=r['bkgd_hits_weighted'], 
                                       bneg= n2-r['bkgd_hits_weighted'], ps = 1) for i,r in centers_df.iterrows()]
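
To make the relationship between the rate, relative rate, and odds ratio concrete, here is a rough sketch in plain Python. The helper names and the example counts are hypothetical, and the odds-ratio formula (adding the pseudocount `ps` to every cell of the 2x2 table) is an assumption about what `compute_odds_ratio` does, so check `tcrdist/neighbors.py` for the actual definition.

```python
def rate_with_pseudocount(pos, neg, ps=1):
    # (pos + ps) / (pos + neg + ps), as in the TR example above
    return (pos + ps) / (pos + neg + ps)


def odds_ratio_with_pseudocount(pos, neg, bpos, bneg, ps=1):
    # ASSUMPTION: the pseudocount is added to each of the four cells.
    return ((pos + ps) * (bneg + ps)) / ((neg + ps) * (bpos + ps))


# Hypothetical counts for one metaclonotype center.
n1 = 100                   # total target (antigen-enriched) TCRs
n2 = 10_000                # total background TCRs
target_hits = 8            # target TCRs within the radius
bkgd_hits_weighted = 2.5   # weighted background TCRs within the radius

tr = rate_with_pseudocount(target_hits, n1 - target_hits)
br_weighted = rate_with_pseudocount(bkgd_hits_weighted, n2 - bkgd_hits_weighted)
rr_weighted = tr / br_weighted
or_weighted = odds_ratio_with_pseudocount(
    pos=target_hits,
    neg=n1 - target_hits,
    bpos=bkgd_hits_weighted,
    bneg=n2 - bkgd_hits_weighted,
)

# Both ratios exceed 1 when the target set is enriched within the radius.
print(rr_weighted > 1, or_weighted > 1)
```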

'chi2dist' is a chi-squared statistic; high values indicate strong enrichment of target sequences within the radius relative to the number of background TCRs within the radius. If you choose to compute a regex from each centroid and its neighbors, you will also see a chi2re column, and centers_df['chi2joint'] combines the distance-based and regex-based chi-squared statistics:

    centers_df['chi2joint'] = [beta_re * r['chi2re'] + beta_dist * r['chi2dist'] for _, r in centers_df.iterrows()]
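
For intuition, a chi-squared statistic on a 2x2 table of hits and misses can be sketched as follows. This is the plain Pearson statistic without continuity correction; whether tcrdist3 computes 'chi2dist' exactly this way is an assumption, and the function names, counts, and beta weights below are illustrative only.

```python
def pearson_chi2_2x2(pos, neg, bpos, bneg):
    """Pearson chi-squared for the table [[pos, neg], [bpos, bneg]].

    Uncorrected statistic, written out by hand so the sketch needs no
    third-party packages. Illustrative; not the library's code.
    """
    n = pos + neg + bpos + bneg
    denom = (pos + neg) * (bpos + bneg) * (pos + bpos) * (neg + bneg)
    if denom == 0:
        return 0.0
    return n * (pos * bneg - neg * bpos) ** 2 / denom


def joint_chi2(chi2re, chi2dist, beta_re=1.0, beta_dist=1.0):
    # Weighted sum, matching the chi2joint line above.
    return beta_re * chi2re + beta_dist * chi2dist


# Same hit rate in target and background: no enrichment, statistic is 0.
print(pearson_chi2_2x2(10, 90, 100, 900))  # 0.0

# Enriched target set: the statistic is large.
chi2dist = pearson_chi2_2x2(8, 92, 2.5, 9997.5)
print(chi2dist > 100)  # True

# Combine with a hypothetical regex-based statistic.
print(joint_chi2(chi2re=4.0, chi2dist=10.0, beta_re=0.5, beta_dist=2.0))  # 22.0
```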

aoki0623mriid2 commented 2 years ago

Thank you for your kind reply. I now understand the meaning of the indices used for ranking metaclonotype centers.

Thank you so much!