kmayerb / tcrdist3

flexible CDR based distance metrics
MIT License

Documentation of meta-clonotype discovery #48

Closed andreas-wilm closed 3 years ago

andreas-wilm commented 3 years ago

Hello there,

this is obviously more a wish or suggested enhancement than an issue, but it would be great if the meta-clonotype discovery section contained (even) more documentation. Most variable names etc. are intuitively understandable, but what for example are min_nsubject (minimum publicity of the meta-clonotype), min_nr (minimum non-redundancy) and bkgd_cntl_nn2()'s include_seq_info? Does min_nsubject=2 for example mean that member sequences have to come from at least two subjects, otherwise the clonotype is discarded?

Many thanks!

kmayerb commented 3 years ago

Really appreciate the request for more info! Thanks for pointing out the missing argument in the function bkgd_cntl_nn2(). The primary source of information is the docstrings and this argument was missing, but it has now been added to the master branch.

bkgd_cntl_nn2 stands for "background-controlled nearest neighbors". The function sets an appropriate radius for each TCR, such that the number of neighbors in an antigen-naive background is controlled to a desired estimated frequency. It then tabulates how many TCRs within the target antigen-enriched dataset fall within that radius, so that the meta-clonotypes can be ranked by their sensitivity to detect other TCRs with shared antigen recognition. The user may also want to inspect the actual sequences that were found: setting include_seq_info to True stores that information as lists in the returned DataFrame.
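To make the idea concrete, here is a toy sketch of background-controlled radius selection in plain Python. It is not the tcrdist3 implementation: the function names, the precomputed distance lists, the `<=` neighbor rule, and the unweighted background frequency are all assumptions made for illustration (the real function works on TCRrep objects and may weight background sequences).

```python
def max_controlled_radius(bkgd_dists, thresholds, ctrl_bkgd):
    """Return the largest radius whose background neighbor frequency
    stays at or below ctrl_bkgd, or None if no threshold qualifies.
    (Hypothetical helper, not part of tcrdist3.)"""
    best = None
    for radius in sorted(thresholds):
        # Fraction of background TCRs within this radius of the center.
        freq = sum(d <= radius for d in bkgd_dists) / len(bkgd_dists)
        if freq > ctrl_bkgd:
            break  # frequency only grows with radius, so stop here
        best = radius
    return best


def score_centers(target_dists, bkgd_dists, thresholds, ctrl_bkgd):
    """For each center, pick its controlled radius and list the target
    TCRs (by index) that fall within that radius."""
    rows = []
    for i, (t_row, b_row) in enumerate(zip(target_dists, bkgd_dists)):
        radius = max_controlled_radius(b_row, thresholds, ctrl_bkgd)
        if radius is None:
            neighbors = []
        else:
            neighbors = [j for j, d in enumerate(t_row) if d <= radius]
        rows.append({"center": i, "radius": radius,
                     "target_neighbors": neighbors})
    return rows


# Made-up distances: row i holds distances from center i to every
# target TCR (first matrix) or every background TCR (second matrix).
target_dists = [[0, 12, 40], [12, 0, 30], [40, 30, 0]]
bkgd_dists = [[50, 20, 35, 70], [10, 14, 80, 90], [35, 45, 55, 65]]

rows = score_centers(target_dists, bkgd_dists,
                     thresholds=[0, 10, 20, 30, 40], ctrl_bkgd=0.25)
for row in rows:
    print(row)
```

In this sketch, a center whose background neighbors sit close by ends up with a small radius and few target neighbors, while a center far from the background can afford a larger radius. Keeping the neighbor indices in each row plays the same role as include_seq_info=True: the hits stay available for inspection instead of only their counts.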

Here is the updated docstring entry in tcrdist/neighbors.bkgd_cntl_nn2():

"""
include_seq_info : bool 
        If True, returned DataFrame <centers_df> will include ['target_neighbors', 'target_seqs',
        'background_neighbors','background_seqs','background_v', 'background_j']
        as columns in centers_df DataFrame returned by this function. 
        This allows for inspection of sequences found in both the antigen enriched repertoire and supplied
        background.
"""

Furthermore, based on this helpful suggestion, we've also updated the docstring for the function rank_centers() in tcrdist/centers.py:

"""
    min_nsubject : int
        Default 2 (minimum publicity of the meta-clonotype).
        That is, the minimum number of unique subjects contributing TCRs
        among a group of biochemically similar TCRs to form a meta-clonotype.
    min_nr : int
        Default 1 (minimum non-redundancy). Once the meta-clonotypes are ranked,
        the function requires each lower-ranked meta-clonotype to have a minimum number
        <min_nr> of new sequences not already spanned by a higher-ranked meta-clonotype.
"""

See also the full docstring:

    """
    This function takes the output of tcrdist.neighbors.bkgd_cntl_nn2(), 
    a set of scored metaclonotypes (centers - TCRs + radius) 
    and ranks them by chi2 statistics, 
    prioritizing those that include lots of target sequences 
    while minimizing inclusion of background sequences. 

    Parameters
    ----------
    centers_filename : str or None
        User can only provide centers_df or centers_filename but not both
        The filepath to a file containing metaclonotype centers information, generally produced with 
        tcrdist.neighbors.bkgd_cntl_nn2()
    centers_df : DataFrame or None
        User can only provide centers_df or centers_filename but not both.
        The Pandas DataFrame containing metaclonotype centers information, generally produced with 
        tcrdist.neighbors.bkgd_cntl_nn2()
    rank_column : str
        Default 'chi2joint'. One of 'chi2joint' (radius + motif averaged), 'chi2re' (using motif only), or 'chi2dist' (using radius only).
    min_nsubject : int
        Default 2 (minimum publicity of the meta-clonotype).
        That is, the minimum number of unique subjects contributing TCRs
        among a group of biochemically similar TCRs to form a meta-clonotype.
    min_nr : int
        Default 1 (minimum non-redundancy). Once the meta-clonotypes are ranked,
        the function requires each lower-ranked meta-clonotype to have a minimum number
        <min_nr> of new sequences not already spanned by a higher-ranked meta-clonotype.

    Returns
    -------
    df : DataFrame

"""

https://github.com/kmayerb/tcrdist3/blob/845d4e6b473ef191a35463b6af5450fcbf32fbdc/tcrdist/centers.py#L198-L231

andreas-wilm commented 3 years ago

Thanks a lot!