kmayerb / tcrdist3

flexible CDR based distance metrics
MIT License
53 stars 17 forks source link

Question: Combined distance matrix for alpha and beta #56

Closed andreas-wilm closed 3 years ago

andreas-wilm commented 3 years ago

Dear Koshlan and TCRdist3 devs,

I would like to know whether it is possible to get a distance matrix from TCRdist3 that represents the combined distance of the alpha and beta chains. It's clear from the documentation how to get pairwise beta and alpha distances (tr.pw_alpha and tr.pw_beta resp), but I was thinking of one score (in case of paired input data) as shown in this TCRdist2 figure. Or can I simply add/average the two and that's it?

Apologies in advance if this clear from the documentation or if I misunderstood this.

Many thanks, Andreas

kmayerb commented 3 years ago

Andreas,

Great question, sorry for the slower than average response:

Assuming you have full matrices and not sparse ones, combined Alpha and Beta TCRdists into one distance is as simple as

tr.pw_alpha + tr.pw_beta

However, if you have more than 10K clones, you might be using the sparse representation that will drop all values less than some pre specified value, you will need to be careful combining those: Try this, method shown below, but this is not a fully supported feature yet. Let me know if this resolves your question?

from scipy import sparse
def add_sparse_pwd(a, b, neg1_fmt=False):
    """
    Add sparse pairwise distance matrices a and b.
    Can work with two "formats": true zero distances as -1
    or true zero distances as 0. Returns same format as received.
    Does not attempt to add sparse zeros (ie NA),
    so returns an intersection of data in a and b (summed).
    Returns a scipy.sparse.csr matrix (can take csr or any
    format as inputs).
    Parameters
    ----------
    a, b : scipy.sparse.csr or coo
        Sparse pairwise distance matrices. D=0 can be 
        represented as -1 or 0 but should be specified
        using neg1_fmt
    neg1_fmt : bool
        Indicator whether to assume D=0 is represented as -1
    Returns
    -------
    tot : scipy.sparse.csr
        Sparse matrix representing sum of a and b
        where a and b both have values.
    """

    a_coo = a.tocoo()
    b_coo = b.tocoo()
    nr, nc = a_coo.shape

    ind_a = a_coo.row.astype(np.int64) * nc + a_coo.col
    ind_b = b_coo.row.astype(np.int64) * nc + b_coo.col
    # ind_ab = np.array(tuple(set(ind_a).intersection(set(ind_b))))
    ind_ab = np.intersect1d(ind_a, ind_b, assume_unique=True)
    keep_a = np.in1d(ind_a, ind_ab, assume_unique=True)
    keep_b = np.in1d(ind_b, ind_ab, assume_unique=True)
    if neg1_fmt:
        a_data = a_coo.data[keep_a]
        a_data[a_data == -1] = 0
        b_data = b_coo.data[keep_b]
        b_data[b_data == -1] = 0
        tot = a_data + b_data
    else:
        tot = a_coo.data[keep_a] + b_coo.data[keep_b]
    """Need to add -1 as placeholder upon creation of new sparse matrix"""
    tot[tot == 0] = -1
    aplusb = sparse.coo_matrix((tot,
                                (a_coo.row[keep_a], a_coo.col[keep_a])), shape=(nr, nc))
    aplusb = aplusb.tocsr()
    if not neg1_fmt:
        """Convert it back to 0 = true zero distance"""
        aplusb.data[aplusb.data == -1] = 0
    return aplusb
​
​
​
## TEST 
​
import numpy as np
from scipy.sparse import csr_matrix
​
A = np.matrix([ [1, 2, 3], 
                [3, 4, 0]])
B = np.matrix([ [9, 8, 0], 
                [0, 1, 0]])
​
As = csr_matrix(A)
Bs = csr_matrix(B)
Cs = As + Bs
# Regular Addition
Cs.todense()
np.all(Cs.todense() == A + B)
print(Cs.todense())
​
​
# TCRDIST ADDITION OF SPARES (ANY ZERO STAY ZEROS?)
Ns = add_sparse_pwd(As, Bs)
print(Ns.todense())
andreas-wilm commented 3 years ago

Excellent. Many thanks for the detailed answer and solution. I forgot about the sparse matrix representation. I will use this code as-is. It would be great to have this part of the TCRdist3 functionality at some point :) Many thanks again

kmayerb commented 2 years ago

Now please see a function for add_sparse_pwd (adding sparse pairwise distances) is incorporated into the package after version 0.2.1

from tcrdist.sparse import add_sparse_pwd
tr_search.rw_alpha_beta =  add_sparse_pwd(tr_search.rw_beta,tr_search.rw_alpha)
KJMaroney commented 1 year ago

Hello, I am very new to python. Can you help me out? I just kind of ran your default from your documentation. How would I actually use the function you generously created in Oct 21, 2021 to create the matrix and export it? Sorry if this is a stupid question (I'm sure it is). I just need a matrix for the paired a/B TCR's and the dataframe to link back the original families and sequences. I have been trying to follow your documentation which is very thorough, but I must admit that being such a novice with Python is hindering me a bit. Thank you for any help you can offer.

kmayerb commented 1 year ago

First output the clone_df which is the same row dimensions as the matrix

tr.clone_df.to_csv('your_clones.csv', index = False)

Next if you want paired A/B distance

construct a paired matrix, by matrix addition

tr.pw_alpha_beta = tr.pw_beta + tr.pw_alpha

to save matrix to a csv, rows match the order in the clone_df !!!

tr.pw_alpha_beta.tofile('youroutput.csv', sep = ',')

to save as numpy matrix as other common formats:

# https://numpy.org/doc/stable/reference/generated/numpy.save.html#numpy.save tr.pw_alpha_beta.save('youroutput.npy', x) tr.pw_alpha_beta.savez('youroutput.npz', x) tr.pw_alpha_beta.savetxt('test.out', x, delimiter=',')

does this do what you are trying to accomplish?

On Mon, Jan 30, 2023 at 10:43 AM KJMaroney @.***> wrote:

Hello, I am very new to python. Can you help me out? I just kind of ran your default from your documentation. How would I actually use the function you generously created in Oct 21, 2021 to create the matrix and export it? Sorry if this is a stupid question (I'm sure it is). I just need a matrix for the paired a/B TCR's and the dataframe to link back the original families and sequences. I have been trying to follow your documentation which is very thorough, but I must admit that being such a novice with Python is hindering me a bit. Thank you for any help you can offer.

— Reply to this email directly, view it on GitHub https://github.com/kmayerb/tcrdist3/issues/56#issuecomment-1409146073, or unsubscribe https://github.com/notifications/unsubscribe-auth/ALD2PV7YPPF66A3XP4EIDMTWVADWBANCNFSM5CWAB7MA . You are receiving this because you commented.Message ID: @.***>

KevinMaroney commented 1 year ago

Yes! I was then able to name it with this: pw_alpha_beta_named = pd.DataFrame(df1, tr.clone_df["clone_id"], tr.clone_df["clone_id"]).

I had to export this to csv and restructure the array into a 3-column list to be read by Cytoscape. I'm sure there is a more efficient way to do this, but it works for my purpose so I can use metadata in Cytoscape. Thank you for your help.

kmayerb commented 1 year ago

Glad to help!

On Wed, Feb 8, 2023 at 9:50 AM Kevin Maroney @.***> wrote:

Yes! I was then able to name it with this: pw_alpha_beta_named = pd.DataFrame(df1, tr.clone_df["clone_id"], tr.clone_df["clone_id"]).

I had to export this to csv and restructure the array into a 3-column list to be read by Cytoscape. I'm sure there is a more efficient way to do this, but it works for my purpose so I can use metadata in Cytoscape. Thank you for your help.

— Reply to this email directly, view it on GitHub https://github.com/kmayerb/tcrdist3/issues/56#issuecomment-1423013091, or unsubscribe https://github.com/notifications/unsubscribe-auth/ALD2PV6FTI42UAKQG4N3JOTWWPMGBANCNFSM5CWAB7MA . You are receiving this because you commented.Message ID: @.***>

CaryBond commented 6 months ago

hello, can i ask u a question? i just found that in my dataset, the alpha chain data of one species doesnt have the V and J gene information, after i calculated the tcrdistances, there was an error: zero-size array to reduction operation maximum which has no identity. is this caused by the absence of the V and J sequences? if so, how could i calculated the tcrdistances for the alpha chain of that species? id appreciate it if u can give me some help with this.