theislab / single-cell-best-practices

https://www.sc-best-practices.org

Scaling TCR Dist #195

Open MiThoSan opened 1 year ago

MiThoSan commented 1 year ago

Dear Theis Lab,

I am following your excellent repository for using TCRdist. The instructions work perfectly on a subset of my TCRs. However, when the data exceed 10'000 clones, tcrdist3 suggests using a sparse representation (https://tcrdist3.readthedocs.io/en/0.2.0/sparsity.html). How can I handle the resulting compressed sparse matrix so that I also get a clonotype_network representation of my dataset?

Zethson commented 1 year ago

Dear @MiThoSan

thank you very much for the positive feedback! This project is by more than just Theislab, even though it's hosted in our GitHub organization.

What have you tried so far and where do things go wrong? This might also be a question for the developers of tcrdist3 and not us...

MiThoSan commented 1 year ago

Many thanks for your fast response!

I tried a subset of my TCR data following your instructions, and everything works as expected. However, when my data exceed 10'000 TCRs, I receive the following message when running TCRdist.

Input:

tr = TCRrep(cell_df=df_tcrdist, organism="human", chains=["alpha", "beta"])

Resulting error message:

When TCRrep. size 27136 > 10,000. TCRrep.compute_distances() may be called explicitly by a user with knowledge of system memory availability. However, it's HIGHLY unlikely that you want to compute such a large numpy array. INSTEAD, if you want all pairwise distance, you will likely want to set an appropriate number of cpus with TCRrep.cpus = x, and then generate a scipy.sparse csr matrix of distances with: TCRrep.compute_sparse_rect_distances(radius=50, chunk_size=100), leaving df and df2 arguments blank. When you do this the results will be stored as TCRrep.rw_beta instead of TCRrep.pw_beta. This function is highly useful for comparing a smaller number of sequences against a bulk set In such a case, you can specify df and df2 arguments to create a non-square matrix of distances. See https://tcrdist3.readthedocs.io/en/latest/sparsity.html?highlight=sparse for more info.

warnings.warn(f"\n\nWhen TCRrep. size {self.clone_df.shape[0]} > 10,000.\n"

Following the link https://tcrdist3.readthedocs.io/en/latest/sparsity.html allows me to perform the analysis as indicated:

tr = TCRrep(cell_df=df_tcrdist, organism="human", chains=["alpha", "beta"], compute_distances=False)
tr.cpus = 2
tr.compute_sparse_rect_distances(radius=50, chunk_size=100)

However, I think my main problem is handling the resulting sparse representation correctly. The following step of your script (with the small adjustment pw_alpha -> rw_alpha, pw_beta -> rw_beta) raised an error on the last line of code.

Input:

dist_total = tr.rw_alpha + tr.rw_beta
columns = tr.clone_df["index"].astype(float).astype(int)
df_dist = pd.DataFrame(dist_total, columns=columns, index=columns)

Error message: ValueError: Shape of passed values is (27136, 1), indices imply (27136, 27136)

Your help is highly appreciated.
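The ValueError reported above appears to come from passing a scipy sparse matrix straight to the `pd.DataFrame` constructor, which does not unpack it into rows and columns. The following is a minimal sketch of two ways around that, using small made-up matrices in place of `tr.rw_alpha` and `tr.rw_beta`; the values and the `clone_ids` labels are hypothetical and not taken from the thread. It only illustrates the container conversion and is not a tested recipe for the clonotype network step.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Toy stand-ins for tr.rw_alpha and tr.rw_beta (hypothetical values).
rw_alpha = sparse.csr_matrix(np.array([[0, 12, 0], [12, 0, 30], [0, 30, 0]]))
rw_beta = sparse.csr_matrix(np.array([[0, 24, 0], [24, 0, 0], [0, 0, 0]]))

# Sparse addition: an entry that is missing from one matrix counts as 0 there.
dist_total = rw_alpha + rw_beta

# Hypothetical clone labels, standing in for tr.clone_df["index"].
clone_ids = [101, 102, 103]

# Option 1: keep the data sparse inside pandas.
df_sparse = pd.DataFrame.sparse.from_spmatrix(
    dist_total, index=clone_ids, columns=clone_ids
)

# Option 2: densify first; this needs memory for the full n x n array.
df_dense = pd.DataFrame(dist_total.toarray(), index=clone_ids, columns=clone_ids)

print(df_dense)
```

Two caveats apply to real tcrdist3 output. Pairs outside the chosen radius are simply absent from the sparse matrix, so after the addition a pair that is stored for only one chain carries that chain's distance alone. How true zero distances are encoded in the sparse result should also be checked in the tcrdist3 documentation before interpreting the values.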

Zethson commented 1 year ago

We can't do anything about this when their return is inconsistent. I am afraid that you'll have to open an issue over there and ask for guidance.

drEast commented 1 year ago

@MiThoSan If this is still relevant: "compute_sparse_rect_distances" calculates the distances between one TCR and all TCRs of a reference set, which can be used quite efficiently for database queries (resulting in a distance matrix of shape len_atlas x len_query). However, it does not compute the pairwise distances used here.

I faced the scaling issue once myself, and had the following snippet lying around to solve this:

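from tcrdist.repertoire import TCRrep

# Build the repertoire without computing distances on construction.
tr = TCRrep(
    cell_df=df,
    organism='human',
    chains=['alpha', 'beta'],
    compute_distances=False,
    deduplicate=False,
    db_file='alphabeta_gammadelta_db.tsv',
)
# Trigger the pairwise distance computation explicitly.
tr.compute_distances()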

I will test this and add it to the notebook with a warning not to use overly large datasets (10k-15k clones still worked on my laptop). As this is the more general case and many people might have more than 10k clones, it should be handled in the book. Thanks for pointing this out.
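To judge whether a dense pairwise computation fits into memory for a given number of clones, a rough estimate of the size of one n x n matrix helps. The sketch below is plain arithmetic; the helper name `dense_matrix_gib` and the bytes-per-entry values are assumptions for illustration (2 bytes corresponds to a 16-bit integer, 8 bytes to a 64-bit float), and the dtype that tcrdist3 actually stores should be checked on your own `tr.pw_alpha` or `tr.pw_beta`.

```python
def dense_matrix_gib(n_clones: int, bytes_per_entry: int) -> float:
    """Size in GiB of one dense n_clones x n_clones matrix."""
    return n_clones * n_clones * bytes_per_entry / 1024**3


# Clone counts mentioned in this thread; the bytes-per-entry values are assumptions.
for n in (10_000, 15_000, 27_136):
    for width in (2, 8):
        size = dense_matrix_gib(n, width)
        print(f"{n:>6} clones, {width} bytes/entry: {size:.2f} GiB per matrix")
```

The estimate covers a single matrix. An alpha/beta analysis holds at least one matrix per chain, and summing them or wrapping the result in a DataFrame creates further copies, so the peak footprint is a multiple of this number.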