scverse / scirpy

A scanpy extension to analyse single-cell TCR and BCR data.
https://scirpy.scverse.org/en/latest/
BSD 3-Clause "New" or "Revised" License

Memory usage ir_neighbors #217

Closed vladie0 closed 3 years ago

vladie0 commented 3 years ago

Dear authors,

Currently, the ir_neighbors algorithm consumes over 100 GB of memory on data from 55k cells, causing out-of-memory failures on the server. Is there any way to limit memory consumption? Is this normal behaviour, and which parameters can I adapt to control memory usage?

Kind Regards,

grst commented 3 years ago

Hi @vladie0,

what metric and what options do you use for ir_neighbors? For only 55k cells this seems like a lot!

There are several points to be considered here:

Best, Gregor

vladie0 commented 3 years ago

Dear Gregor,

Thank you for your swift reply. I've reduced the number of edges to 2, but I still have the same issue. The output is as you can see below: it gets stuck at this stage, memory usage ramps up to 128 GB, and then the system fails.

(screenshot: ir_neighbors progress output stuck while memory climbs to 128 GB)

Are there other parameters that could be of importance, for example metric, n_jobs, sequence, ...?

Note, I've also merged the TCR data with adata prior to the ir_neighbors calculation. I've tried the following options without success:

ir.pp.ir_neighbors(adata)

ir.pp.ir_neighbors(adata, receptor_arms="all", dual_ir="primary_only")

ir.pp.ir_neighbors(
    adata,
    metric="alignment",
    n_jobs=1,
    sequence="aa",
    cutoff=2,
    receptor_arms="any",
    dual_ir="primary_only",
)

ir.pp.ir_neighbors(
    adata,
    metric="levenshtein",
    n_jobs=8,
    sequence="aa",
    cutoff=10,
    receptor_arms="any",
    dual_ir="primary_only",
)

vladie0 commented 3 years ago

Even with metric="identity", the same memory issue persists.

vladie0 commented 3 years ago

This issue keeps occurring even when the number of cells is downsampled to 10,000.

grst commented 3 years ago

So for this, I would indeed expect memory issues:

ir.pp.ir_neighbors(
adata,
metric="levenshtein",
n_jobs=8,
sequence="aa",
cutoff=10,
receptor_arms="any",
dual_ir="primary_only",
)

A Levenshtein distance cutoff of 10 will connect almost all cells. A threshold of 1 or 2 makes more sense here; a cutoff of 10 is better suited to the alignment metric.
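To illustrate why (a minimal, generic edit-distance sketch, not scirpy's implementation): CDR3 amino-acid sequences are typically only ~12-20 residues long, so a cutoff of 10 tolerates changing most of the sequence.

def levenshtein(a: str, b: str) -> int:
    # Plain dynamic-programming edit distance (insertions, deletions, substitutions).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]

# Two example CDR3 sequences (taken from output later in this thread), visibly unrelated:
print(levenshtein("CASSSPGTANYAEQFF", "CAWSLGQQNTLYF"))  # 10, i.e. still within cutoff=10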

For the others, in particular ir.pp.ir_neighbors(adata), this should not happen.

Just to be sure, when testing the identity metric, and the downsampled data, did you restart the Python kernel to make sure you start with a clean memory?


Let's also do some checks on your anndata. Can you please execute the following commands? They will count the most frequent CDR3 sequences:

adata.obs.groupby("IR_VJ_1_cdr3").size().sort_values(ascending=False).head()
adata.obs.groupby("IR_VDJ_1_cdr3").size().sort_values(ascending=False).head()

vladie0 commented 3 years ago

Yes, memory was cleared each time. I ran the commands you requested; the result came back instantly:

(screenshot: output of the groupby commands, showing three CDR3 sequences with more than 10,000 cells each)

grst commented 3 years ago

Ah wow, that explains it. You essentially have only three different clonotypes in your data, with 10k+ cells each. Is that what you expect from your data?

This will lead to three "dense" blocks of 12931 x 12931 + 13784 x 13784 + ... in the sparse matrix, which is obviously a lot of memory.
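As a rough back-of-the-envelope check (not scirpy internals, just counting entries and assuming ~12 bytes per stored value in a scipy CSR matrix):

block_sizes = [12931, 13784]  # the two largest blocks mentioned above; a third follows
entries = sum(n * n for n in block_sizes)
print(f"{entries:,} stored pairs")      # ~357 million entries for just these two blocks
print(f"~{entries * 12 / 1e9:.1f} GB")  # ~4.3 GB at ~12 bytes/entry, before any intermediate dense arrays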

This will be fixed by #191, but it will still take some time to implement.

On the other hand, with only three different clonotypes, the clonotype network does not make a lot of sense anyway.

EDIT: as it's only showing the top 5, there might still be some more with low counts. In that case it would still make sense to build the network.

A "workaround" would be to downsample only the cells that belong to that abundant clonotype -- or just live with the memory consumption if your system can handle it.

vladie0 commented 3 years ago

OK, thank you. There are 267 clonotypes, but indeed the rest have much lower counts. I'm quite new to TCR analysis, so I'm still trying to figure things out. I will certainly look out for the next updates; thank you for your help and for developing an awesome package.

Would you suggest removing the top 3 clonotypes and building a network on this data subset, or would you consider it unnecessary to define clonotypes with scirpy at all and just carry on with the subsequent analysis?

grst commented 3 years ago

First I would wonder: is there a biological explanation for the clonotypes being so skewed, or is this a technical artifact? I have never seen such an unequal distribution of clonotypes before; they usually follow a power-law distribution.
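A quick way to look at the shape of the distribution (plain pandas, keyed on the primary VDJ CDR3 as a rough proxy for clonotype, since clonotype calling is what is failing here):

sizes = adata.obs["IR_VDJ_1_cdr3"].value_counts()
print(sizes.head(10))                    # the hyperexpanded sequences
print((sizes == 1).sum(), "singletons")  # how many sequences occur only once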

I think it is still interesting to look at a clonotype network. As a temporary workaround, you could remove all cells with these abundant clonotypes, except a few representatives. Or if you have the memory, just perform the analysis as usual.

vladie0 commented 3 years ago

Yes, these are T-cell lines harvested from a patient with a certain disease. I suppose this explains the skewed distribution.

I will try to implement the workaround. Thank you for your support!

vladie0 commented 3 years ago

One more question: I've received a 1 TB memory machine to perform the calculation on the full DB. What exact parameters would you suggest for ir_neighbors?

grst commented 3 years ago

I've received a 1 TB memory machine to perform the calculation on the full DB,

fingers crossed that works out :crossed_fingers:

what exact parameters would you suggest to use to perform ir_neighbors?

If you want to construct the network based on "similar clonotypes that likely recognize the same antigen", I would go for the alignment metric and a cutoff of somewhere between 10 and 15 (we don't have empirical evidence to back up an ideal threshold).

Since you have so few clonotypes, I would probably use dual_ir = "any" and receptor_arms = "all" (possibly even receptor_arms = "any")

It depends a bit what your question is. The options are also explained here.
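Putting those suggestions together, a call could look roughly like this (the values are just the ones discussed above, not a validated recommendation):

ir.pp.ir_neighbors(
    adata,
    metric="alignment",
    sequence="aa",
    cutoff=12,               # anywhere in the 10-15 range discussed above
    receptor_arms="all",     # or "any"
    dual_ir="any",
    n_jobs=8,                # adjust to your machine
)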

nicoac commented 3 years ago

I'm running into a similar issue with a dataset generated from T cells with a fixed beta chain. It seems as if the computation gets stuck; this is what it looks like: 0%| | 0/669 [00:00<?, ?it/s]

Looking forward to the fix! I'm really enjoying using Scirpy.

grst commented 3 years ago

Hi @nicoac,

to verify that it is really the same issue, could you please also report the result of the following two commands?

adata.obs.groupby("IR_VJ_1_cdr3").size().sort_values(ascending=False).head()
adata.obs.groupby("IR_VDJ_1_cdr3").size().sort_values(ascending=False).head()

nicoac commented 3 years ago

Thanks for the response. @grst Here is the result of that code:

IR_VDJ_1_cdr3
CASSSPGTANYAEQFF    25647
nan                  9079
None                  242
CTCSAEGDRQAPLF          6
CAWSLGQQNTLYF           6
dtype: int64

As you can see, it is quite a large amount of one specific chain. Also, just to be clear, this is the result of combining six different AnnData objects: I have three day 0 and three day 8 samples that I am tracking longitudinally to see TCR alpha chain pairings over time.
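A generic sketch of combining six samples this way, with made-up object names, might look like:

import anndata as ad

adatas = {
    "d0_rep1": adata_d0_1, "d0_rep2": adata_d0_2, "d0_rep3": adata_d0_3,
    "d8_rep1": adata_d8_1, "d8_rep2": adata_d8_2, "d8_rep3": adata_d8_3,
}
adata = ad.concat(adatas, label="sample", index_unique="-")
adata.obs["timepoint"] = adata.obs["sample"].str.split("_").str[0]  # "d0" vs "d8"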

grst commented 3 years ago

Thanks for checking, seems indeed the same issue. I'll try to fix it after the holidays.

nicoac commented 3 years ago

Hi Gregor. Just following up with you on some updates. I was able to get the numbers to crunch properly all the way through ir_neighbors and to the end of the analysis as per the tutorial. However, after coming back to it a few days later, I am now running into an issue where it seems as if the VDJ coord-dictionary completely resets and wipes all of my variables.

Any thoughts?

grst commented 3 years ago

However, after coming back to it a few days later, I am now running into an issue where it seems as if the VDJ coord-dictionary completely resets and wipes all of my variables.

I'm not sure what you mean by the coord-dictionary and what exactly is getting wiped. Can you provide an example?

nicoac commented 3 years ago

(screenshot of the notebook after a fresh run of the code)

Here is what is happening after a fresh run of the code

grst commented 3 years ago

What probably happens is that the Jupyter kernel gets killed because it runs out of memory (it seems pp.ir_neighbors never finished). As the kernel restarts, all your objects are lost.

Unfortunately, there's currently not a lot you can do except for the workarounds discussed above (downsampling cells from the hyperexpanded clonotypes, or using a machine with more memory).

I'm working on a fix in #230, but it turns out to be more difficult than I had expected.

vladie0 commented 3 years ago

Hi grst,

I have a new batch of similar data and was wondering what your progress is on optimizing ir_neighbors for sparse matrices. The reason I'm asking is that I have a new, similar project and was wondering whether I should request a large-memory server again.

grst commented 3 years ago

was wondering what your progress is on optimizing ir_neighbors for sparse matrices

I don't want to promise anything, but if all goes well I should have something ready to test in 2-3 weeks.

grst commented 3 years ago

Hi @vladie0,

just wanted to let you know that the 2-3 weeks won't work out. There were some setbacks and I now need to commit to other projects for a while.

Sorry about that, Gregor

grst commented 3 years ago

Hi @vladie0 and @nicoac,

#230 is finally merged. Could you give it a try and check whether everything works as expected now?

You can install the development version using

pip install git+https://github.com/icbi-lab/scirpy.git@master

Note that the procedure for calling clonotypes has slightly changed, in particular:

The documentation of the development version is at https://icbi-lab.github.io/scirpy/develop and this tutorial section describes the updated clonotype definition.

nicoac commented 3 years ago

I will report back with results. Thank you so much!

grst commented 3 years ago

Hi @nicoac,

glad to hear that it has fixed the issue for you! I'm not sure I fully understood what you want to achieve... Do you want to define clonotypes only based on the v-genes and ignore sequence information completely?

nicoac commented 3 years ago

Hi Gregor. I'm realizing that that would be a bit of a silly way to do it. I'm going to delete the question. Thank you!