Closed: vladie0 closed this issue 3 years ago.
Hi @vladie0,
what metric and what options do you use for `ir_neighbors`? For only 55k cells this seems like a lot!
There are several points to be considered here: every edge of the cell x cell neighborhood graph has an entry in the sparse matrix. Lowering the `cutoff` will reduce the number of edges, and therefore the memory consumption.

Best, Gregor
Dear Gregor,
Thank you for your swift reply. I've reduced the cutoff to 2, but still have the same issue. The output is as you can see below; it gets stuck at this stage, and memory usage ramps up to 128 GB until the system fails.
Are there other parameters that could be important, for example `metric`, `n_jobs`, `sequence`?
Note, I've also merged the TCR and adata data prior to the ir_neighbors calculations. I've tried the following options without success:
ir.pp.ir_neighbors(adata)
ir.pp.ir_neighbors(adata, receptor_arms="all", dual_ir="primary_only")
ir.pp.ir_neighbors(
    adata,
    metric="alignment",
    n_jobs=1,
    sequence="aa",
    cutoff=2,
    receptor_arms="any",
    dual_ir="primary_only",
)
ir.pp.ir_neighbors(
    adata,
    metric="levenshtein",
    n_jobs=8,
    sequence="aa",
    cutoff=10,
    receptor_arms="any",
    dual_ir="primary_only",
)
Even with `metric="identity"`, the same memory issue persists.
The issue keeps occurring even when the data is downsampled to 10,000 cells.
So for this, I would indeed expect memory issues:
ir.pp.ir_neighbors(
adata,
metric="levenshtein",
n_jobs=8,
sequence="aa",
cutoff=10,
receptor_arms="any",
dual_ir="primary_only",
)
A Levenshtein distance of 10 will connect almost all cells. A threshold of 1 or 2 makes more sense here; 10 is better suited for the alignment metric.
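To see why a Levenshtein cutoff of 10 is very permissive for CDR3 amino-acid sequences, here is a minimal sketch with a plain-Python edit-distance function (the sequences are made up for illustration):

```python
def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming Levenshtein (edit) distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# One point mutation -> distance 1; a cutoff of 1 or 2 catches such neighbors:
print(levenshtein("CASSLAPGATNEKLFF", "CASSLAPGATNEKLYF"))  # 1

# Two quite different (made-up) CDR3s still end up within distance 10,
# so a cutoff of 10 connects them and the graph becomes almost dense:
d = levenshtein("CASSLAPGATNEKLFF", "CAWSVGQGNTEAFF")
print(d <= 10)  # True
```

With typical CDR3 lengths of around 12-18 residues, a budget of 10 edits allows replacing most of the sequence, which is why the resulting neighborhood graph has so many edges.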
For the others, in particular `ir.pp.ir_neighbors(adata)`, this should not happen.
Just to be sure: when testing the `identity` metric and the downsampled data, did you restart the Python kernel to make sure you started with clean memory?
Let's also run some checks on your AnnData. Can you please execute the following commands? They will count the most frequent CDR3 sequences:
adata.obs.groupby("IR_VJ_1_cdr3").size().sort_values(ascending=False).head()
adata.obs.groupby("IR_VDJ_1_cdr3").size().sort_values(ascending=False).head()
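For reference, the same frequency check can be sketched without pandas, using only the standard library (the CDR3 strings below are toy stand-ins for one column of `adata.obs`):

```python
from collections import Counter

# Toy stand-in for adata.obs["IR_VJ_1_cdr3"]: one CDR3 string per cell.
cdr3_per_cell = (["CASSLAPGATNEKLFF"] * 5
                 + ["CAWSVGQGNTEAFF"] * 3
                 + ["CASSFGREYF"])

# Equivalent of .groupby(...).size().sort_values(ascending=False).head():
top = Counter(cdr3_per_cell).most_common(5)
print(top)  # [('CASSLAPGATNEKLFF', 5), ('CAWSVGQGNTEAFF', 3), ('CASSFGREYF', 1)]
```

If the top entries carry a large fraction of all cells, the clonotype neighborhood graph will contain correspondingly large dense blocks.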
Yes, memory was cleared each time. I ran the commands you requested; the result was calculated instantly.
Ah wow, that explains it. You essentially have only three different clonotypes in your data, with 10k+ cells each. Is that what you expect from your data?
This will lead to three "dense" blocks of 12931 x 12931 + 13784 x 13784 + ... entries in the sparse matrix, which is obviously a lot of memory.
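Back-of-the-envelope arithmetic for the two blocks whose sizes are given above (assuming one 8-byte float per stored entry; sparse formats additionally store indices, and intermediate results during the distance computation can be far larger, which is consistent with the much higher usage reported in this thread):

```python
# Two clonotypes with 12931 and 13784 cells each produce fully connected
# (dense) blocks in the cell x cell distance matrix:
entries = 12931**2 + 13784**2
bytes_per_entry = 8  # one float64 value; index overhead of CSR/COO comes on top
print(f"{entries:,} entries, ~{entries * bytes_per_entry / 1e9:.1f} GB")
# -> 357,209,417 entries, ~2.9 GB
```

The third block (size not shown above) and the working memory needed while computing pairwise distances add to this, so the stored matrix itself is only part of the footprint.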
This will be fixed by #191, but it will still take me some time to implement.
On the other hand, with only three different clonotypes, the clonotype network does not make a lot of sense anyway.
EDIT: as the command only shows the top 5, there might still be more clonotypes with low counts. In that case it would still make sense to build the network.
A "workaround" would be to downsample only the cells that belong to that abundant clonotype -- or just live with the memory consumption if your system can handle it.
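That downsampling workaround can be sketched in plain Python (hypothetical cell IDs and clonotype labels; with a real AnnData you would build a boolean mask over `adata.obs` and subset with it):

```python
import random
from collections import defaultdict

random.seed(0)  # reproducible sampling

# Hypothetical (cell_id, clonotype) pairs: one huge clonotype, 50 singletons.
cells = [(f"cell{i}", "ct_big") for i in range(1000)] \
      + [(f"cell{1000 + i}", f"ct_{i}") for i in range(50)]

# Keep at most `max_per_clonotype` representative cells per clonotype:
max_per_clonotype = 20
by_ct = defaultdict(list)
for cell_id, ct in cells:
    by_ct[ct].append(cell_id)

keep = []
for ct, members in by_ct.items():
    if len(members) > max_per_clonotype:
        members = random.sample(members, max_per_clonotype)
    keep.extend(members)

print(len(keep))  # 70: 20 representatives of ct_big + 50 singletons
```

The dense block for the abundant clonotype then shrinks from 1000 x 1000 to 20 x 20 entries while every rare clonotype is kept intact.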
Ok thank you, there are 267 clonotypes, but indeed with a lower presence. I'm quite new to TCR analysis, so I'm still trying to figure things out. I will certainly look out for the next updates. Thank you for your help and for developing an awesome package.
Would you suggest removing the top 3 clonotypes and building a network on this data subset, or would you still consider it unnecessary to define clonotypes using scirpy and carry on with the subsequent analysis?
First I would ask: is there a biological explanation for the clonotype distribution being so skewed, or is it a technical artifact? I have never seen such an unequal distribution of clonotypes before; they usually follow a power-law distribution.
I think it is still interesting to look at a clonotype network. As a temporary workaround, you could remove all cells with these abundant clonotypes, except a few representatives. Or if you have the memory, just perform the analysis as usual.
Yes, these are T-cell lines harvested from a patient with a certain disease. I suppose this explains the skewed distribution.
I will try to implement the workaround. Thank you for your support!
One more question: I've received a machine with 1 TB of memory to perform the calculation on the full dataset. What exact parameters would you suggest for ir_neighbors?
I've received a machine with 1 TB of memory to perform the calculation on the full dataset

Fingers crossed that works out :crossed_fingers:

What exact parameters would you suggest for ir_neighbors?
If you want to construct the network based on "similar clonotypes that likely recognize the same antigen", I would go for the `alignment` metric and a cutoff somewhere between 10 and 15 (we don't have empirical evidence to back up an ideal threshold). Since you have so few clonotypes, I would probably use `dual_ir = "any"` and `receptor_arms = "all"` (possibly even `receptor_arms = "any"`).
It depends a bit on what your question is. The options are also explained here.
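Putting that advice together as a parameter set (the cutoff of 15 is just one point in the suggested 10-15 range, not a validated optimum; the actual call is commented out because it requires scirpy and a prepared AnnData object):

```python
# Suggested parameters for ir.pp.ir_neighbors, following the discussion above.
params = dict(
    metric="alignment",
    sequence="aa",
    cutoff=15,             # somewhere in the suggested 10-15 range
    receptor_arms="all",   # or "any", depending on your question
    dual_ir="any",
)
# ir.pp.ir_neighbors(adata, **params)  # requires scirpy and your AnnData
print(params["metric"], params["cutoff"])  # alignment 15
```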
I'm running into a similar issue with a dataset generated from T cells with a fixed beta chain. The computation seems to get stuck, and this is what it looks like: `0%| | 0/669 [00:00<?, ?it/s]`
Looking forward to the fix! I'm really enjoying using Scirpy.
Hi @nicoac,
to verify that it is really the same issue, could you please also report the result of the following two commands?
adata.obs.groupby("IR_VJ_1_cdr3").size().sort_values(ascending=False).head()
adata.obs.groupby("IR_VDJ_1_cdr3").size().sort_values(ascending=False).head()
Thanks for the response. @grst Here is the result of that code:
IR_VDJ_1_cdr3
CASSSPGTANYAEQFF    25647
nan                  9079
None                  242
CTCSAEGDRQAPLF          6
CAWSLGQQNTLYF           6
dtype: int64
As you can see, there is a very large number of cells with one specific chain. Also, just to be clear: this is the result of combining six different AnnData objects. I have three day-0 and three day-8 samples that I am tracking longitudinally to see how TCR alpha chain pairings change over time.
Thanks for checking, seems indeed the same issue. I'll try to fix it after the holidays.
Hi Gregor. Just following up with some updates. I was able to crunch the numbers all the way through ir_neighbors and to the end of the analysis as per the tutorial. However, after coming back to it a few days later, I am now running into an issue where it seems as if the VDJ coord-dictionary completely resets and wipes all of my variables.
Any thoughts?
However, after coming back to it a few days later, I am now running into an issue where it seems as if the VDJ coord-dictionary completely resets and wipes all of my variables.
I'm not sure what you mean with the coord-dictionary and what exactly is getting wiped. Can you provide an example?
Here is what is happening after a fresh run of the code
What probably happens is that the Jupyter kernel gets killed because it runs out of memory (it seems `pp.ir_neighbors` never finished). As the Jupyter kernel restarts, all your objects are lost.
Unfortunately, there's currently not a lot you can do except for …
I'm working on a fix in #230, but it turns out to be more difficult than I had expected.
Hi @grst,
I have a new batch of similar data and was wondering what your progress is on optimizing ir_neighbors for sparse matrices. The reason I'm asking is that I have a new, similar project and was wondering whether I should request a large-memory server again?
was wondering what your progress is on optimizing ir_neighbors for sparse matrices
I don't want to promise anything, but if all goes well I should have something ready to test in 2-3 weeks.
Hi @vladie0,
just wanted to let you know that the 2-3 weeks won't work out. There were some setbacks and I now need to commit to other projects for a while.
Sorry about that, Gregor
Hi @vladie0 and @nicoac,
You can install the development version using
pip install git+https://github.com/icbi-lab/scirpy.git@master
Note that the procedure for calling clonotypes has changed slightly, in particular:

`pp.ir_neighbors` has been replaced by `pp.ir_dist`
`tl.define_clonotypes` now takes the `dual_ir` and `receptor_arms` arguments (they were previously provided to `ir_neighbors`)

The documentation of the development version is at https://icbi-lab.github.io/scirpy/develop and this tutorial section describes the updated clonotype definition.
I will report back with results. Thank you so much!
Hi @nicoac,
glad to hear that it has fixed the issue for you! I'm not sure I fully understood what you want to achieve... Do you want to define clonotypes only based on the V genes and ignore sequence information completely?
Hi Gregor. I'm realizing that that would be a bit of a silly way to do it. I'm going to delete the question. Thank you!
Dear authors,
Currently, the ir_neighbors algorithm consumes over 100 GB of memory on data from 55k cells, causing memory failures on the server. Is there any way to limit memory consumption? Is this normal behavior, and which parameters can I adapt to control memory usage?
Kind Regards,