theislab / paga

Mapping out the coarse-grained connectivity structures of complex manifolds.
BSD 3-Clause "New" or "Revised" License

PAGA of the 1.3mln 10x dataset #1

Closed dkobak closed 6 years ago

dkobak commented 6 years ago

I have seen two PAGA plots for the 1.3mln 10x dataset: Figure S12 in https://rawgit.com/falexwolf/paga_paper/master/paga.pdf and slide 17 in http://falexwolf.de/talks/180313_HCA_RedBox_Scanpy.pdf. There are some similarities, but overall they look drastically different.

For example, in the latter plot, cluster 6 (magenta) seems to be strongly connected (thick black lines) to clusters 3, 17, 24, but does not have any connections to those clusters on the former plot.

Are these two plots using the same cluster numbering? If so, which one should one believe, and what explains such a strong difference?

PS. Based on my initial exploration of this dataset, my current understanding of the "large scale" geometry in this dataset does not really fit either of these plots. Happy to discuss this here or elsewhere, but this would require posting some figures.

falexwolf commented 6 years ago

Hi Dmitry,

Figure S12 in the preprint is the result of the latest PAGA version. The one in the talk is the result of the initial version and is about 10 months old.

Also, the clusterings don't agree; Figure S12 was just run for the sake of demonstrating scalability. The one in the talk is related to the early stage of an actual analysis of the data with annotations. It was done by somebody else. I'd have to see how far this went and whether we could release the annotations. I think we shared them with the Hemberg lab already, but I need to check.

> PS. Based on my initial exploration of this dataset, my current understanding of the "large scale" geometry in this dataset does not really fit either of these plots. Happy to discuss this here or elsewhere, but this would require posting some figures.

The initial PAGA version contained bugs, so I wouldn't trust it very much either. The current PAGA version should give you a decent view of the global topology of the data. But I've never done any checks or comparisons. What we learned from a collaborator on the old PAGA analysis was that some of the edges made sense in terms of differentiation trajectories... but all of this was very superficial and I wouldn't assign much meaning to it.

I'm very interested in further discussing the geometry and topology of the data, and in shedding light on how well different methods are able to extract it.

I already sent you the clustering shown in Figure S12 and here, right? If not, here's the link.

dkobak commented 6 years ago

1) OK, thanks! I will ignore the figure in the talk then.

2) Yes, you did share the clustering! Thanks a lot. That Dropbox file seems to be missing one cell (the 1st one, i.e. it starts from the 2nd cell), but that's a minor complaint.

3) If you have labels/annotations for these 39 clusters and could share them, that would be great!

4) I will share some of my analysis (which, as I said, disagrees with Figure S12) tomorrow. Let me know if you prefer to discuss this via email; otherwise I'll post here.

5) You mentioned in the email that PAGA in Figure S12 uses UMAP cluster medians as the point locations. I assume that's not the standard approach for PAGA, right? What is the standard way you arrange the points in PAGA? I've looked through the paper but could not find this explained; I must have overlooked it!

falexwolf commented 6 years ago
  1. OK, maybe for some reason exactly 1 cell was filtered.
  2. I'll talk to people...
  3. Happy to discuss here... as you prefer. 🙂 Looking forward to seeing this!
  4. You'd just use some standard graph drawing, like Fruchterman-Reingold. See the default for the layout parameter here. It's obviously easy to draw this as the graph is so small.
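Concretely, a minimal sketch of this path in Scanpy (assuming neighbors and a 'louvain' clustering are already computed on adata; 'fr' is Fruchterman-Reingold):

import scanpy.api as sc
# build the coarse-grained graph abstraction from the clustering
sc.tl.paga(adata, groups='louvain')
# draw the (small) abstracted graph with a standard force-directed layout
sc.pl.paga(adata, layout='fr')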
dkobak commented 6 years ago

Moin!

So here it goes. For context, I've been working on ways to make t-SNE reflect the global structure more faithfully, and we are hopefully going to release this as a preprint soon (it's not a new algorithm, but more like "tricks of the trade" to make t-SNE work better). One thing that I want to add to this preprint is the 1.3mln dataset; that's why I am looking at it right now.

Anyway, here are three ways to look at the dataset. I do some standard-ish preprocessing: gene selection (2000 genes), library size normalization, log(x+1) transform, PCA down to 50 PCs. That's my data from now on.
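In code, that preprocessing looks roughly as follows (a sketch, not my exact script; the variance-based gene-selection criterion and median-library-size scaling are assumptions). The returned matrix Z is the 50-PC representation I refer to below.

import numpy as np

def preprocess(counts, n_genes=2000, n_pcs=50):
    # counts: cells x genes array of UMI counts (dense here for simplicity)
    # pick the n_genes most variable genes (plain variance criterion -- an assumption)
    X = counts[:, np.argsort(counts.var(axis=0))[::-1][:n_genes]]
    # library size normalization: rescale each cell to the median library size
    libsize = X.sum(axis=1, keepdims=True)
    X = X / libsize * np.median(libsize)
    # log(x + 1) transform
    X = np.log1p(X)
    # PCA down to n_pcs via SVD of the centered matrix
    Xc = X - X.mean(axis=0)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    return (U * s)[:, :n_pcs]

Z = preprocess(counts)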

First, let's look at PC1/2:

[figure: pca]

One thing that is really prominent here is that clusters 17 & 36 are (together) separated from everything else. 21 & 24 (together) also seem to be separate.

Second, let's compute class means and do MDS on them. I am doing 1000 random restarts to make sure it's a good MDS solution.

[figure: mds]

36 & 17 are again together and away from everything else; 21 and 24 also. Now 27 also looks like a far-removed cluster. The rest is harder to interpret at this moment.
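As a sketch, this step could look like the following (scikit-learn's SMACOF, with n_init playing the role of the random restarts; labels are assumed to be the cluster ids from the shared file, Z the PC representation from above):

import numpy as np
from sklearn.manifold import MDS

def mds_on_class_means(Z, labels, n_init=1000):
    # one mean per cluster in the 50-dimensional PC space
    classes = np.unique(labels)
    means = np.vstack([Z[labels == c].mean(axis=0) for c in classes])
    # metric MDS with many random restarts; the lowest-stress run is kept
    return MDS(n_components=2, n_init=n_init).fit_transform(means)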

Third, I randomly select 25k points and do t-SNE with very large perplexity, in this case perplexity=1000. This is not a standard way to apply t-SNE, but it helps to capture the global geometry.

[figure: tsne-perpl1000-25k]

A whole bunch of clusters seem to be rather isolated, among them again 17 & 36, 21 & 24 & 27, and some others.
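For reference, a sketch of this large-perplexity run (scikit-learn's t-SNE here for self-containedness; a faster implementation is advisable at this scale):

import numpy as np
from sklearn.manifold import TSNE

# Z is the cells x 50 PC representation from the preprocessing above
np.random.seed(42)
sub = np.random.choice(Z.shape[0], 25000, replace=False)
# very large perplexity -> each point attends to ~1000 neighbors,
# emphasizing global over local structure
Ztsne = TSNE(perplexity=1000, init='pca').fit_transform(Z[sub, :])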

It seems that there is some agreement between these three ways of looking at the data. But comparing it with PAGA from Figure S12, I don't see many similarities...


Actually, after writing all of the above, I decided to quickly run UMAP on the same random sample of 25k cells, and got this with default settings:

[figure: umap-25k-neighb15mindist01]

and this with n_neighbors=50, min_dist=0.5 (corresponding to larger perplexity values):

[figure: umap-25k-neighb50mindist05]

This matches everything from above pretty well and does not match Figure S12 at all. Based on the expression of Gad1/Gad2, the supercluster 3-12-7-23 is probably interneurons; it is quite well isolated from the rest, but is non-existent in Figure S12.
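For concreteness, the two UMAP runs above in code (umap-learn API; Z and sub as in the t-SNE sketch):

import umap

# default settings: n_neighbors=15, min_dist=0.1
Zumap_default = umap.UMAP().fit_transform(Z[sub, :])
# larger neighborhood, looser packing -- loosely analogous to a larger perplexity
Zumap_global = umap.UMAP(n_neighbors=50, min_dist=0.5).fit_transform(Z[sub, :])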

I'm not sure what's happening here. I need to run UMAP on the whole data to see if I get something similar to Figure S12. If yes, then it would mean that what one sees with 25k cells does not generalize to 1.3mln cells at all (which would be weird). If no, then we must be doing some very different preprocessing (which would also be weird). Puzzle!

dkobak commented 6 years ago

Update. I ran UMAP with default settings on the whole dataset. It took me 1h 17min on my desktop; it's probably so much faster than your 3h because I now see that you did not do PCA preprocessing and ran UMAP directly on the 1000 selected genes. This should make the approximate nearest-neighbor search quite a bit slower.

The resulting UMAP is qualitatively similar to the ones I posted above (e.g. the 3-12-7-23 supercluster is far away from everything else, etc.). So the whole issue is not about PAGA but about why we are getting such different results with UMAP. Is it due to different feature selection? Or due to PCA preprocessing? Puzzling. In my experience these things don't usually matter too much.

falexwolf commented 6 years ago

Hi Dmitry, sorry for sometimes being a bit slow with iterating through all the issues.

Yes, this looks like we simply ran different preprocessing steps. The link that I sent about the run times via email (here) also contains the script that I used to produce the figure (here). The preprocessing done there is documented here and computes neighbors on 50 PCs. I'm sure that the differences we see are entirely due to the different preprocessing.

Your runtime of 1 h is extremely impressive! Are you sure that you didn't run on 8 cores or something like that? The results I gave were achieved on 2 cores. The UMAP version that I was using also didn't have very good parallelization at the time.

dkobak commented 6 years ago

Yes, I've looked at the code of recipe_zheng17() but to be honest I don't see PCA in there. Is it invoked at some later stage in cluster.py? I'm pretty sure that UMAP on a 1300000x1000 matrix will be quite a bit slower than UMAP on a 1300000x50 matrix due to the initial NN search. I'm using the standard UMAP implementation with default parameters; I don't think it implements any multithreading (but I'm not sure).

In any case, in all my previous experience with RNA-seq data, differences in preprocessing can influence the resulting t-SNE/UMAP a little bit, but I would not expect drastic changes like we are seeing here. I'm selecting 2000 genes, doing library size normalization, log1p transform, and PCA down to 50 PCs. It looks like recipe_zheng17 selects 1k genes, does normalization and log1p, and, as I understood you, 50 PCs are selected later. The only difference is then the different "most variable" gene selection. Can this really yield entirely different global geometry? How strange would that be!

I guess I'd need to install scanpy and run your script to dig into this deeper.

falexwolf commented 6 years ago

Hey!

dkobak commented 6 years ago

OK, I see about the 50 PCs. Then the only meaningful difference that I see is that recipe_zheng17 selects the 1k most variable genes based on a particular procedure and I select 2k genes based on a different procedure. The rest of the preprocessing is identical. I've never seen this influence the results so strongly, but perhaps you are right that it might. I guess I should check what I get when I use the feature selection from Zheng 2017.

Re the evaluation measure from the PAGA Suppl Notes: I think using random-walk distances is a neat idea, but I would need to try it on some datasets that I know well to be able to say how well it works in practice... That's something I'd be interested in doing.

falexwolf commented 6 years ago

Yes, simply run pp.recipe_zheng17 on the 25K and see whether it influences things. I think a very good use case is also the Paul et al. (2015) dataset, as it's been used by a very large number of trajectory inference papers as a benchmark.

I'm really interested in seeing how you quantify how much your improved version of tSNE improves the global topology. I've thought quite a bit about this question, and the only thing I could come up with was the random-walk based idea in Suppl. Note 4 of the preprint. But I happily admit that this is just a very first step towards formalizing and solving the problem; I simply couldn't find anything on the general question in the literature, not even a systematic approach to formalizing it. Were you able to find something?

dkobak commented 6 years ago

Not really. To be honest, I've not been using any explicit quantification (sorry to disappoint...). I've been mostly working with datasets where the global structure is to a certain extent known a priori. For example, in the Allen Institute cortical datasets, there are interneurons, excitatory neurons, and non-neural cells, and we know that they have very different expression, so one would expect to see these three groups of cells meaningfully separated in a t-SNE plot. Simply running t-SNE in the default way usually splits each of these groups into multiple isolated subgroups, and all of them get randomly shuffled around and mixed up. So we mainly focused on how to run t-SNE such that this does not happen.

I will run recipe_zheng17 and try to pinpoint where the difference comes from (hopefully in the next couple of days).

dkobak commented 6 years ago

Update. I ran recipe_zheng17 and used the selected 1k genes in my code. I have also noticed that this recipe standardizes the genes after log1p-transforming, which is something I usually don't do. So I tried both UMAP and t-SNE with default settings on a random sample of 25k cells, with and without standardizing.

I see some substantial differences depending on the preprocessing, but overall the results look rather similar. In all cases the putative interneurons (classes 3-12-7-23; probably also 18 and 25) are separated from the rest, and especially class 3 clearly stands out on its own. Whereas on your UMAP plot, class 3 is tightly merged into other clusters, e.g. class 1.

So it does not seem that feature selection is the main culprit here. Something else is. I'm getting really curious.

(Just to be clear: I am not using scanpy for any of that apart from exporting the list of genes from recipe_zheng17...)

falexwolf commented 6 years ago

Why aren't you using Scanpy for the full preprocessing? That is, why don't you use the data matrix from pp.recipe_zheng17, or even the PCA (which is just scikit-learn) from Scanpy? Only using the list of genes still leaves significant room for differences, and I'm sure that we'll find the explanation there - the only other explanation would be that the core functionality of UMAP changed dramatically in the past 3 months, which I don't believe [even though I'm currently looking into 0.3].

Running on 25K, all of this is a matter of minutes, and we'll have something comparable to talk about... Don't you agree?

I could run the exact same script, and we'd have a transparent comparison that pinpoints differences both in the preprocessing and in the UMAP implementation... PAGA should behave along these lines...

falexwolf commented 6 years ago

Regarding

> Not really. To be honest, [...]

OK, I understand... It's a tough question. But the comparisons you mention are already interesting. Still, it could be that reviewers want more than a visual comparison for your paper. But I don't know; it depends a lot on how you set all of this up.

What are your thoughts on UMAP?

dkobak commented 6 years ago

Fair enough.

import scanpy.api as sc
import numpy as np

adata = sc.read_10x_h5('1M_neurons_filtered_gene_bc_matrices_h5.h5')
sc.pp.recipe_zheng17(adata)

# PCA down to 50 PCs via SVD of the centered matrix
X = np.copy(adata.X)
X = X - X.mean(axis=0)
U, s, V = np.linalg.svd(X, full_matrices=False)
Z = np.dot(U, np.diag(s))[:, :50]

np.random.seed(42)
ind25k = np.random.choice(Z.shape[0], 25000, replace=False)

import umap
Zumap = umap.UMAP().fit_transform(Z[ind25k, :])
mlnplot(Zumap, ind25k)

This uses my function mlnplot, which plots the data using your cluster ids and scanpy's cluster colors. Can post it if you want, but for you it's probably easier to use scanpy.

[figure: umap-scanpy-25k]

Everything that I was describing above can be seen here too.

This uses UMAP version 0.2.3.

falexwolf commented 6 years ago

OK! So what happens in Scanpy internally would be the following:

import scanpy.api as sc
import numpy as np

adata = sc.read_10x_h5('1M_neurons_filtered_gene_bc_matrices_h5.h5')
sc.pp.recipe_zheng17(adata)
# compute the PCA on adata.X
sc.pp.pca(adata)
# take the PCA representation of the data
Z = adata.obsm['X_pca']

np.random.seed(42)
ind25k = np.random.choice(Z.shape[0], 25000, replace=False)

import umap
Zumap = umap.UMAP().fit_transform(Z[ind25k, :])
mlnplot(Zumap, ind25k)

I can run this tonight or tomorrow and we can continue tomorrow.

For the sake of fast run and loading times in these comparisons, I'd tend to use the subsampled data that 10x provides. After all, we don't want to benchmark runtimes but are simply interested in understanding the effects of the different steps in the pipeline. The file is here and is used, for instance, here.

PPS: Using Scanpy, I'd write the rest of the script

np.random.seed(42)
ind25k  = np.random.choice(Z.shape[0],  25000, replace=False)
import umap
Zumap = umap.UMAP().fit_transform(Z[ind25k,:])
mlnplot(Zumap, ind25k)

as follows

adata25k = sc.pp.subsample(adata, fraction=0.025, copy=True)
sc.pp.neighbors(adata25k)
sc.tl.umap(adata25k)
sc.pl.umap(adata25k, color='louvain')
dkobak commented 6 years ago

Okay! Looking forward to seeing what you get.

Regarding UMAP, I think it's possibly great. But I'd like to understand better how it works and in particular which differences to previous algorithms are more or less important. For example, it computes input affinities somewhat differently from t-SNE (different kernel, different symmetrizing) and uses a somewhat different output kernel (unless a=b=1), but if one ignores these differences (that seem to have only a minor effect on real data), then UMAP has the same loss function as largeVis (with gamma=1). But largeVis was published in some obscure journal and is barely known, whereas Leland is now doing a great job at developing a nice actively maintained Python package and at advertising it on Twitter :-) And of course he provided a whole new mathematical framework to motivate this loss function but I'll admit that I don't understand most of it.

So at this moment it's not very clear to me how much of UMAP's greatness is "foundational", how much is an actual improvement it yields over e.g. largeVis, and how much is "implementational".

Practically speaking, when I run it on the Allen data with default settings, the "large scale" arrangement does NOT come out exactly as I want it to come out. It seems that using some of the tricks that I'm using for t-SNE might make sense with UMAP too, but I did not have time to experiment with it enough yet.

falexwolf commented 6 years ago

Interesting background information! I've heard about largeVis but never saw the paper or code.

Regarding the cost function: even tSNE and UMAP have - up to a "false negative" term - the same cost function (see for instance Suppl Note 4 of the PAGA paper). So, in addition to the nice mathematical grounding - although again non-probabilistic - and the awesome implementation, I'd count all of what you consider further differences as important contributions.

Regarding running your script: I had an office full of meetings this morning and forgot that I need to give a talk later this afternoon; I still need to prepare it. So, I'm sorry, but I don't have the time to play around with the analysis today. Alternatively: why don't you just quickly run the steps using Scanpy? scikit-learn's randomized PCA (which Scanpy uses) should be a lot faster than your linalg.svd solution... also fewer lines of code. The clustering within Scanpy is also a lot faster than doing it separately, as the graph construction is shared across all tools (pseudotime, clusters, trajectories, visualizations...). Hence, also tl.umap is faster than using the UMAP package itself, as it only calls the embedding optimization and not the graph construction... However, Scanpy still misses the updates of UMAP 0.3.

dkobak commented 6 years ago

> Regarding the cost function: even tSNE and UMAP have - up to a "false negative" term - the same cost function (see for instance Suppl Note 4 of the PAGA paper).

Not quite. tSNE and UMAP have the same attraction term, but the repulsion term is completely different. Repulsion in tSNE comes from normalizing all output weights to sum to 1. LargeVis and UMAP use a different repulsion term. From your Suppl Note 4 it sounds a bit as if you are saying that tSNE does not have any repulsion term, which is not correct. This difference in the repulsion term is IMHO the main difference between tSNE and UMAP. All other differences (kernel choice etc.) are minor in comparison, I think.

dkobak commented 6 years ago

np.linalg.svd() took 65 seconds on this matrix on our server. This long I can wait :) For comparison, sc.pp.pca(adata) took 39 seconds.

So I ran this modification now:

import scanpy.api as sc
import numpy as np
adata = sc.read_10x_h5('1M_neurons_filtered_gene_bc_matrices_h5.h5')
sc.pp.recipe_zheng17(adata)
sc.pp.pca(adata)
Z = adata.obsm['X_pca']

np.random.seed(42)
ind25k  = np.random.choice(Z.shape[0],  25000, replace=False)

import umap
Zumap = umap.UMAP().fit_transform(Z[ind25k,:])
mlnplot(Zumap, ind25k)

The result is very close to what I had before:

[figure: umap-scanpy-25k-2]

I will now run UMAP from inside scanpy.

dkobak commented 6 years ago

And now for the final test!

import scanpy.api as sc
import numpy as np
adata = sc.read_10x_h5('1M_neurons_filtered_gene_bc_matrices_h5.h5')
sc.pp.recipe_zheng17(adata)
sc.pp.pca(adata)
adata25k = sc.pp.subsample(adata, fraction=25000/adata.X.shape[0], copy=True)
sc.pp.neighbors(adata25k)
sc.tl.umap(adata25k)

Zumap = adata25k.obsm['X_umap']
# recover the original indices of the subsampled cells by name matching (very slow!)
ind25k_scanpy = np.array([np.where(adata.obs_names==c)[0][0] for c in adata25k.obs_names])
mlnplot(Zumap, ind25k_scanpy)

BTW, is there any way to get the subsampling indices that subsample() used? If not, I'd suggest you save them into the returned AnnData object. The one line that finds ind25k_scanpy was running for 35 minutes (!) :-)

Anyway, here is the outcome:

[figure: umap-scanpy-25k-3]

It looks like your neighbors + umap use slightly different defaults from umap.UMAP() -- can that be? But in any case, this actually looks reasonably similar to what I've been getting all along. E.g. class 3, together with some other inhibitory clusters, is separated from the "bulk" of excitatory neurons (including class 1). Even using the 100% scanpy pipeline, I cannot reproduce your Figure S12, where class 3 tightly merges into class 1!

falexwolf commented 6 years ago

> From your Suppl Note 4 it sounds a bit as if you are saying that tSNE does not have any repulsion term, which is not correct. This difference in the repulsion term is IMHO the main difference between tSNE and UMAP. All other differences (kernel choice etc.) are minor in comparison, I think.

Yes, it could be that I'm missing a "repulsion term" in tSNE; I just took the cost function from the original paper, which I think doesn't have it. Essentially you are saying that equation (40) of the PAGA preprint is incorrect. I'm a bit puzzled because this was just copy and paste.

There is a way to get the indices, as mentioned in the docs of pp.subsample: simply pass adata.X instead of adata.
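In other words, something along these lines should work (my reading of the docs hint; the exact return signature is not verified here):

# when given a plain matrix instead of an AnnData, pp.subsample returns the
# subsampled matrix together with the chosen row indices
X25k, ind25k = sc.pp.subsample(adata.X, fraction=25000/adata.X.shape[0])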

neighbors + umap indeed uses different default settings (min_dist=0.5), which I empirically found to "look better" on several datasets.

Your result: this is really weird; I'm starting to believe that the cluster labels are inconsistent... I'll try to run this right now.

dkobak commented 6 years ago

> Yes, it could be that I'm missing a "repulsion term" in tSNE; I just took the cost function from the original paper, which I think doesn't have it. Essentially you are saying that equation (40) of the PAGA preprint is incorrect. I'm a bit puzzled because this was just copy and paste.

Equation (40) is correct!

The t-SNE loss function in the original paper is indeed written as \sum p_ij \log(p_ij / q_ij), but note that these q_ij are normalized output weights q_ij = w_ij / \sum_{kl} w_kl, where w_ij = 1 / (1 + ||y_i - y_j||^2). If one plugs this into the loss and does some algebra, the loss splits into two terms: an attractive term -\sum p_ij \log w_ij and a repulsive term \log \sum_{kl} w_kl.
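In display form (just restating the algebra above):

\mathcal{L}_{\mathrm{tSNE}} = \sum_{ij} p_{ij} \log \frac{p_{ij}}{q_{ij}}, \qquad q_{ij} = \frac{w_{ij}}{\sum_{kl} w_{kl}}, \qquad w_{ij} = \frac{1}{1 + \lVert y_i - y_j \rVert^2}

so, using \sum_{ij} p_{ij} = 1,

\mathcal{L}_{\mathrm{tSNE}} = \underbrace{\sum_{ij} p_{ij} \log p_{ij}}_{\text{const}} \; \underbrace{- \sum_{ij} p_{ij} \log w_{ij}}_{\text{attraction}} \; + \; \underbrace{\log \sum_{kl} w_{kl}}_{\text{repulsion}}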

In contrast to this, largeVis and UMAP do not normalize the w's by their total sum. Instead, they introduce a different repulsive term that has its own motivation (the given motivation is different in largeVis and UMAP papers, but the repulsive term is the same).
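For comparison, the largeVis/UMAP loss is (up to sampling details) the binary cross-entropy

\mathcal{L}_{\mathrm{UMAP}} = - \sum_{ij} \Big[ p_{ij} \log w_{ij} + \gamma \, (1 - p_{ij}) \log (1 - w_{ij}) \Big]

with \gamma = 1 in UMAP (largeVis exposes \gamma as a repulsion weight): the attractive term matches t-SNE's, but the repulsion acts pairwise through \log(1 - w_{ij}), with no global normalization.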

So if in your equation (44) the q's are supposed to be normalized, as in t-SNE, then this is not the loss function of UMAP.

I recommend the following two write-ups if you want to dig more into this:

I'm happy to discuss these things further because I'd like to understand it better myself (and to test my understanding while explaining it to other people).

PS. Regarding subsampling: what's the best way to run it if I want to get both the subsampled AnnData object adata25k and the indices ind25k_scanpy that were used to subsample it?

falexwolf commented 6 years ago

I just recomputed everything on my laptop for the 20K subsampled data, and this is the resulting image, which can be reproduced from this notebook (linked here). I do everything exactly as you did and load the clusters from the file that I sent you.

The result is consistent with Suppl. Fig 12 of the PAGA preprint. I'm a bit puzzled right now.

falexwolf commented 6 years ago

It seems to me like swapped cluster labels.

Thank you for all your material above. I'll go through it tomorrow. Need to stop now.

dkobak commented 6 years ago

Yes, it does very much look like the labels on my plots are shifted by 1 :-/ I guess this has been my mistake all along, and it's super embarrassing that I did not spot this proverbial off-by-one error earlier. I will confirm tomorrow morning when I come to the lab.

dkobak commented 6 years ago

OK, I can confirm that I had an i+1 in my mlnplot() function that got copied from the code I used for the Allen datasets, where cluster labels start from 1 and not from 0.

Truly embarrassing.

I'll close this issue now. Thanks a lot for your help; I certainly learned something new and familiarized myself with scanpy a little bit.

Happy to keep discussing tSNE/UMAP here or elsewhere if you want.

BTW, it would be great to meet at some point! Tübingen and Munich are close enough; maybe we can organize something in autumn.

falexwolf commented 6 years ago

Ha! :smile: Good that we found a simple solution! Not embarrassing, things like this happen... :wink:

Regarding indexing: did you notice the pandas way of solving the index-finding problem that took you 35 min above? It's in box [14] of this notebook.
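Presumably it boils down to something like this (my guess at the notebook's approach, not a verbatim copy of box [14]):

import pandas as pd
# build a hashed pandas Index over all cell names once; get_indexer then looks
# up each subsampled name via hashing instead of scanning with np.where
ind25k_scanpy = pd.Index(adata.obs_names).get_indexer(adata25k.obs_names)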

Regarding subsampling: you would like to have a subsampling function that gives you both the indices and the subsampled adata? After all, having the indices, you can just do adata_sub = adata[ind_25k]. But I wanted to update the function anyway so that it's also possible to pass n_observations instead of a fraction of observations. I could add a return_indices parameter without problem.

Regarding equation (44): both p and q are probabilities, hence normalized to [0, 1], hence this should agree with what you expect for tSNE. But as far as I understood the paper, this should also be the case for UMAP. Maybe I missed something. Leland already commented a little on the PAGA preprint; in the worst case, I could ask him... But maybe we can find this out ourselves. Essentially: why do you think that q is not normalized in the UMAP case? After all, Leland's paper ends with a cross entropy for two distributions (just as eq. (44), but using different notation), which already implies normalization, in my view.

dkobak commented 6 years ago

Starting from the end: the bigger picture is that what makes t-SNE hard to optimize is precisely the fact that the q's are normalized to sum to 1. This makes the loss function not amenable to SGD, and one has to resort to Barnes-Hut or FFT approximations for the repulsive forces. The big difference in largeVis (that was copied in UMAP) was to replace the \log \sum w_kl repulsive term with a -\sum \log(1 - w_ij) term that has the log inside the sum and not outside, so one can use SGD.

Based on this understanding, I am pretty sure that the q's are not normalized to sum to 1 in UMAP. You are right that they are normalized in the sense that each lies in [0, 1], but the total sum over all q's is not 1.

Unfortunately, Leland's arXiv preprint is impossible to understand because there are no implementation details there. He told me he made a more detailed write-up as a NIPS submission, but I haven't seen it; I bid to review it as a NIPS reviewer, but the paper was not assigned to me in the end. My understanding of the UMAP loss function as it is actually set up in the code relies on James Melville's very detailed write-ups that I linked above. James re-implemented UMAP in R from scratch, closely following Leland's code, and carefully described all the steps.

Re subsampling: ah, I did not realize that one can do adata_sub = adata[ind_25k]. I guess that's enough for all practical purposes. But now the function returns 2 things when you pass in X and only 1 thing when you pass in adata, so I think it would make more sense to always return 2 things (whenever a return is requested). Passing n_observations instead of a fraction would definitely be useful.

Re pandas indexing trick: I noticed it and was amazed! How come it's so much faster?

falexwolf commented 6 years ago

All your explanations on the cost function: thank you very much!! Got it! I will read more and might get back to you...

> But now the function returns 2 things when you pass in X and only 1 thing when you pass in adata, [...]

That's consistent behavior across all of the preprocessing functions, which take either annotated matrices (AnnDatas) or plain matrices (arrays, sparse matrices)... changing it for a single function would confuse users... usually, when you return an AnnData, you don't need to return anything else, as AnnData has all the annotation slots...

Pandas "trick": one of the biggest aims of the pandas indices is allow ultra-fast hashing of indices... so that's what it's built for. The numpy algorithm you used is a search algorithm not based on hashing... for instance, it doesn't use the fact that indices are unique... there will be more to it that I don't know anything about...

dkobak commented 6 years ago

> That's consistent behavior across all of the preprocessing functions [...]

OK, then don't change it. It was just me not being familiar enough with the scanpy environment.

Re pandas -- good to know!

Re cost functions -- sure. There are some aspects of it that I am still confused about myself. I've been trying to clarify one thing with Leland via email over the last few days...