scverse / scanpy

Single-cell analysis in Python. Scales to >1M cells.
https://scanpy.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Connectivity measure and spurious edges in AGA #96

Closed yueqiw closed 3 years ago

yueqiw commented 6 years ago

Hi,

I'm using AGA to build global trajectories on neuronal differentiation datasets. It works well on small subsets of data (only progenitors or only neurons), but produces spurious trajectories between clusters that cannot be explained (progenitors --> inhibitory neurons --> excitatory neurons, rather than progenitors --> excitatory neurons). I'm thinking that part of this may be due to noise/outliers in the dataset.

From the paper (Supplementary Note 3.2), it looks like the connectivity between two partitions is calculated as the minimum distance over all pairs of points, which is prone to outliers.

"Taking the minimum is independent of the specific shape of a partition but is prone to outliers: it is only a viable option as the distance measure d itself is highly robust being computed as an average over all random walks on the graph."

Are there alternative ways to calculate connectivities that are more robust to outliers? (e.g. other connectivity metrics or something like Endpoint Supervision in Slingshot (https://doi.org/10.1101/128843) to avoid connecting endpoints from different lineages.)
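To make the outlier concern concrete, here is a minimal sketch (not the AGA implementation). It assumes plain Euclidean distance as a stand-in for the random-walk-based distance d from the paper, and the toy coordinates and the median alternative are illustrative assumptions only.

```python
import math
from statistics import median

def pairwise_distances(a, b):
    """All Euclidean distances between points of cluster a and cluster b.

    Assumption: Euclidean distance replaces the paper's random-walk distance d.
    """
    return [math.dist(p, q) for p in a for q in b]

def min_linkage(a, b):
    """Connectivity proxy: smallest distance over all cross-cluster pairs."""
    return min(pairwise_distances(a, b))

def median_linkage(a, b):
    """A more outlier-tolerant proxy (hypothetical alternative, for comparison)."""
    return median(pairwise_distances(a, b))

# Two well-separated toy clusters.
cluster_a = [(0, 0), (0, 1), (1, 0), (1, 1)]
cluster_b = [(10, 0), (10, 1), (11, 0), (11, 1)]

# The same cluster A plus one stray cell that sits right next to cluster B.
cluster_a_outlier = cluster_a + [(9, 0)]

print("min   :", min_linkage(cluster_a, cluster_b),
      "->", min_linkage(cluster_a_outlier, cluster_b))
print("median:", median_linkage(cluster_a, cluster_b),
      "->", median_linkage(cluster_a_outlier, cluster_b))
```

A single stray cell collapses the minimum from 9 to 1, while the median barely moves.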

falexwolf commented 6 years ago

Hi!

Thanks for reaching out!

We have an option to compute connectivity based on minimum distance, right. The default choice, however, is based on edge-statistics (actual inter-edges between clusters vs. expected number of edges in random connections). Currently, I'm working on the revision of the algorithm. The option for minimum distance will disappear and everything will become much cleaner. I'm also trying to improve the statistical model for connectivity and provide a clearer option for its significance threshold.
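The edge-statistics idea (observed inter-cluster edges versus the number expected under random connections) can be sketched as follows. This is a simplified illustration, not the statistic used in AGA: the null model here is a configuration-model-style expectation, degree_i * degree_j / (2 * n_edges), and the function name, inputs, and toy graph are all assumptions.

```python
from collections import Counter
from itertools import combinations

def edge_statistics(edges, labels):
    """Compare observed inter-cluster edge counts with a random-wiring expectation.

    edges:  (u, v) pairs of an undirected graph, each edge listed once
    labels: dict mapping node -> cluster label
    Returns {(cluster_i, cluster_j): (observed, expected)} for sorted label pairs.
    Assumption: expected = degree_i * degree_j / (2 * n_edges), a simple null
    model chosen for illustration only.
    """
    edges = list(edges)
    observed = Counter()  # inter-cluster edge counts
    degree = Counter()    # summed node degree per cluster
    for u, v in edges:
        cu, cv = labels[u], labels[v]
        degree[cu] += 1
        degree[cv] += 1
        if cu != cv:
            observed[tuple(sorted((cu, cv)))] += 1
    two_m = 2 * len(edges)
    stats = {}
    for ci, cj in combinations(sorted(degree), 2):
        expected = degree[ci] * degree[cj] / two_m
        stats[(ci, cj)] = (observed[(ci, cj)], expected)
    return stats

# Toy graph: three triangles (clusters A, B, C); A-B share three edges,
# B-C share one edge, A-C share none.
labels = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B", 6: "C", 7: "C", 8: "C"}
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (6, 7), (7, 8), (6, 8),
         (2, 3), (1, 3), (2, 4), (5, 6)]

stats = edge_statistics(edges, labels)
for pair, (obs, exp) in stats.items():
    print(pair, "observed:", obs, "expected: %.2f" % exp, "ratio: %.2f" % (obs / exp))
```

Because the score depends on counts over many edges rather than on one closest pair of cells, a single misplaced cell changes it only slightly. A real implementation would additionally turn the comparison into a significance value and apply a threshold.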

Right now, the only relevant option in the whole AGA [given the single-cell graph is computed and clustered] is tree_based_confidence=True; if you set this to False, the significance value for edges to appear will be much lower and you'll get a much sparser abstracted graph. However, this graph is sometimes too sparse. If tree_based_confidence=True, as per default, this works fine on very connected datasets, but sometimes gives results that are too dense on disconnected datasets.

For now, you could simply try setting tree_based_confidence=False and see whether the result is satisfying. If it is not (probably because the graph is too sparse), it would be great if you could try the new AGA version in a couple of days.

Also, I'd be very happy to run the method on your data and look at specific issues. You can also approach me via email...

Cheers, Alex

yueqiw commented 6 years ago

Hi, just to confirm that I tried the new PAGA functions a while ago and the results look very good. (Sorry for the delayed response. I meant to respond to the thread much earlier but got busy with other things.)

Now I'm wondering about how to interpret the graph connectivities. An undirected graph does not imply whether two connected clusters are sequential (e.g. progenitors -> newborn neurons -> mature neurons) or on different branches but highly correlated (e.g. neuron subtype 1 vs. neuron subtype 2).

Do you think it's possible to use RNA velocity (http://velocyto.org/) to perform quantitative inference on the directionality of the edges? I have the velocity data but am not sure how to mathematically infer edge directions. Maybe I should open a new issue on this or approach you via email? Thanks!

falexwolf commented 6 years ago

Hi! Good to read! :smile:

The PAGA edges simply mean that clusters are topologically connected - in the single-cell graph, there is a significant number of inter-cluster-edges, above noise-level. They absolutely don't have an orientation.

Regarding velocyto: yes, it's possible to use it to orient the edges in PAGA. You can get that functionality following this; however, it will still take a while until this becomes really well-documented etc. The model behind this will also be subject to change, I guess...

Get Scanpy 1.1 for this.
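For readers wondering what "orienting an edge with velocities" could mean mathematically, here is one simple heuristic, given purely as an illustration. It is not the model used in Scanpy, and the function names, the 2D embedding coordinates, and the velocity vectors are all hypothetical: project every cell's velocity onto the unit vector pointing from the centroid of cluster A to the centroid of cluster B and average the projections.

```python
import math

def centroid(points):
    """Component-wise mean of a list of equal-length coordinate tuples."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(len(points[0])))

def edge_orientation_score(pos_a, vel_a, pos_b, vel_b):
    """Hypothetical heuristic: mean velocity projected onto the A -> B direction.

    Returns a value > 0 if cells on average move from A towards B,
    and < 0 if they move from B towards A.
    """
    ca, cb = centroid(pos_a), centroid(pos_b)
    direction = [b - a for a, b in zip(ca, cb)]
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    velocities = list(vel_a) + list(vel_b)
    projections = [sum(v_k * u_k for v_k, u_k in zip(v, unit)) for v in velocities]
    return sum(projections) / len(projections)

# Made-up embedding positions and velocities for two connected clusters.
progenitors = [(0.0, 0.0), (0.5, 0.2), (0.2, -0.3)]
neurons = [(5.0, 0.1), (5.5, -0.2), (4.8, 0.3)]
vel_progenitors = [(1.0, 0.1), (0.8, -0.1), (1.2, 0.0)]
vel_neurons = [(0.5, 0.0), (0.4, 0.1), (0.6, -0.1)]

forward = edge_orientation_score(progenitors, vel_progenitors, neurons, vel_neurons)
backward = edge_orientation_score(neurons, vel_neurons, progenitors, vel_progenitors)
print("progenitors -> neurons:", forward)
print("neurons -> progenitors:", backward)
```

A positive score for one ordering and a negative score for the reverse suggests an orientation for the undirected edge. A score near zero would be more consistent with two related branches than with a sequential transition.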

yueqiw commented 6 years ago

Thanks for uploading the notebook! This is exactly what I was looking for! I'll try it out and let you know how it works on my data, and I'll try to look into the model (and code) too. Great to see the new version coming out today. Thanks!

falexwolf commented 6 years ago

If you're planning to look into the code: There will be a new version of PAGA in Scanpy 1.2, which will feature two connectivity models... The code will be much clearer. We'll also see whether we can upload an extensive revision of the preprint - unfortunately, the review process at the journal took ages, and so did coming up with the revision. All of this should happen in the next few days.

yueqiw commented 6 years ago

Where can I find Scanpy 1.2?

falexwolf commented 6 years ago

This is what I'm currently compiling. While Scanpy 1.1 concerned some basic updates and more general features, Scanpy 1.2 will be about PAGA. No worries, everything is backward compatible... but PAGA will have many more cool features and, in addition, feature a second, better model. I will release it this weekend.

yueqiw commented 6 years ago

Thanks! We're submitting a paper soon, and I'm hoping to incorporate a stable model into it.

falexwolf commented 6 years ago

The current model is stable and has been successfully used in many instances. It will also be available in Scanpy 1.2. In addition, there will be another model. General improvements only regard the ease of use of PAGA and are model-independent anyways.

yueqiw commented 6 years ago

Will the new model also be stable and described in the manuscript/documentation? Thanks!

falexwolf commented 6 years ago

Yes, of course. Both will be permanent, described in the docs and in the paper.

yueqiw commented 6 years ago

Great! Looking forward to it.