Open eric-czech opened 4 years ago
Sorry didn't see this notification until now. Turns out I wasn't getting notifications... switched to "Watch" mode.
Let me look into the number of connected components in STRING.
I think a heatmap of any of the following would help distinguish protein clusters: affinity / common neighbors / random walk distance / shortest path. I've been playing around with plotly. Would be really cool if we could get the heatmap and scatterplot to sync up, such that selecting points on the scatterplot would update the heatmap to show just that selection.
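For reference, two of the matrices mentioned above (common neighbors and shortest path) are cheap to compute for a selected subset of nodes. This is only a rough sketch on a made-up 5-node graph, not STRING data; the `selected` array stands in for whatever indices a scatterplot selection would return, and all variable names are hypothetical.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import shortest_path

# Toy undirected graph on 5 hypothetical proteins (not STRING data).
edges = [(0, 1), (1, 2), (0, 2), (3, 4)]
n = 5
rows, cols = zip(*edges)
adj = csr_matrix((np.ones(len(edges)), (rows, cols)), shape=(n, n))
adj = adj + adj.T  # symmetrize so each edge is stored in both directions

# Indices that would come from a scatterplot selection (assumed here).
selected = np.array([0, 2, 3])
sub = np.ix_(selected, selected)

# Common neighbors: entry (i, j) of A @ A counts neighbors shared by i and j.
# The diagonal is just each node's degree.
common = (adj @ adj).toarray()[sub]

# Unweighted shortest-path hop counts; pairs in different components are inf.
hops = shortest_path(adj, directed=False, unweighted=True)[sub]
```

Either `common` or `hops` could then be handed to a heatmap. On the real graph it would probably be better to pass `indices=selected` to `shortest_path` rather than computing all pairs.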
Ah yea that would be very cool. If you go the dash route, this could be a helpful resource on linking scatter plot clicks or lasso/range selections to events in other figures.
It would also be cool if there was a way to link groups of scatter points so that clicking on one of them does something corresponding to the whole group, like highlight all other points in the same connected component. I think Plot.ly could do that if you made sure all the points for one connected component are in the same trace, but I'm not sure if they provide the trace name on click (I think you can assign a point id though and do a reverse-lookup).
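A minimal sketch of that reverse-lookup idea, written with plain dicts so it runs without Plotly installed. It assumes the click payload has the shape Dash passes to callbacks as `clickData`, i.e. a `points` list whose entries carry `curveNumber` (the trace index) and `pointNumber`. The gene names, coordinates, and helper function names are all made up for illustration.

```python
# Hypothetical embedding points: (gene, x, y, component id).
points = [
    ("GENE_A", 0.1, 0.2, 0),
    ("GENE_B", 0.3, 0.1, 0),
    ("GENE_C", 2.0, 2.1, 1),
]

def build_traces(points):
    """One scatter trace per component, with the component id on every point."""
    by_component = {}
    for gene, x, y, comp in points:
        by_component.setdefault(comp, []).append((gene, x, y))
    traces = []
    for comp, members in sorted(by_component.items()):
        traces.append({
            "type": "scatter",
            "mode": "markers",
            "name": f"component {comp}",
            "x": [m[1] for m in members],
            "y": [m[2] for m in members],
            "text": [m[0] for m in members],
            "customdata": [comp] * len(members),
        })
    return traces

def component_from_click(click_data, traces):
    """Resolve a click payload to (trace name, component id)."""
    point = click_data["points"][0]
    trace = traces[point["curveNumber"]]  # trace index -> trace
    comp = trace["customdata"][point["pointNumber"]]
    return trace["name"], comp

traces = build_traces(points)
# Assumed payload shape for a click on the only point of the second trace.
click = {"points": [{"curveNumber": 1, "pointNumber": 0, "x": 2.0, "y": 2.1}]}
clicked = component_from_click(click, traces)
```

So even if the trace name is not in the payload itself, the trace index should be enough to recover it, and `customdata` gives a second route that does not depend on trace order. The trace dicts can be passed to `plotly.graph_objects.Figure(data=traces)` as they are.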
I looked into connected components in 12.connected-components.ipynb. Amazing that scipy was able to read all 12 matrices (one per evidence channel) and calculate the components in under 5 seconds.
I plotted the cumulative coverage of components ranked by size (interactive version in notebook):
For the combined score, the largest component contains 98.9% of all genes. This obviously could change if we applied a score threshold of 500. Although in general, it's probably best if we use edge weights rather than binarization as much as possible.
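The component and coverage calculation described above can be sketched roughly as follows. This is not the notebook's code, just an illustration on a toy 7-gene graph using `scipy.sparse.csgraph.connected_components`; the edge list and variable names are made up.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

# Toy adjacency: genes 0-3 form one component, 4-5 another, 6 is isolated.
edges = [(0, 1), (1, 2), (2, 3), (4, 5)]
n_genes = 7
rows, cols = zip(*edges)
adj = csr_matrix((np.ones(len(edges)), (rows, cols)), shape=(n_genes, n_genes))

# directed=False treats every stored edge as undirected.
n_components, labels = connected_components(adj, directed=False)

# Component sizes, largest first, and cumulative fraction of genes covered.
sizes = np.sort(np.bincount(labels))[::-1]
coverage = np.cumsum(sizes) / n_genes
```

Here `coverage[0]` is the share of genes in the largest component, which is the quantity quoted for the combined score.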
I think it would be a very helpful annotation on embedding visualizations
Yeah, I exported the component assignments in e3998c54c104fed617216af90e3404d6f70cec6f, so we can always add this to the embedding. Maybe it'd be helpful to differentiate all genes not in the giant connected component.
Actually upon further investigation, all genes for combined_score that are not in the giant connected component have a component size of 1, i.e. are entirely disconnected. I think these genes actually drop out during the node2vec embedding stage, so they aren't in the visualization.
If you go the dash route, this could be a helpful resource on linking scatter plot clicks or lasso/range selections to events in other figures.
Wow that Explorer UI is really cool. Will check with you before proceeding with any dashboarding solution using dash or voila, since you have much more experience here than me (zero exp).
It would also be cool if there was a way to link groups of scatter points so that clicking on one of them does something corresponding to the whole group
Yeah! Definitely. I'll look into migrating the Bokeh scatterplot to Plotly, which seems a bit more powerful, intuitive, and compatible with voila / notebooks.
I exported the component assignments in e3998c5
I think it would be good to have that for a few combined score thresholds (Jack was simply ignoring all below 900, which I saw in a publication or two as well)
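One way to see how a threshold changes things is to drop edges below each cutoff and recompute the components. A hedged sketch on invented scores: the edge list, the `giant_component_fraction` helper, and the extra cutoffs are all assumptions for illustration; only the 500 and 900 values come from this thread.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

# Toy weighted edges (gene_i, gene_j, combined_score); values are made up.
edges = [(0, 1, 950), (1, 2, 700), (2, 3, 450), (3, 4, 920), (4, 5, 300)]
n_genes = 6
rows, cols, scores = (np.array(v) for v in zip(*edges))

def giant_component_fraction(threshold):
    """Fraction of genes in the largest component after dropping edges below threshold."""
    keep = scores >= threshold
    adj = csr_matrix(
        (scores[keep], (rows[keep], cols[keep])), shape=(n_genes, n_genes)
    )
    _, labels = connected_components(adj, directed=False)
    return np.bincount(labels).max() / n_genes

# Sweep a few cutoffs; 0 keeps every edge.
fractions = {t: giant_component_fraction(t) for t in (0, 500, 900)}
```

In this toy graph the largest component shrinks from all genes to half at 500 and a third at 900. The `labels` array for each cutoff is what would be exported as the per-threshold component assignment.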
Will check with you before proceeding with any dashboarding solution
I'm always happy to riff about Dash! But I don't mean to bias you too much towards it. I'd love to know what other solutions can do. I default to that simply b/c I like Plot.ly and assume, probably incorrectly, that other libs don't do useful/interesting things beyond what it can. I definitely agree that the API is more intuitive though.
I'm fairly certain a good number of these exist in STRING and I think it would be a very helpful annotation on embedding visualizations (i.e. it would be good context to know which clusters are different graphs entirely vs less related groups of proteins in the same graph).