Closed (alimanfoo closed 11 months ago)
Here is the essence of the code to draw a dendrogram efficiently with plotly:
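import sys

import numpy as np
import pandas as pd
import plotly.express as px
import scipy.cluster.hierarchy

# This is needed to avoid RecursionError on some haplotype clustering analyses
# with larger numbers of haplotypes.
sys.setrecursionlimit(10000)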
# Compute pairwise distances.
dist, phased_samples, n_snps = ag3.haplotype_pairwise_distances(...)
# Perform hierarchical clustering.
Z = scipy.cluster.hierarchy.linkage(dist, method=linkage_method)
# Get scipy to build a dendrogram but not plot it.
dend = scipy.cluster.hierarchy.dendrogram(Z, count_sort=True, no_plot=True)
# Compile the line coordinates into a single dataframe.
px_segments_x = []
px_segments_y = []
for ik, dk in zip(dend["icoord"], dend["dcoord"]):
    # Adding None here breaks up the lines.
    px_segments_x += ik + [None]
    px_segments_y += dk + [None]
df_px_segments = pd.DataFrame({'x': px_segments_x, 'y': px_segments_y})
# Convert X coordinates to haplotype indices (scipy places the leaves at
# x = 5, 15, 25, ..., so this maps them to 0, 1, 2, ...).
df_px_segments["x"] = (df_px_segments["x"] - 5) / 10
# Plot the lines.
fig = px.line(df_px_segments, x="x", y="y")
# Can add a scatter trace for the leaves too.
nl = len(dend["leaves"])
fig.add_scatter(x=np.arange(nl), y=[-1] * nl, mode="markers")
For large haplotype clustering problems, the current plotly dendrogram implementation doesn't perform particularly well. E.g., for ~10,000 haplotypes it can take more than a minute just to build the plot. I think this is because each set of three lines joining two clusters is added to the plot as a separate trace, so plotly spends a long time in the update_traces function. Instead, it would be possible to plot all lines as a single trace. Some rough benchmarking suggests this can execute in well under 1s.
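To make the single-trace idea concrete, here is a minimal sketch of just the coordinate-flattening step, using only the standard library so it can be checked without plotly or scipy. The helper name `flatten_segments` is hypothetical, and the coordinates are hand-written for a three-leaf tree in the layout scipy uses (leaves at x = 5, 15, 25, ...) rather than real haplotype data.

```python
from typing import List, Optional, Sequence, Tuple

Coord = Optional[float]


def flatten_segments(
    icoord: Sequence[Sequence[float]],
    dcoord: Sequence[Sequence[float]],
) -> Tuple[List[Coord], List[Coord]]:
    """Concatenate per-merge polylines into one None-separated x/y pair.

    Hypothetical helper for illustration; not part of scipy or plotly.
    """
    xs: List[Coord] = []
    ys: List[Coord] = []
    for ik, dk in zip(icoord, dcoord):
        # scipy places leaves at x = 5, 15, 25, ...; rescale to 0, 1, 2, ...
        xs.extend((x - 5) / 10 for x in ik)
        ys.extend(dk)
        # A None entry leaves a gap, so the polylines are not joined together.
        xs.append(None)
        ys.append(None)
    return xs, ys


# Hand-written coordinates for a three-leaf tree (an assumption, not real data):
# leaves 1 and 2 merge at height 1.0, then leaf 0 joins them at height 2.0.
icoord = [[15.0, 15.0, 25.0, 25.0], [5.0, 5.0, 20.0, 20.0]]
dcoord = [[0.0, 1.0, 1.0, 0.0], [0.0, 2.0, 2.0, 1.0]]

xs, ys = flatten_segments(icoord, dcoord)
print(xs)
print(ys)
```

The two returned lists have the same shape as `px_segments_x` and `px_segments_y` in the code above, with the X rescaling already applied, so they could be passed as the "x" and "y" columns of the dataframe given to `px.line`. However many merges the tree has, the result is one pair of lists and therefore one trace.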