malariagen / malariagen-data-python

Analyse MalariaGEN data from Python
https://malariagen.github.io/malariagen-data-python/latest/
MIT License

Improve performance of plotly dendrogram implementation #456

Closed alimanfoo closed 9 months ago

alimanfoo commented 9 months ago

For large haplotype clustering problems, the current plotly dendrogram implementation doesn't perform particularly well. E.g., for ~10,000 haplotypes it can take more than a minute just to build the plot. I think this is because each set of three lines joining two clusters is added to the plot as a separate trace, so plotly spends a long time in the update_traces function.

Instead, all of the lines could be plotted as a single trace. Some rough benchmarking suggests this can execute in well under 1 s.
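To illustrate the single-trace idea, here is a minimal sketch in plain Python. The helper name `merge_segments` and the two example links are hypothetical, made up for illustration. It relies on the fact that plotly treats a `None` in the coordinate arrays as a gap, so many disconnected line segments can share one trace.

```python
def merge_segments(segments):
    """Flatten many polylines into a single x list and a single y list.

    A None entry is appended after each polyline so that a plotting
    library which treats None as a gap draws them as separate shapes
    within one trace.
    """
    xs, ys = [], []
    for seg_x, seg_y in segments:
        xs.extend(seg_x)
        xs.append(None)  # gap marker: end of this polyline
        ys.extend(seg_y)
        ys.append(None)
    return xs, ys


# Two hypothetical dendrogram links, each drawn as 4 points (3 lines).
links = [
    ([5, 5, 15, 15], [0, 1, 1, 0]),
    ([10, 10, 25, 25], [1, 2, 2, 0]),
]
xs, ys = merge_segments(links)
```

The resulting `xs` and `ys` could then be passed to a single line trace, rather than adding one trace per link.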

alimanfoo commented 9 months ago

Here is the essence of the code needed to draw a dendrogram efficiently with plotly:

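import sys

import numpy as np
import pandas as pd
import plotly.express as px
import scipy.cluster.hierarchy

# This is needed to avoid RecursionError on some haplotype clustering analyses
# with larger numbers of haplotypes.
sys.setrecursionlimit(10000)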

# Compute pairwise distances.
dist, phased_samples, n_snps = ag3.haplotype_pairwise_distances(...)

# Perform hierarchical clustering.
Z = scipy.cluster.hierarchy.linkage(dist, method=linkage_method)

# Get scipy to build a dendrogram but not plot it.
dend = scipy.cluster.hierarchy.dendrogram(Z, count_sort=True, no_plot=True)

# Compile the line coordinates into a single dataframe.
px_segments_x = []
px_segments_y = []
for ik, dk in zip(dend["icoord"], dend["dcoord"]):
    # Adding None here breaks up the lines.
    px_segments_x += ik + [None]
    px_segments_y += dk + [None]
df_px_segments = pd.DataFrame({'x': px_segments_x, 'y': px_segments_y})

# Convert X coordinates to haplotype indices.
df_px_segments["x"] = (df_px_segments["x"] - 5) / 10

# Plot the lines.
fig = px.line(df_px_segments, x="x", y="y")

# Can add a scatter trace for the leaves too.
nl = len(dend["leaves"])
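# Use mode="markers" so the leaves are drawn as points rather than joined by a line.
fig.add_scatter(x=np.arange(nl), y=[-1]*nl, mode="markers")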
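As a possible variation on the loop above, the coordinates could also be compiled with NumPy instead of Python list concatenation. This is only a sketch under stated assumptions: the random points are synthetic stand-in data (not haplotype distances), `"average"` is an arbitrary linkage method, and NaN is used in place of `None` as the gap marker, which plotly also treats as a break when `connectgaps` is left at its default.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist

# Synthetic stand-in data: 50 random points in 4 dimensions.
rng = np.random.default_rng(0)
points = rng.random((50, 4))
dist = pdist(points)  # condensed pairwise distance matrix

# Cluster and build the dendrogram layout without plotting.
Z = linkage(dist, method="average")
dend = dendrogram(Z, count_sort=True, no_plot=True)

# Each link has 4 points, so both arrays have shape (n_links, 4).
icoord = np.asarray(dend["icoord"], dtype=float)
dcoord = np.asarray(dend["dcoord"], dtype=float)

# Append a NaN column so every link is followed by a gap, then flatten.
gap = np.full((icoord.shape[0], 1), np.nan)
x = np.hstack([icoord, gap]).ravel()
y = np.hstack([dcoord, gap]).ravel()

# scipy places leaf k at x = 5 + 10 * k; convert to leaf indices.
x = (x - 5) / 10
```

The flattened `x` and `y` arrays can then be passed to a single line trace in the same way as the dataframe columns above.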