pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

Datatree: Dynamically populate the HTML repr #9350

Open flamingbear opened 1 month ago

flamingbear commented 1 month ago

What is your issue?

Originally posted by @TomNicholas in https://github.com/xarray-contrib/datatree/issues/206

@andersy005, @jbusecke and I noticed that for big trees (hundreds or thousands of nodes) the HTML repr can become very slow to render, potentially locking up your Jupyter notebook.

We think that's because the HTML for the whole tree is rendered up front in one go and merely hidden by having sections default to closed. If your tree contains thousands of nodes, that's a lot of HTML to render.

@andersy005 suggested that perhaps the HTML repr should contain some kind of callback, so that the HTML for a node is only generated when that node is opened.

I don't know if that's possible at all, or whether it would work for reprs rendered in non-interactive environments (such as in xarray's static docs pages).
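
The callback idea can be illustrated without any notebook machinery. What follows is only a minimal sketch, assuming a tree stored as a plain nested dict rather than a DataTree; `LazyTreeRenderer` and its methods are hypothetical names, not part of xarray. It produces HTML for a node only when that node is requested, so the up-front cost depends on the number of children of the root rather than on the size of the whole tree. How such a request would be triggered by a click in Jupyter, and what a static docs page should show instead, is left open here.

```python
from html import escape


class LazyTreeRenderer:
    """Render one node of a nested-dict tree at a time (hypothetical helper)."""

    def __init__(self, tree):
        self.tree = tree
        self._cache = {}       # path tuple -> HTML already produced
        self.render_calls = 0  # number of nodes actually rendered

    def _lookup(self, path):
        # Walk down the nested dicts following the path components.
        node = self.tree
        for part in path:
            node = node[part]
        return node

    def render(self, path=()):
        """Return HTML for the node at `path`; its children stay closed placeholders."""
        if path in self._cache:
            return self._cache[path]
        self.render_calls += 1
        node = self._lookup(path)
        name = escape(path[-1]) if path else "root"
        items = []
        for child in node:
            child_path = escape("/".join(path + (child,)))
            # Only a placeholder is emitted; the child's own HTML is not built yet.
            items.append(f"<li data-path='{child_path}'>{escape(child)} (closed)</li>")
        html = f"<details open><summary>{name}</summary><ul>{''.join(items)}</ul></details>"
        self._cache[path] = html
        return html


# Assumed example tree: 25 "files" with 20 "groups" each, 526 nodes including the root.
tree = {f"file_{f}": {f"group_{g}": {} for g in range(20)} for f in range(25)}
renderer = LazyTreeRenderer(tree)
top = renderer.render()             # renders the root only
one = renderer.render(("file_0",))  # rendered on demand, e.g. when the node is opened
print(renderer.render_calls)        # 2
```

The cache means that reopening a node costs nothing, at the price of keeping the generated HTML in memory.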

Illviljan commented 1 month ago

Both the HTML repr and the normal repr struggle with large datatrees. The normal repr should probably be truncated in a similar fashion to the dataset repr: https://github.com/pydata/xarray/blob/ce5130f39d780cdce87366ee657665f4a5d3051d/xarray/core/options.py#L67
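
To make the truncation suggestion concrete, here is a rough sketch of what it could look like for a text repr. It assumes a plain nested dict in place of a DataTree, and `render_tree` and `max_children` are hypothetical names rather than existing xarray API or options. Each node shows at most `max_children` children (the first and last few) and summarises the rest in a single line.

```python
def render_tree(tree, max_children=4, indent=0):
    """Return text lines for a nested-dict tree, showing at most `max_children` per node."""
    lines = []
    names = list(tree)
    if len(names) > max_children:
        # Keep the first and last few children and mark the gap with None.
        head = max_children // 2 + max_children % 2
        tail = max_children // 2
        shown = names[:head] + [None] + (names[-tail:] if tail else [])
    else:
        shown = names
    pad = "    " * indent
    for name in shown:
        if name is None:
            lines.append(f"{pad}... ({len(names) - max_children} more)")
            continue
        lines.append(f"{pad}{name}")
        # Recurse only into the children that are actually shown.
        lines.extend(render_tree(tree[name], max_children, indent + 1))
    return lines


# Assumed example tree: 25 "files" with 20 "groups" each.
tree = {f"file_{f}": {f"group_{g}": {} for g in range(20)} for f in range(25)}
lines = render_tree(tree, max_children=4)
print("\n".join(lines[:6]))
print(len(lines))  # 25 lines, versus 525 for the untruncated tree
```

Because hidden children are never visited, the work done scales with the number of lines printed rather than with the number of nodes in the tree.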

Illviljan commented 1 day ago

With this example, the HTML repr takes more than 3 minutes, compared to about 840 ms for the normal repr:

import numpy as np
import xarray as xr
from xarray.core.datatree import DataTree

def create_datatree(number_of_files, number_of_groups, number_of_variables):
    datasets = {}
    for f in range(number_of_files):
        for g in range(number_of_groups):
            # Create random data:
            time = np.linspace(0, 50 + f, 100 + g)
            y = f * time + g

            # Create dataset:
            ds = xr.Dataset(
                data_vars={
                    f"temperature_{g}{i}": ("time", y)
                    for i in range(number_of_variables // number_of_groups)
                },
                coords={"time": ("time", time)},
            ).chunk()

            # Prepare for Datatree:
            name = f"file_{f}/group_{g}"
            datasets[name] = ds
    dt = DataTree.from_dict(datasets)

    return dt

number_of_files = 25
number_of_groups = 20
number_of_variables = 2000

dt = create_datatree(number_of_files, number_of_groups, number_of_variables)

# %timeit dt._repr_html_()
# 3min 15s ± 4.37 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

# %timeit dt.__repr__()
# 840 ms ± 29.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)