Improve performance of opening data explorer for a large pandas data frame

wesm commented 2 months ago

Positron Version:

main branch as of April 22, 2024

Currently, the data explorer does not display anything until the initial schema, data, and column null count requests have all completed. For a 33M-row data frame, for example, this results in a delay of several seconds while these are computed.

Screencast from 2024-04-22 17-30-28.webm

A few things to consider:

The DataExplorerCache.doUpdateCache method requests schemas, null counts, and data serially and does not fire an update event until all three have returned. These requests should be broken up and computed asynchronously.
In Python, expensive computations should probably be moved into coroutines so they don't block cheap computations (like raw data requests).
The null counts should definitely be computed asynchronously rather than blocking.
It should be possible to render the data waffle before knowing the column schemas, since for pandas data frames with large schemas, getting the schemas can itself be expensive.
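
To make the first point concrete, here is a rough sketch of a cache update that starts all three requests at once and fires an update event as each one returns, instead of once at the end. It is not the actual DataExplorerCache code: the Backend interface, the SketchCache class, the CachePart type, and the fake backend with its timings are all assumptions made up for illustration.

```typescript
// Hypothetical request interface; the real backend client differs.
interface Backend {
  getSchema(): Promise<string[]>;
  getData(): Promise<unknown[][]>;
  getNullCounts(): Promise<number[]>;
}

type CachePart = 'schema' | 'data' | 'nullCounts';

class SketchCache {
  schema?: string[];
  data?: unknown[][];
  nullCounts?: number[];
  private readonly backend: Backend;
  private readonly onDidUpdate: (part: CachePart) => void;

  constructor(backend: Backend, onDidUpdate: (part: CachePart) => void) {
    this.backend = backend;
    // Called once per completed request, not once after all three.
    this.onDidUpdate = onDidUpdate;
  }

  // Start all three requests together. Each stores its result and
  // notifies the listener as soon as it arrives.
  updateCache(): Promise<void> {
    const schema = this.backend.getSchema().then(s => {
      this.schema = s;
      this.onDidUpdate('schema');
    });
    const data = this.backend.getData().then(d => {
      this.data = d;
      this.onDidUpdate('data');
    });
    const nullCounts = this.backend.getNullCounts().then(n => {
      this.nullCounts = n;
      this.onDidUpdate('nullCounts');
    });
    // Resolves when everything has arrived; rendering need not wait for it.
    return Promise.all([schema, data, nullCounts]).then(() => undefined);
  }
}

// Helper for the fake backend: resolve with a value after a delay.
function delayed<T>(ms: number, value: T): Promise<T> {
  return new Promise(resolve => setTimeout(() => resolve(value), ms));
}

// Fake backend with invented timings: data is cheap, null counts are slow.
const fakeBackend: Backend = {
  getSchema: () => delayed(20, ['a', 'b']),
  getData: () => delayed(5, [[1, null], [2, 3]]),
  getNullCounts: () => delayed(40, [0, 1]),
};

const cache = new SketchCache(fakeBackend, part => {
  console.log(`updated: ${part}`);
});
void cache.updateCache();
```

With this shape, a listener can draw the waffle on the 'data' event and fill in column headers and null counts when the later events arrive. Error handling is left out to keep the sketch short.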

wesm commented 2 months ago

As shown in #2881, recomputing the null count profile statistics also impedes updating the waffle after applying a filter

wesm commented 1 month ago

This issue shouldn't require any backend changes. We need to compute the null counts asynchronously rather than blocking the initial loading of the waffle on the null counts request returning.
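
One possible frontend-only shape for this, sketched under assumptions: null counts are requested in the background, and a generation counter drops any response that belongs to an older filter, so a slow request cannot overwrite newer counts. NullCountFetcher, NullCountTracker, and resolveAfter are hypothetical names, not existing Positron APIs, and the generation counter is an added technique that the discussion above does not prescribe.

```typescript
// Hypothetical backend call: resolves with one null count per column.
type NullCountFetcher = (filter: string) => Promise<number[]>;

class NullCountTracker {
  // undefined means "not known yet"; the UI can show a placeholder.
  nullCounts: number[] | undefined = undefined;
  private generation = 0;
  private readonly fetchNullCounts: NullCountFetcher;
  private readonly onDidUpdate: () => void;

  constructor(fetchNullCounts: NullCountFetcher, onDidUpdate: () => void) {
    this.fetchNullCounts = fetchNullCounts;
    this.onDidUpdate = onDidUpdate;
  }

  // Call on initial load and whenever a filter is applied. The returned
  // promise can be ignored, so the waffle refresh never waits on it.
  refresh(filter: string): Promise<void> {
    const generation = ++this.generation;
    this.nullCounts = undefined;
    return this.fetchNullCounts(filter).then(counts => {
      // Drop results that belong to an older filter.
      if (generation !== this.generation) {
        return;
      }
      this.nullCounts = counts;
      this.onDidUpdate();
    });
  }
}

// Helper for trying the tracker out: resolve with a value after a delay.
function resolveAfter<T>(ms: number, value: T): Promise<T> {
  return new Promise(resolve => setTimeout(() => resolve(value), ms));
}
```

A caller would invoke refresh(...) without awaiting it, request and render the row data right away, and repaint the null count indicators when onDidUpdate fires.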

posit-dev / positron

Improve performance of opening data explorer for a large pandas data frame #2851
