modin-project / modin

Modin: Scale your Pandas workflows by changing a single line of code
http://modin.readthedocs.io
Apache License 2.0

PERF-#7397: Avoid materializing index/columns in shape checks #7398

Closed noloerino closed 2 months ago

noloerino commented 2 months ago

What do these changes do?

Calling len(pd.DataFrame(...)) currently materializes the frame's index and returns the length of the resulting pd.Index object. This PR adds a get_axis_len method to the query compiler so that the length of the columns or index can be determined without necessarily materializing the labels.

This may not make a large difference for existing backends, as the underlying PandasDataFrame caches the index/column labels together with the length of that axis. However, other backends may choose to cache the shape separately from the actual labels, and this extra method lets them skip materializing those labels. As such, frontend methods that previously called len(df.index) should instead call the equivalent len(df), which does not force the labels to be built.
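To illustrate the idea, here is a minimal, self-contained sketch that does not use Modin itself. `LazyLabels`, `ToyQueryCompiler`, and `ToyFrame` are hypothetical stand-ins, and the exact signature of `get_axis_len` shown here is an assumption; the point is only that a backend which tracks an axis length separately from its labels can answer `len(df)` without building the labels, while `len(df.index)` cannot.

```python
class LazyLabels:
    """Hypothetical axis labels: the length is known, the labels are costly to build."""

    def __init__(self, length):
        self._length = length
        self._labels = None
        self.materialized = False

    def known_length(self):
        # Cheap: answers from cached metadata only.
        return self._length

    def materialize(self):
        # Expensive in a real backend; here we just build a list once.
        if self._labels is None:
            self._labels = list(range(self._length))
            self.materialized = True
        return self._labels


class ToyQueryCompiler:
    """Hypothetical query compiler; not Modin's actual class."""

    def __init__(self, n_rows, n_cols):
        self._axes = (LazyLabels(n_rows), LazyLabels(n_cols))

    @property
    def index(self):
        return self._axes[0].materialize()

    @property
    def columns(self):
        return self._axes[1].materialize()

    def get_axis_len(self, axis):
        # Assumed signature: axis 0 is the index, axis 1 is the columns.
        return self._axes[axis].known_length()


class ToyFrame:
    """Hypothetical frontend object that delegates to the query compiler."""

    def __init__(self, query_compiler):
        self._qc = query_compiler

    @property
    def index(self):
        return self._qc.index

    def __len__(self):
        # Uses the cached length instead of len(self.index).
        return self._qc.get_axis_len(0)

    @property
    def shape(self):
        return (self._qc.get_axis_len(0), self._qc.get_axis_len(1))
```

With this sketch, `len(ToyFrame(ToyQueryCompiler(5, 3)))` leaves both `LazyLabels` objects unmaterialized, whereas `len(frame.index)` flips `materialized` to `True` for the row axis.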