Calling len(pd.DataFrame(...)) currently materializes the frame's Index and returns the length of the resulting pd.Index object. This PR adds a get_axis_len method to the query compiler so that the length of the columns or index can be determined without necessarily materializing the labels.
This may not make a large difference for existing backends, as the underlying PandasDataFrame caches the index/column labels together with the length of that axis. However, other backends may choose to cache the shape separately from the actual labels, and this extra method lets them avoid materializing those labels. As such, frontend methods that previously called len(df.index) should instead call the equivalent len(df), which does not force this materialization.
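To illustrate the motivation, here is a minimal sketch of a backend that caches its shape separately from its labels. It does not use Modin itself: `LazyLabelsCompiler`, `Frame`, `get_axis`, and the `materializations` counter are hypothetical names made up for this example, and the `get_axis_len(axis)` signature is an assumption rather than a statement of the actual query compiler API.

```python
class LazyLabelsCompiler:
    """Hypothetical query compiler: shape is cached, labels are built lazily."""

    def __init__(self, n_rows, n_cols):
        self._shape = (n_rows, n_cols)
        self._labels = [None, None]  # one slot per axis, filled on demand
        self.materializations = 0    # counts how often labels were built

    def get_axis(self, axis):
        # Building the labels stands in for an expensive materialization.
        if self._labels[axis] is None:
            self.materializations += 1
            self._labels[axis] = list(range(self._shape[axis]))
        return self._labels[axis]

    def get_axis_len(self, axis):
        # Assumed signature: answer from the cached shape, never touch labels.
        return self._shape[axis]


class Frame:
    """Hypothetical frontend object that delegates to the query compiler."""

    def __init__(self, query_compiler):
        self._query_compiler = query_compiler

    @property
    def index(self):
        return self._query_compiler.get_axis(0)

    def __len__(self):
        return self._query_compiler.get_axis_len(0)


qc = LazyLabelsCompiler(n_rows=1000, n_cols=4)
df = Frame(qc)

n_rows = len(df)                      # uses get_axis_len, labels untouched
count_after_len = qc.materializations

n_rows_via_index = len(df.index)      # forces the labels to be built
count_after_index = qc.materializations
```

In this toy setup both calls return the same length, but only `len(df.index)` triggers a materialization, which is why frontend code is better off calling `len(df)`.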
[x] first commit message and PR title follow format outlined here
NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.
flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
git commit -s
docs/development/architecture.rst is up-to-date