pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

PERF: tab completion with a large index #18587

Open jreback opened 6 years ago

jreback commented 6 years ago

from https://github.com/pandas-dev/pandas/pull/16326#issuecomment-348473324

If you have a very large index, _dir_additions (for tab completion) actually takes quite a bit of time.

So what I would do is: if the index has, say, fewer than 100 entries, use the current _dir_additions; otherwise return an empty list (it's essentially too big to use tab completion for anyhow). Can you make this change and add an asv benchmark for it? (It could be a separate PR as well.)
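A minimal sketch of the size-threshold idea, written against a plain list of labels rather than a real pandas index. The function name dir_additions_by_size and the max_size parameter are hypothetical and are not part of pandas.

```python
def dir_additions_by_size(labels, max_size=100):
    """Return completion candidates only when there are fewer than max_size labels."""
    labels = list(labels)
    # Too many labels: skip the additions entirely, as proposed above.
    if len(labels) >= max_size:
        return set()
    # Only strings that are valid identifiers can be reached as attributes.
    return {c for c in labels if isinstance(c, str) and c.isidentifier()}
```

With this rule, an object with 99 valid labels offers all of them for completion, while one with 100 offers none.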

jreback commented 6 years ago

cc @jorisvandenbossche @TomAugspurger

BibMartin commented 6 years ago

One may have a very large index with few distinct values. I would suggest limiting the number of values returned rather than checking the size of the index. (It seems that the delay comes from handling the results rather than from computing dir.) Something like:

additions = set([c for c in self._info_axis.get_level_values(0).unique()[:100]
                 if isinstance(c, string_types) and isidentifier(c)])

Anyway, I think I can address this issue in #16326; the topics are closely related.
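For comparison, here is a self-contained sketch of capping the number of returned values instead. It works on a plain iterable of labels; dir_additions, MAX_COMPLETIONS and the limit parameter are illustrative names, and the de-duplication via dict.fromkeys stands in for the index's unique() call in the snippet above.

```python
from itertools import islice

MAX_COMPLETIONS = 100  # assumed cap, mirroring the [:100] slice above


def dir_additions(labels, limit=MAX_COMPLETIONS):
    """Return valid attribute names among the first `limit` unique labels."""
    # dict.fromkeys de-duplicates while keeping first-seen order.
    unique_labels = dict.fromkeys(labels)
    # Cap first, then filter, in the same order as the snippet above.
    first = islice(unique_labels, limit)
    return {c for c in first if isinstance(c, str) and c.isidentifier()}
```

A long index with only a handful of distinct labels still gets full completion here, which the size-based check would not allow.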

TomAugspurger commented 6 years ago

Do we know why _dir_additions is slow for large objects?

jreback commented 6 years ago

you can use self._info_axis.unique(level=0) here as a generic way to do this.

BibMartin commented 6 years ago

@TomAugspurger

> Do we know why _dir_additions is slow for large objects?

I don't know exactly, but the slowdown seems to come from the user interface: when I create a large Series (s = Series(index=tm.makeStringIndex(10000))) in a notebook or in an IPython console, dir(s) is fast (much less than 1 second) while asking for tab completion is slow (several seconds).
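The dir(s) half of that comparison can be timed with a small helper like the one below. This is only a sketch: time_dir is a hypothetical name, and it measures the dir() call alone, not whatever the notebook or console front end does with the result.

```python
import timeit


def time_dir(obj, number=100, repeat=5):
    """Return the best observed per-call time, in seconds, for dir(obj)."""
    timer = timeit.Timer(lambda: dir(obj))
    # Take the minimum over several repeats to reduce noise from other processes.
    return min(timer.repeat(repeat=repeat, number=number)) / number
```

With pandas installed, passing the Series from the comment above as obj would give the dir(s) figure. The tab-completion delay itself has to be observed in the front end.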

@jreback

> you can use self._info_axis.unique(level=0) here as a generic way to do this.

Yes thanks, that's an awesome new feature.

TomAugspurger commented 5 years ago

Was this fixed by https://github.com/pandas-dev/pandas/pull/20834? Tab completion on the following seems quick:

In [21]: s = Series(index=tm.makeStringIndex(10000))

In [22]: s.<tab>

jamespreed commented 5 years ago

I would like to add to this issue. My team often works with data sets that have hundreds of columns, and limiting the columns available for tab completion to 100 has been a hindrance. I am fine with capping the number for the sake of performance, but the choice of 100 seems arbitrary. Currently I work around this by editing the generics.py file in the pandas/core directory.

Suggestion:

Increase the cap on _dir_additions to 1000.

Analysis

I ran the following tab-completion benchmarks using %timeit in IPython on two separate laptops. In both cases the benchmarks were run on Pandas 0.25.0, first with the install as-is, and again after modifying generics.py to remove the slice in the set comprehension at line 5199.
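The shape of such a comparison can be sketched without pandas. The Table class below is a hypothetical stand-in for a DataFrame that exposes its column names through dir(), with an optional cap; bench is likewise an illustrative helper, and neither reproduces the exact commands behind the timings that follow.

```python
import timeit


class Table:
    """Hypothetical stand-in for a DataFrame: column names show up in dir()."""

    def __init__(self, columns, cap=None):
        self.columns = list(columns)
        self.cap = cap  # None means no limit on completion candidates

    def __dir__(self):
        names = self.columns if self.cap is None else self.columns[: self.cap]
        extra = {c for c in names if isinstance(c, str) and c.isidentifier()}
        return sorted(set(super().__dir__()) | extra)


def bench(n_cols_values, cap=None, number=20):
    """Map each column count to the mean per-call time, in seconds, of dir()."""
    results = {}
    for n in n_cols_values:
        table = Table((f"col_{i}" for i in range(n)), cap=cap)
        results[n] = timeit.timeit(lambda: dir(table), number=number) / number
    return results
```

Calling bench([1, 100, 10000], cap=100) and bench([1, 100, 10000]) gives capped and uncapped timings side by side for the stand-in class.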

Laptop 1

n_cols : benchmark for tab-completion
     1 : 134 µs ± 2.34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
     3 : 143 µs ± 8.38 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
     7 : 133 µs ± 893 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    10 : 132 µs ± 398 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    30 : 144 µs ± 9.55 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    70 : 133 µs ± 695 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
   100 : 692 µs ± 865 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
   300 : 684 µs ± 874 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
   700 : 792 µs ± 1.07 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
  1000 : 681 µs ± 870 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
  3000 : 687 µs ± 879 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
  7000 : 686 µs ± 875 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
 10000 : 761 µs ± 1.02 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
 30000 : 698 µs ± 901 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
 70000 : 692 µs ± 881 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
100000 : 679 µs ± 867 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
300000 : 961 µs ± 1.28 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

Pandas 0.25.0, modified generics.py

n_cols : benchmark for tab-completion
     1 : 139 µs ± 9.94 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
     3 : 132 µs ± 863 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
     7 : 153 µs ± 1.21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    10 : 690 µs ± 847 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    30 : 745 µs ± 935 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    70 : 623 µs ± 775 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
   100 : 676 µs ± 863 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
   300 : 1.61 ms ± 2.31 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
   700 : 2.08 ms ± 3.12 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
  1000 : 2.59 ms ± 3.89 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
  3000 : 30.3 ms ± 5.91 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
  7000 : 57.6 ms ± 750 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
 10000 : 81 ms ± 538 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
 30000 : 244 ms ± 1.46 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
 70000 : 584 ms ± 5.53 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
100000 : 845 ms ± 15.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
300000 : 2.54 s ± 12.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Laptop 2

Intel Core i3-3110M at 2.4 GHz, Windows 10, 1903

Pandas 0.25.0

n_cols : benchmark for tab-completion
     1 : 3.6 ms ± 110 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
     3 : 1.25 ms ± 1.52 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
     7 : 1.29 ms ± 1.58 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
    10 : 1.27 ms ± 1.55 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
    30 : 1.42 ms ± 1.79 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
    70 : 1.66 ms ± 2.18 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
   100 : 1.83 ms ± 2.44 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
   300 : 1.94 ms ± 2.64 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
   700 : 1.92 ms ± 2.61 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
  1000 : 1.89 ms ± 2.55 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
  3000 : 1.93 ms ± 2.61 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
  7000 : 2.35 ms ± 3.25 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
 10000 : 1.99 ms ± 2.66 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
 30000 : 1.82 ms ± 2.42 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
 70000 : 1.83 ms ± 2.45 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
100000 : 2.03 ms ± 2.73 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
300000 : 1.86 ms ± 2.48 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

Pandas 0.25.0, modified generics.py

n_cols : benchmark for tab-completion
     1 : 1.22 ms ± 1.45 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
     3 : 1.21 ms ± 1.46 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
     7 : 1.24 ms ± 1.51 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
    10 : 1.24 ms ± 1.51 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
    30 : 1.37 ms ± 1.69 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
    70 : 1.65 ms ± 2.16 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
   100 : 1.83 ms ± 2.43 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
   300 : 3.07 ms ± 4.42 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
   700 : 5.58 ms ± 8.38 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
  1000 : 26.9 ms ± 287 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
  3000 : 75.7 ms ± 2.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
  7000 : 206 ms ± 48.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
 10000 : 264 ms ± 14.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
 30000 : 728 ms ± 6.22 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
 70000 : 1.72 s ± 14.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
100000 : 2.5 s ± 26.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
300000 : 7.43 s ± 27.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Even on my 10-year-old laptop, tab completion with 1000 columns takes under 30 ms, which is still very responsive.

jamespreed commented 5 years ago

Additionally, it may be worth issuing a warning to the user when dir is called and _dir_additions is dropping attributes.

In my opinion, silently hiding them from the user goes against the philosophy of Python.
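A rough sketch of what such a warning could look like, using a hypothetical CappedTable class rather than pandas itself. The class name, the MAX_DIR_ADDITIONS constant and the message text are all assumptions made for illustration.

```python
import warnings


class CappedTable:
    """Hypothetical stand-in that warns when dir() hides some column names."""

    MAX_DIR_ADDITIONS = 100  # assumed cap

    def __init__(self, columns):
        self.columns = list(columns)

    def __dir__(self):
        valid = [c for c in self.columns if isinstance(c, str) and c.isidentifier()]
        hidden = len(valid) - self.MAX_DIR_ADDITIONS
        if hidden > 0:
            # Tell the user that some names are missing from completion.
            warnings.warn(
                f"{hidden} column names are hidden from dir() and tab completion",
                UserWarning,
                stacklevel=2,
            )
        shown = set(valid[: self.MAX_DIR_ADDITIONS])
        return sorted(set(super().__dir__()) | shown)
```

A real implementation would also need to decide how often to warn, since completion can call dir many times in one session.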