ENH: select from a MultiIndex using a list of composite keys

arengel commented 1 year ago

Feature Type

[X] Adding new functionality to pandas
[ ] Changing existing functionality in pandas
[ ] Removing existing functionality in pandas

Problem Description

I want to be able to select parts of a DataFrame by specifying a list of labels that do not cover all levels of a MultiIndex. This comes up when some subset of the levels of a MultiIndex comprise a composite key and need to be selected / slice together. As far as I can tell, the current slicing and selection tools either require you to specify all levels in the list of keys, or in slicing, treat the levels independently.

And to be clear, I don't mean to be proposing a new way of doing something that already exists. I offer this in case this problem does not already have a convenient solution, if such a solution exists, I'd love to know about it!

Taking the example from Advanced Indexing, the setup is the same giving

In  [55]: dfmi
Out [55]: 
lvl0           a         b     
lvl1         bar  foo  bah  foo
A0 B0 C0 D0    1    0    3    2
         D1    5    4    7    6
      C1 D0    9    8   11   10
         D1   13   12   15   14
      C2 D0   17   16   19   18
...          ...  ...  ...  ...
A3 B1 C1 D1  237  236  239  238
      C2 D0  241  240  243  242
         D1  245  244  247  246
      C3 D0  249  248  251  250
         D1  253  252  255  254

Let's say the first two levels are my composite key and I want to select [("A0", "B0"), ("A3", "B1")], the result I want is this:

lvl0           a         b     
lvl1         bar  foo  bah  foo
A0 B0 C0 D0    1    0    3    2
         D1    5    4    7    6
      C1 D0    9    8   11   10
         D1   13   12   15   14
      C2 D0   17   16   19   18
         D1   21   20   23   22
      C3 D0   25   24   27   26
         D1   29   28   31   30
A3 B1 C0 D0  225  224  227  226
         D1  229  228  231  230
      C1 D0  233  232  235  234
         D1  237  236  239  238
      C2 D0  241  240  243  242
         D1  245  244  247  246
      C3 D0  249  248  251  250
         D1  253  252  255  254

Approaches I have tried, result in the selecting all the combinations of ["A0", "A3"] and ["B0", "B1"], such as below:

In  [56]: dfmi.loc[(['A0', 'A3'], ["B0", "B1"]), :]
Out [56]: 
lvl0           a         b     
lvl1         bar  foo  bah  foo
A0 B0 C0 D0    1    0    3    2
         D1    5    4    7    6
      C1 D0    9    8   11   10
         D1   13   12   15   14
      C2 D0   17   16   19   18
         D1   21   20   23   22
      C3 D0   25   24   27   26
         D1   29   28   31   30
   B1 C0 D0   33   32   35   34
         D1   37   36   39   38
      C1 D0   41   40   43   42
         D1   45   44   47   46
      C2 D0   49   48   51   50
         D1   53   52   55   54
      C3 D0   57   56   59   58
         D1   61   60   63   62
A3 B0 C0 D0  193  192  195  194
         D1  197  196  199  198
      C1 D0  201  200  203  202
         D1  205  204  207  206
      C2 D0  209  208  211  210
         D1  213  212  215  214
      C3 D0  217  216  219  218
         D1  221  220  223  222
   B1 C0 D0  225  224  227  226
         D1  229  228  231  230
      C1 D0  233  232  235  234
         D1  237  236  239  238
      C2 D0  241  240  243  242
         D1  245  244  247  246
      C3 D0  249  248  251  250
         D1  253  252  255  254

Feature Description

Not being well versed in pandas internals, I won't try and propose how this could be implemented. For how it might be accessed, I could imagine two potential options.

Through df.loc so the above example would look something like this
```
df.loc[[("A0", "B0"), ("A3", "B1")], :]
```
Or extend df.xs() so that the key argument could accept a list of keys resulting in something like this

df.xs(key=[("A0", "B0"), ("A3", "B1")], level=(0, 1))

Alternative Solutions

My current approach to this problem is to either concatenate the composite key together and select it conventionally, or to turn my list of keys to select into a DataFrame, reset the index, and select using merge like this:

        dfmi.reset_index().merge(
            pd.DataFrame.from_records(
                [("A0", "B0"), ("A3", "B1")], columns=["level_0", "level_1"]
            ),
            on=["level_0", "level_1"],
            how="right",
            validate="m:1",
        ).set_index(["level_0", "level_1", "level_2", "level_3"])

Additional Context

No response

TrentonBush commented 1 year ago

I love this feature request, it fits a use case I run into all the time with "internet of things" style time series data. In those cases, the primary key of the dataset is something like ("device_id", "sensor_type", "timestamp") and setting a multiindex with those fields is useful. But I often want to select a specific subset of sensors specified by (device_id, sensor_type) pairs while keeping the full timeseries for each one.

I've never found a good way of doing that with a multiindex, though maybe (hopefully!) I'm simply not aware of the right syntax. I instead resort to either concatenating keys together or constructing boolean series, which a) feel a little verbose and cumbersome and b) sacrifice performance due to their inability to use existing indices.

samukweku commented 1 year ago

One option is with a list comprehension:

indices = [ind for ind in dfmi.index if ind[:2] in (('A0','B0'),('A3','B1'))]

In [114]: dfmi.loc[indices]
Out[114]:
lvl0           a         b
lvl1         bar  foo  bah  foo
A0 B0 C0 D0    1    0    3    2
         D1    5    4    7    6
      C1 D0    9    8   11   10
         D1   13   12   15   14
      C2 D0   17   16   19   18
         D1   21   20   23   22
      C3 D0   25   24   27   26
         D1   29   28   31   30
A3 B1 C0 D0  225  224  227  226
         D1  229  228  231  230
      C1 D0  233  232  235  234
         D1  237  236  239  238
      C2 D0  241  240  243  242
         D1  245  244  247  246
      C3 D0  249  248  251  250
         D1  253  252  255  254

Alternatively you can use Index.get_locs :

ind = dfmi.index
ind = [ind.get_locs(tup) for tup in (('A0','B0'),('A3','B1'))]
ind = np.concatenate(ind)
dfmi.iloc[ind]

lvl0           a         b
lvl1         bar  foo  bah  foo
A0 B0 C0 D0    1    0    3    2
         D1    5    4    7    6
      C1 D0    9    8   11   10
         D1   13   12   15   14
      C2 D0   17   16   19   18
         D1   21   20   23   22
      C3 D0   25   24   27   26
         D1   29   28   31   30
A3 B1 C0 D0  225  224  227  226
         D1  229  228  231  230
      C1 D0  233  232  235  234
         D1  237  236  239  238
      C2 D0  241  240  243  242
         D1  245  244  247  246
      C3 D0  249  248  251  250
         D1  253  252  255  254

You can abstract the process with select_rows from pyjanitor:

# pip install pyjanitor
import pandas as pd
import janitor

dfmi.select_rows(('A0','B0'), ('A3','B1'))
Out[40]: 
lvl0           a         b     
lvl1         bar  foo  bah  foo
A0 B0 C0 D0    1    0    3    2
         D1    5    4    7    6
      C1 D0    9    8   11   10
         D1   13   12   15   14
      C2 D0   17   16   19   18
         D1   21   20   23   22
      C3 D0   25   24   27   26
         D1   29   28   31   30
A3 B1 C0 D0  225  224  227  226
         D1  229  228  231  230
      C1 D0  233  232  235  234
         D1  237  236  239  238
      C2 D0  241  240  243  242
         D1  245  244  247  246
      C3 D0  249  248  251  250
         D1  253  252  255  254

pandas-dev / pandas