h2oai / datatable

A Python package for manipulating 2-dimensional tabular data structures
https://datatable.readthedocs.io
Mozilla Public License 2.0
1.82k stars 157 forks source link

[ENH] nth function #3404

Open samukweku opened 1 year ago

samukweku commented 1 year ago

Implement dt.nth(cols, n) function to return the nth row (also per group) for the specified columns. If n goes out of bounds, NA-row is returned.

Closes #3128

oleksiyskononenko commented 1 year ago

Thanks @samukweku. So the issue on this PR is that skipna doesn't work? Why tests fail with "got an unexpected keyword argument n"?

oleksiyskononenko commented 1 year ago

I guess this implementation has an issue even with no skipna. Look at the example below

from datatable import dt, f, by
DT = dt.Frame([1, 1, 2])
DT[:, f.C0.nth(0), by(f.C0)]

Triggers

AssertionError: Assertion 'i < nrows()' failed in src/core/column.cc, line 236

UPD: I guess this issue has never been fixed: https://github.com/h2oai/datatable/pull/3346#issuecomment-1259217559 I propose not to close PR's when there is something wrong with your commits, just let me know and we work together on fixing it. Otherwise, we loose a conversation and some issues like the one above.

oleksiyskononenko commented 1 year ago

As for the skipna, I've been looking into this and it seems that if we just pass the validity mask to Nth_ColumnImpl we could have issues with negative n. Because to figure out which row is the row -X, we first need to establish all the other row ids (skipping missings) and then calculate the X'th row from the end.

So what we need to do is, first, to apply the validity mask to the original frame (filtering out "all" or "any" missings). However, since the rows are already grouped, we may need to adjust the group by...

samukweku commented 1 year ago

my bad ... I am lost on the explanation regarding skipna

samukweku commented 1 year ago

@oleksiyskononenko /home/sam/datatable/docs/api/fexpr/nth.rst: WARNING: document isn't included in any toctree where is the toctree?

oleksiyskononenko commented 1 year ago

@samukweku ah, you just need to add nth() function to the api, just like we do for all the other functions, see index-api.rst.

samukweku commented 1 year ago

@oleksiyskononenko what do you suggest is the way forward for this PR? drop the skipna parameter and let the user handle it instead? Do you mind showing me an implementation for skipna assuming a positive n, so I can understand how FExpr_all and any would be used

samukweku commented 1 year ago

Looking further at pandas' implementation of dropna, the dropna is applied to the entire dataframe (including the groupby column), before the nth function is applied. What we should be doing for nth and any other relevant function is to apply this per column; if the user wants it per row, that should be applied in the i section as a filter, same way we'd do if we were removing nulls from the dataframe:

Using the example from pandas' home page

from datatable import dt, f, by
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 1, 2],
                                   'B': [np.nan, 2, 3, 4, 5]}, columns=['A', 'B'])
g = df.groupby('A')

DT = dt.Frame(df)

In [5]: df
Out[5]: 
   A    B
0  1  NaN
1  1  2.0
2  2  3.0
3  1  4.0
4  2  5.0

In [6]: DT
Out[6]: 
   |     A        B
   | int64  float64
-- + -----  -------
 0 |     1       NA
 1 |     1        2
 2 |     2        3
 3 |     1        4
 4 |     2        5
[5 rows x 2 columns]

without skipping nulls:

In [7]: g.nth(0)
Out[7]: 
     B
A     
1  NaN
2  3.0

In [9]: DT[:, dt.nth(f.B,0), 'A']
Out[9]: 
   |     A        B
   | int64  float64
-- + -----  -------
 0 |     1       NA
 1 |     2        3
[2 rows x 2 columns]

In [10]: DT[:, dt.nth(f.B,1), 'A']
Out[10]: 
   |     A        B
   | int64  float64
-- + -----  -------
 0 |     1        2
 1 |     2        5
[2 rows x 2 columns]

In [11]: g.nth(1)
Out[11]: 
     B
A     
1  2.0
2  5.0

In [12]: g.nth(-1)
Out[12]: 
     B
A     
1  4.0
2  5.0

In [13]: DT[:, dt.nth(f.B,-1), 'A']
Out[13]: 
   |     A        B
   | int64  float64
-- + -----  -------
 0 |     1        4
 1 |     2        5
[2 rows x 2 columns]

skipping nulls:

In [14]: g.nth(0, dropna='any')
Out[14]: 
     B
A     
1  2.0
2  3.0

In [21]: g.nth(0, dropna='all')
Out[21]: 
     B
A     
1  NaN
2  3.0

In [42]: DT[~((f[:]==None).rowall()), :][:, dt.nth(f.B, 0), f.A]
Out[42]: 
   |     A        B
   | int64  float64
-- + -----  -------
 0 |     1       NA
 1 |     2        3
[2 rows x 2 columns]

In [43]: DT[~((f[:]==None).rowany()), :][:, dt.nth(f.B, 0), f.A]
Out[43]: 
   |     A        B
   | int64  float64
-- + -----  -------
 0 |     1        2
 1 |     2        3
[2 rows x 2 columns]

In [54]: g.nth(3, dropna='any')
Out[54]: 
    B
A    
1 NaN
2 NaN

In [55]: DT[~((f[:]==None).rowany()), :][:, dt.nth(f.B, 3), f.A]
Out[55]: 
   |     A        B
   | int64  float64
-- + -----  -------
 0 |     1       NA
 1 |     2       NA
[2 rows x 2 columns]

It'd probably wont be a bad idea to implement a dropna function for use in the i section; a better option (which I hope to get to sometime) is to implement methods for ==, !=, >, ... which would make chaining easier, while making the computation clearer and cleaner to the eyes -> DT[~f[:].eq(None).rowany(),:].

That was a digression; my point is in Pandas, they treat nth as per row, we should treat it as per column; if the user wishes something per row, it should go into the i section (and rightly so).

Of course, another option would be via cumcount, and subsequent filtering:

In [51]: DT[:, f[:].extend(dt.cumcount()), f.A]
Out[51]: 
   |     A        B     C0
   | int64  float64  int64
-- + -----  -------  -----
 0 |     1       NA      0
 1 |     1        2      1
 2 |     1        4      2
 3 |     2        3      0
 4 |     2        5      1
[5 rows x 3 columns]

# fetch second row per column
In [52]: DT[:, f[:].extend(dt.cumcount()), f.A][f[-1]==1, :-1]
Out[52]: 
   |     A        B
   | int64  float64
-- + -----  -------
 0 |     1        2
 1 |     2        5
[2 rows x 2 columns]

# fetch first row per column
In [53]: DT[:, f[:].extend(dt.cumcount()), f.A][f[-1]==0, :-1]
Out[53]: 
   |     A        B
   | int64  float64
-- + -----  -------
 0 |     1       NA
 1 |     2        3
[2 rows x 2 columns]

Allowing the skipna per column allows us to also extend to first and last implementation to fetch first non-null value, if the user desires that.

samukweku commented 1 year ago

@oleksiyskononenko revisiting the issue of skipna per column or rowwise ^^^^^^^^^^^^^^^

samukweku commented 1 year ago

@oleksiyskononenko, @sh1ng , @st-pasha - need ur help with how to use FExpr_row functions -

Workframe inputs = arg_->evaluate_n(ctx);
      Grouping gmode = inputs.get_grouping_mode();
      colvec columns;
      size_t ncols = inputs.ncols();
      size_t nrows = 1;
      columns.reserve(ncols);
      for (size_t i = 0; i < ncols; ++i) {
        Column col = inputs.retrieve_column(i);
        xassert(i == 0 || nrows == col.nrows());
        nrows = col.nrows();
        columns.emplace_back(col);        
      }

      Column col_out = FExpr_RowAll::apply_function(std::move(columns), nrows, ncols);

Error message:

src/core/expr/fexpr_nth.cc: In member function ‘dt::expr::Workframe dt::expr::FExpr_Nth<SKIPNA>::evaluate_n(dt::expr::EvalContext&) const’:
src/core/expr/fexpr_nth.cc:77:52: error: cannot call member function ‘virtual Column dt::expr::FExpr_RowAll::apply_function(colvec&&, dt::expr::size_t, dt::expr::size_t) const’ without object
   77 |       Column col_out = FExpr_RowAll::apply_function(std::move(columns), nrows, ncols);

what is the correct way of using FExpr_RowAll?

samukweku commented 1 year ago

@oleksiyskononenko this is dependent on PR #3404 and #3444. once those PRS are concluded, I can pick up on this

samukweku commented 1 year ago

@oleksiyskononenko figured out how to implement SKIPNA='any' or SKIPNA='all', similar to what pandas implements. If you've got some time to review, after the build completes. thanks

samukweku commented 1 year ago

@oleksiyskononenko just checking in, waiting for your feedback