Open samukweku opened 1 year ago
Thanks @samukweku. So the issue on this PR is that skipna
doesn't work? Why tests fail with "got an unexpected keyword argument n"?
I guess this implementation has an issue even with no skipna
. Look at the example below
from datatable import dt, f, by
DT = dt.Frame([1, 1, 2])
DT[:, f.C0.nth(0), by(f.C0)]
Triggers
AssertionError: Assertion 'i < nrows()' failed in src/core/column.cc, line 236
UPD: I guess this issue has never been fixed: https://github.com/h2oai/datatable/pull/3346#issuecomment-1259217559 I propose not to close PR's when there is something wrong with your commits, just let me know and we work together on fixing it. Otherwise, we loose a conversation and some issues like the one above.
As for the skipna
, I've been looking into this and it seems that if we just pass the validity mask to Nth_ColumnImpl
we could have issues with negative n
. Because to figure out which row is the row -X
, we first need to establish all the other row ids (skipping missings) and then calculate the X'th row from the end.
So what we need to do is, first, to apply the validity mask to the original frame (filtering out "all" or "any" missings). However, since the rows are already grouped, we may need to adjust the group by...
my bad ... I am lost on the explanation regarding skipna
@oleksiyskononenko /home/sam/datatable/docs/api/fexpr/nth.rst: WARNING: document isn't included in any toctree
where is the toctree?
@samukweku ah, you just need to add nth()
function to the api, just like we do for all the other functions, see index-api.rst
.
@oleksiyskononenko what do you suggest is the way forward for this PR? drop the skipna parameter and let the user handle it instead? Do you mind showing me an implementation for skipna assuming a positive n
, so I can understand how FExpr_all and any would be used
Looking further at pandas' implementation of dropna
, the dropna
is applied to the entire dataframe (including the groupby column), before the nth
function is applied. What we should be doing for nth
and any other relevant function is to apply this per column; if the user wants it per row, that should be applied in the i
section as a filter, same way we'd do if we were removing nulls from the dataframe:
Using the example from pandas' home page
from datatable import dt, f, by
import pandas as pd
df = pd.DataFrame({'A': [1, 1, 2, 1, 2],
'B': [np.nan, 2, 3, 4, 5]}, columns=['A', 'B'])
g = df.groupby('A')
DT = dt.Frame(df)
In [5]: df
Out[5]:
A B
0 1 NaN
1 1 2.0
2 2 3.0
3 1 4.0
4 2 5.0
In [6]: DT
Out[6]:
| A B
| int64 float64
-- + ----- -------
0 | 1 NA
1 | 1 2
2 | 2 3
3 | 1 4
4 | 2 5
[5 rows x 2 columns]
without skipping nulls:
In [7]: g.nth(0)
Out[7]:
B
A
1 NaN
2 3.0
In [9]: DT[:, dt.nth(f.B,0), 'A']
Out[9]:
| A B
| int64 float64
-- + ----- -------
0 | 1 NA
1 | 2 3
[2 rows x 2 columns]
In [10]: DT[:, dt.nth(f.B,1), 'A']
Out[10]:
| A B
| int64 float64
-- + ----- -------
0 | 1 2
1 | 2 5
[2 rows x 2 columns]
In [11]: g.nth(1)
Out[11]:
B
A
1 2.0
2 5.0
In [12]: g.nth(-1)
Out[12]:
B
A
1 4.0
2 5.0
In [13]: DT[:, dt.nth(f.B,-1), 'A']
Out[13]:
| A B
| int64 float64
-- + ----- -------
0 | 1 4
1 | 2 5
[2 rows x 2 columns]
skipping nulls:
In [14]: g.nth(0, dropna='any')
Out[14]:
B
A
1 2.0
2 3.0
In [21]: g.nth(0, dropna='all')
Out[21]:
B
A
1 NaN
2 3.0
In [42]: DT[~((f[:]==None).rowall()), :][:, dt.nth(f.B, 0), f.A]
Out[42]:
| A B
| int64 float64
-- + ----- -------
0 | 1 NA
1 | 2 3
[2 rows x 2 columns]
In [43]: DT[~((f[:]==None).rowany()), :][:, dt.nth(f.B, 0), f.A]
Out[43]:
| A B
| int64 float64
-- + ----- -------
0 | 1 2
1 | 2 3
[2 rows x 2 columns]
In [54]: g.nth(3, dropna='any')
Out[54]:
B
A
1 NaN
2 NaN
In [55]: DT[~((f[:]==None).rowany()), :][:, dt.nth(f.B, 3), f.A]
Out[55]:
| A B
| int64 float64
-- + ----- -------
0 | 1 NA
1 | 2 NA
[2 rows x 2 columns]
It'd probably wont be a bad idea to implement a dropna function for use in the i
section; a better option (which I hope to get to sometime) is to implement methods for ==
, !=
, >
, ... which would make chaining easier, while making the computation clearer and cleaner to the eyes -> DT[~f[:].eq(None).rowany(),:]
.
That was a digression; my point is in Pandas, they treat nth
as per row, we should treat it as per column; if the user wishes something per row, it should go into the i
section (and rightly so).
Of course, another option would be via cumcount, and subsequent filtering:
In [51]: DT[:, f[:].extend(dt.cumcount()), f.A]
Out[51]:
| A B C0
| int64 float64 int64
-- + ----- ------- -----
0 | 1 NA 0
1 | 1 2 1
2 | 1 4 2
3 | 2 3 0
4 | 2 5 1
[5 rows x 3 columns]
# fetch second row per column
In [52]: DT[:, f[:].extend(dt.cumcount()), f.A][f[-1]==1, :-1]
Out[52]:
| A B
| int64 float64
-- + ----- -------
0 | 1 2
1 | 2 5
[2 rows x 2 columns]
# fetch first row per column
In [53]: DT[:, f[:].extend(dt.cumcount()), f.A][f[-1]==0, :-1]
Out[53]:
| A B
| int64 float64
-- + ----- -------
0 | 1 NA
1 | 2 3
[2 rows x 2 columns]
Allowing the skipna per column allows us to also extend to first
and last
implementation to fetch first non-null value, if the user desires that.
@oleksiyskononenko revisiting the issue of skipna per column or rowwise ^^^^^^^^^^^^^^^
@oleksiyskononenko, @sh1ng , @st-pasha - need ur help with how to use FExpr_row
functions -
Workframe inputs = arg_->evaluate_n(ctx);
Grouping gmode = inputs.get_grouping_mode();
colvec columns;
size_t ncols = inputs.ncols();
size_t nrows = 1;
columns.reserve(ncols);
for (size_t i = 0; i < ncols; ++i) {
Column col = inputs.retrieve_column(i);
xassert(i == 0 || nrows == col.nrows());
nrows = col.nrows();
columns.emplace_back(col);
}
Column col_out = FExpr_RowAll::apply_function(std::move(columns), nrows, ncols);
Error message:
src/core/expr/fexpr_nth.cc: In member function ‘dt::expr::Workframe dt::expr::FExpr_Nth<SKIPNA>::evaluate_n(dt::expr::EvalContext&) const’:
src/core/expr/fexpr_nth.cc:77:52: error: cannot call member function ‘virtual Column dt::expr::FExpr_RowAll::apply_function(colvec&&, dt::expr::size_t, dt::expr::size_t) const’ without object
77 | Column col_out = FExpr_RowAll::apply_function(std::move(columns), nrows, ncols);
what is the correct way of using FExpr_RowAll
?
@oleksiyskononenko this is dependent on PR #3404 and #3444. once those PRS are concluded, I can pick up on this
@oleksiyskononenko figured out how to implement SKIPNA='any' or SKIPNA='all', similar to what pandas implements. If you've got some time to review, after the build completes. thanks
@oleksiyskononenko just checking in, waiting for your feedback
Implement
dt.nth(cols, n)
function to return the nth row (also per group) for the specified columns. Ifn
goes out of bounds, NA-row is returned.Closes #3128