Closed ghost closed 2 years ago
Thanks Peter for the question.
The reason range(count())
doesn't work is because range
is a pure-python function that expects its argument to be an integer. In this case, count()
is not an integer - it is an expression that eventually evaluates to an integer within each group of a DT[i,j,by]
call. However, as range()
expects a plain integer, it throws an error here.
Unfortunately, I can't think of a way to implement what you're asking -- it may require creating a new function for this task.
I've checked some other packages, and here's how they do it:
1:.N
or seq_len(.N)
, both of which are R equivalents of python's range()
. However, it works in R because they have lazy evaluation semantics built in;Thank you @st-pasha for your explanation very much. I have a way to get the desired output, however the solution isn't elegant at all. Anyway here it is:
from datatable import dt, f, by, update, count
df = dt.Frame(x=["b"] * 3 + ["a"] * 3 + ["c"] * 3)
values = dt.unique(df["x"]).to_list()[0]
for value in values:
df[f.x == value, 'c'] = range(df[f.x == value, :].nrows)
Result:
| x c
-- + -- --
0 | b 0
1 | b 1
2 | b 2
3 | a 0
4 | a 1
5 | a 2
6 | c 0
7 | c 1
8 | c 2
Yeah, there isn't any clean way without implementing a function. There was a question about this on SO, with a hack
@st-pasha in terms of API design for something like this, I would suggest maybe one function that can be appended sort of to existing functions, similar to what is available in numpy with ufunc.accumulate
Something like dt.accumulate(dt.sum(f[:]))
. a similar functionality exists in numpy as reduce.
Not sure if it makes sense or is doable; it just might help to not have so many functions, while still providing the functionality.
Another option would be the user defined function suggested in #1960. That would allow users make use of functions in numpy (which are quite fast and written in C) or other libraries.
In the simplest case when there are no groups, dt.sum(f[:])
produces a single row. Thus, dt.accumulate(<one row>)
is not going to work. What you want here instead is (accumulate(sum))(f[:])
, i.e. a function that augments the existing reduce operator. The other problem is that specific cum*
operators can be written more efficiently than what a generic function accumulate
can achieve.
I want to take a stab at this, I am thinking of first implementing a cumsum (which is needed to do the cumcount).
my knowledge of C++ shows through this; I'll unassign myself; hopefully someone else can pick this up
this should be resolved by #3279
Hi there I want to enumerate the rows in each group.
The last line of code leads to the following error:
Is there a way fix or circumvent this error to get the desired output (only column 'c' is of importance)?:
(I asked the same on SO here: https://stackoverflow.com/questions/66585631/python-datatable-assign-new-column-by-group-to-enumerate-elements)
Thank you for your help :)