Combination of range() and count() inside update() leads to TypeError

ghost commented 3 years ago

Hi there I want to enumerate the rows in each group.

from datatable import dt, f, by, update, count

df = dt.Frame(x=["b"] * 3 + ["a"] * 3 + ["c"] * 3)

df['a'] = range(df.nrows) # works fine
df[:, update(b=count()), by('x')] # works fine
df[:, update(c=range(count())), by('x')] # TypeError

The last line of code leads to the following error:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/code.py", line 90, in runcode
    exec(code, self.locals)
  File "<input>", line 1, in <module>
TypeError: 'Expr' object cannot be interpreted as an integer

Is there a way fix or circumvent this error to get the desired output (only column 'c' is of importance)?:

   | x    a   b   c
-- + --  --  --  --
 0 | b    0   3   0
 1 | b    1   3   1
 2 | b    2   3   2
 3 | a    3   3   0
 4 | a    4   3   1
 5 | a    5   3   2
 6 | c    6   3   0
 7 | c    7   3   1
 8 | c    8   3   2
[9 rows x 4 columns]

(I asked the same on SO here: https://stackoverflow.com/questions/66585631/python-datatable-assign-new-column-by-group-to-enumerate-elements)

Thank you for your help :)

st-pasha commented 3 years ago

Thanks Peter for the question. The reason range(count()) doesn't work is because range is a pure-python function that expects its argument to be an integer. In this case, count() is not an integer - it is an expression that eventually evaluates to an integer within each group of a DT[i,j,by] call. However, as range() expects a plain integer, it throws an error here.

Unfortunately, I can't think of a way to implement what you're asking -- it may require creating a new function for this task.

st-pasha commented 3 years ago

I've checked some other packages, and here's how they do it:

pandas: cumcount();
data.table: 1:.N or seq_len(.N), both of which are R equivalents of python's range(). However, it works in R because they have lazy evaluation semantics built in;
tidyr: cumsum()
SQL Server: ROW_NUMBER() OVER(PARTITION BY)
Stata: _n variable

ghost commented 3 years ago

Thank you @st-pasha for your explanation very much. I have a way to get the desired output, however the solution isn't elegant at all. Anyway here it is:

from datatable import dt, f, by, update, count

df = dt.Frame(x=["b"] * 3 + ["a"] * 3 + ["c"] * 3)

values = dt.unique(df["x"]).to_list()[0]
for value in values:
    df[f.x == value, 'c'] = range(df[f.x == value, :].nrows)

Result:

   | x    c
-- + --  --
 0 | b    0
 1 | b    1
 2 | b    2
 3 | a    0
 4 | a    1
 5 | a    2
 6 | c    0
 7 | c    1
 8 | c    2

samukweku commented 3 years ago

Yeah, there isn't any clean way without implementing a function. There was a question about this on SO, with a hack

samukweku commented 3 years ago

@st-pasha in terms of API design for something like this, I would suggest maybe one function that can be appended sort of to existing functions, similar to what is available in numpy with ufunc.accumulate

Something like dt.accumulate(dt.sum(f[:])). a similar functionality exists in numpy as reduce.

Not sure if it makes sense or is doable; it just might help to not have so many functions, while still providing the functionality.

Another option would be the user defined function suggested in #1960. That would allow users make use of functions in numpy (which are quite fast and written in C) or other libraries.

st-pasha commented 3 years ago

In the simplest case when there are no groups, dt.sum(f[:]) produces a single row. Thus, dt.accumulate(<one row>) is not going to work. What you want here instead is (accumulate(sum))(f[:]), i.e. a function that augments the existing reduce operator. The other problem is that specific cum* operators can be written more efficiently than what a generic function accumulate can achieve.

samukweku commented 2 years ago

I want to take a stab at this, I am thinking of first implementing a cumsum (which is needed to do the cumcount).

samukweku commented 2 years ago

my knowledge of C++ shows through this; I'll unassign myself; hopefully someone else can pick this up

samukweku commented 2 years ago

this should be resolved by #3279

h2oai / datatable

Combination of range() and count() inside update() leads to TypeError #2892