PyArrow input should result in PyArrow output?

matthewwardrop / formulaic

A high-performance implementation of Wilkinson formulas for Python.

MIT License


PyArrow input should result in PyArrow output? #187

Open MarcoGorelli opened 4 months ago

MarcoGorelli commented 4 months ago

If I run the README example with PyArrow input, I get pandas output:

import pandas as pd
from formulaic import Formula
import pyarrow as pa

df = pa.table({
    'y': [0, 1, 2],
    'x': ['A', 'B', 'C'],
    'z': [0.3, 0.1, 0.2],
})

y, X = Formula('y ~ x + z').get_model_matrix(df)

print(y)
print(X)

   y
0  0
1  1
2  2
   Intercept  x[T.B]  x[T.C]    z
0        1.0       0       0  0.3
1        1.0       1       0  0.1
2        1.0       0       1  0.2

I think I'd have expected

pyarrow.Table
y: int64
----
y: [[0,1,2]]
pyarrow.Table
Intercept: double
x[T.B]: int64
x[T.C]: int64
z: double
----
Intercept: [[1,1,1]]
x[T.B]: [[0,1,0]]
x[T.C]: [[0,0,1]]
z: [[0.3,0.1,0.2]]

I'm asking in the context of #160, because there, I think Polars input should probably result in Polars output?

matthewwardrop commented 3 months ago

This is an interesting question. While none of the internal plumbing hard-codes a specific data type (very intentionally), a lot of the transforms were designed to work with pandas or sparse data types. They do have a mechanism (single dispatch) for customising the behaviour for other data types, but it isn't implemented for most transforms. Obviously we could cast back to a pyarrow data type at the end if we wanted to.
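
As a rough illustration of the single-dispatch idea mentioned above, here is a minimal sketch using the standard library's `functools.singledispatch`. It is not formulaic's code: the transform name `demean` and the choice of `list` and `tuple` as the "data types" are assumptions made purely to keep the example self-contained.

```python
from functools import singledispatch


@singledispatch
def demean(data):
    # Fallback for any type without a registered implementation.
    raise NotImplementedError(
        f"demean() has no implementation for {type(data).__name__}"
    )


@demean.register
def _(data: list):
    # Implementation for plain lists: subtract the mean from each value.
    mean = sum(data) / len(data)
    return [value - mean for value in data]


@demean.register
def _(data: tuple):
    # Implementation for tuples: reuse the list version, keep the input type.
    return tuple(demean(list(data)))


print(demean([1.0, 2.0, 3.0]))
print(demean((2.0, 4.0)))
```

Supporting a new container type then means registering one more implementation per transform, which is why coverage is incomplete when most transforms only register the pandas and sparse cases.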

At least historically, I'd always viewed arrow as an interchange format, since there were few routines that ran directly on the arrow datastructures themselves. I think this is changing, so I'm totally open to thinking through this more.

Do you have specific use-cases where having the output be an arrow table would make more sense for you?

MarcoGorelli commented 3 months ago

Thanks for your response!

> Do you have specific use-cases where having the output be an arrow table would make more sense for you?

I think if a user passes in Polars, they expect to get back Polars. And as I was looking into preserving the input data class for Polars, I noticed that for PyArrow the input data class isn't preserved either.

If you're open to it, I could put up a PR demonstrating how Narwhals could work here, as suggested in https://github.com/matthewwardrop/formulaic/issues/160#issuecomment-2232854269? No obligation and no hard feelings if it then gets rejected, of course; it just looks like a good use case (for Polars in particular it would be good to keep things Polars-native if possible... maybe they can also stay lazy, not sure yet).

matthewwardrop commented 3 months ago

Hi @MarcoGorelli!

I was toying with Narwhals a bit this morning, and it looks great. I'm still leveling up, but I have most of an implementation working now in Formulaic that can use it as the materialization backend. Given your heavy involvement in Narwhals, I suspect you will know various tricks that I don't, so when I put up a PR soon, I'll let you chime in on it (and feel free at that time to make further contributions :)).

MarcoGorelli commented 3 months ago

Cool, thanks!

> Given your heavy involvement in Narwhals

😄 I'm the original author (maybe I should make that clearer somewhere)

> when I put up a PR soon, I'll let you chime in on it

Sounds great! And feel free to join our Discord if you have any questions or requests that don't quite fit into a GitHub issue.