Open MarcoGorelli opened 4 months ago
This is an interesting question. While none of the internal plumbing hard-codes a specific data type (very intentionally), a lot of the transforms were designed to work with pandas or sparse data types. They do have a mechanism (single-dispatch) for customising the behaviour with other data types, but it isn't implemented for most transforms. Obviously we could cast back to a pyarrow data type at the end if we wanted to.
At least historically, I'd always viewed arrow as an interchange format, since there were few routines that ran directly on the arrow data structures themselves. I think this is changing, so I'm totally open to thinking through this more.
Do you have specific use-cases where having the output be an arrow table would make more sense for you?
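For readers unfamiliar with the single-dispatch mechanism mentioned above, here is a minimal sketch of the general pattern using only the standard library. It is not Formulaic's actual code: the `center` transform and the use of `list` and `tuple` as stand-ins for different dataframe types are assumptions made purely for illustration.

```python
from functools import singledispatch


@singledispatch
def center(values):
    """Fallback for input types with no registered implementation."""
    # Hypothetical transform: unregistered types fail loudly.
    raise NotImplementedError(
        f"center() is not implemented for {type(values).__name__}"
    )


@center.register
def _(values: list) -> list:
    # "Native" implementation: subtract the mean, return the same type.
    mean = sum(values) / len(values)
    return [v - mean for v in values]


@center.register
def _(values: tuple) -> tuple:
    # Another input type: reuse the list implementation, then cast the
    # result back so the caller gets the type they passed in.
    return tuple(center(list(values)))


print(center([1.0, 2.0, 3.0]))  # [-1.0, 0.0, 1.0]
print(center((1.0, 2.0, 3.0)))  # (-1.0, 0.0, 1.0)
```

The `tuple` overload mirrors the "cast back at the end" idea: the computation runs on one representation and the result is converted to the input's type on the way out.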
Thanks for your response!
> Do you have specific use-cases where having the output be an arrow table would make more sense for you?

I think if a user passes in Polars, they expect to get back Polars. And as I was looking into preserving the input data class for Polars, I noticed that for PyArrow the input data class isn't preserved.
If you're open to it, I could put up a PR demonstrating how Narwhals could work here, as suggested in https://github.com/matthewwardrop/formulaic/issues/160#issuecomment-2232854269? No obligations nor hard feelings if it then gets rejected of course, it just looks like a good use-case (for Polars in particular it would be good to keep things Polars-native if possible...maybe they can also stay lazy, not sure yet)
Hi @MarcoGorelli!
I was toying with Narwhals a bit this morning, and it looks great. I'm still leveling up, but I have most of an implementation working now in Formulaic that can use it as the materialization backend. Given your heavy involvement in Narwhals, I suspect you will know various tricks that I don't, so when I put up a PR soon, I'll let you chime in on it (and feel free at that time to make further contributions :)).
Cool, thanks!
> Given your heavy involvement in Narwhals

😄 I'm the original author (maybe I should make that clearer somewhere)

> when I put up a PR soon, I'll let you chime in on it

Sounds great! And feel free to join our Discord if you have any questions or requests that don't quite fit into a GitHub issue.
If I run the README example with PyArrow input, I get pandas output:
I think I'd have expected
I'm asking in the context of #160, because there, I think Polars input should probably result in Polars output?