lostmygithubaccount closed this issue 7 months ago
We don't support this kind of cross backend data loading, not even with two different instances of the same backend.
Don't we already have an issue for this?
If you can add a use case that would also be helpful. It's not trivial to make this work, so having some rationale might help justify the effort.
I just ran into this while getting demo data for #8090; I don't consider that alone worth the effort. The overall issue is #8115, which I do think is worth prioritizing. Similar to #8426, I think it adds a lot of value to be able to easily (and ideally efficiently) move data across all the systems Ibis supports for a few use cases:
for this issue, feel free to just close in favor of #8115, though I also think it'd be good to have a better error message here ("Error: cannot transfer data between DuckDB and PySpark backends") -- I'm not sure how difficult detecting and adding that is
for the purposes of the tutorial I'll just write to CSV/Parquet and read from that
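The suggested error message could be produced by checking the two backends before any transfer is attempted. A minimal sketch of that detection logic, where `check_same_backend`, `CrossBackendError`, and the backend-name strings are hypothetical stand-ins and not real ibis API:

```python
# Hypothetical sketch of the clearer error suggested above; none of
# these names exist in ibis -- they only illustrate the detection logic.
class CrossBackendError(TypeError):
    """Raised when an expression from one backend is loaded into another."""

def check_same_backend(dest_name: str, source_name: str) -> None:
    # Compare the backend behind the destination connection with the
    # backend behind the source expression before moving any data.
    if dest_name != source_name:
        raise CrossBackendError(
            f"cannot transfer data between {source_name} and {dest_name} backends"
        )

check_same_backend("pyspark", "pyspark")  # same backend: no error
try:
    check_same_backend("pyspark", "duckdb")
except CrossBackendError as e:
    print(e)  # cannot transfer data between duckdb and pyspark backends
```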
Converting to a feature request given the above.
> I think it adds a lot of value to be able to easily (and ideally efficiently) move data across all the systems Ibis supports for a few use cases:
I agree. And also, this is a monstrous problem for anything that doesn't have native Arrow support. If we plan to try to recreate odo, we should first check in with the ibis devs who used to work on it, and second, make a different plan.
limiting to backends that have native arrow support seems fine to me. perhaps w/ exceptions for postgres and sqlite given how common they are for source data into analytics (idk if they support Arrow but I assume not)
> (idk if they support Arrow but I assume not)
they do not. we could accomplish this in the short term by using duckdb's ATTACH features, although that would make duckdb a dependency of the sqlite and postgres backends (it could be an optional dependency). medium-term, I think we should use ADBC for this.
oh yeah I like using DuckDB for that -- would vote for optional dependency
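For reference, the ATTACH path would look roughly like this in DuckDB SQL (a sketch: the file name and table name are placeholders, and the sqlite extension has to be installed and loaded first):

```sql
-- sketch: exposing a SQLite database to DuckDB via ATTACH
INSTALL sqlite;
LOAD sqlite;
ATTACH 'source.db' AS src (TYPE sqlite);
SELECT * FROM src.some_table;  -- read through DuckDB's native sqlite reader
```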
I think this issue is a duplicate of #4800?
Implementation-wise, the common case would be iterating over to_pyarrow_batches() and inserting each batch in turn (possibly inside a transaction so we can roll it back on failure :shrug:). The trick here would be exposing fast paths for backends like duckdb that include native support for reading from/writing to another backend. AFAICT duckdb is unique among our backends in this ability (it includes native readers/writers for sqlite/mysql/postgres).

I'd vote to close this in favor of #4800, but with a focus on designing #4800 so we can handle the fast-path support in duckdb.
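The batch-iteration fallback described above can be sketched with stdlib sqlite3 standing in for both backends. `copy_batches` and the plain row tuples are illustrative assumptions; a real implementation would consume pyarrow RecordBatches from to_pyarrow_batches():

```python
import sqlite3

def copy_batches(batches, dest, table):
    # Insert every batch inside a single transaction: the sqlite3
    # connection context manager commits on success and rolls back
    # on any exception, so the destination is never half-written.
    with dest:
        for batch in batches:
            dest.executemany(f"INSERT INTO {table} VALUES (?, ?)", batch)

dest = sqlite3.connect(":memory:")
dest.execute("CREATE TABLE t (a INTEGER, b TEXT)")

copy_batches([[(1, "x"), (2, "y")], [(3, "z")]], dest, "t")
print(dest.execute("SELECT COUNT(*) FROM t").fetchone()[0])  # 3

# a failing batch rolls the whole copy back, not just that batch
try:
    copy_batches([[(4, "w")], [("oops",)]], dest, "t")  # wrong arity -> error
except sqlite3.ProgrammingError:
    pass
print(dest.execute("SELECT COUNT(*) FROM t").fetchone()[0])  # still 3
```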
What happened?
trying to create example data in a PySpark connection and running into errors
repro:
try with `to_pyarrow()`:

What version of ibis are you using?
main
What backend(s) are you using, if any?
duckdb + pyspark