One more thought is that we could keep the current slower row-at-a-time chunk creation and only consider performance when adding the columnar chunk creation. For the columnar chunk creation, we'll only have to infer the type once and can provide dedicated functions for specific (common) types.
Even the appender could be extended with a function `AppendColumn(s)`.
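To make the idea concrete, a columnar append could look roughly like the sketch below. Everything here is hypothetical (`AppendColumn` and the toy `Appender` are invented names, not the current go-duckdb API); the point is that the type switch runs once per column instead of once per row:

```go
package main

import "fmt"

// Appender is a toy stand-in for the driver's appender; AppendColumn and the
// fields below are hypothetical, not part of the current go-duckdb API.
type Appender struct {
	columns [][]int64 // simplified: this sketch only models BIGINT columns
}

// AppendColumn appends a whole column of values at once. The element type is
// inferred a single time from the slice, instead of once per appended row.
func (a *Appender) AppendColumn(colIdx int, vals any) error {
	switch v := vals.(type) {
	case []int64:
		// Dedicated fast path for a common type.
		a.columns[colIdx] = append(a.columns[colIdx], v...)
		return nil
	default:
		return fmt.Errorf("AppendColumn: unsupported column type %T", vals)
	}
}

func main() {
	a := &Appender{columns: make([][]int64, 1)}
	if err := a.AppendColumn(0, []int64{1, 2, 3}); err != nil {
		panic(err)
	}
	fmt.Println(a.columns[0]) // [1 2 3]
}
```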
Thanks. This is definitely in the right direction. Adding some notes from my perspective:
I have a rough extension of `sql.Rows` that works on vectors and can serve as inspiration for usage. I don't think the interface is a good fit for this library, since it should return the `DataChunk` instead, but it shows how I'm currently using it to fetch vectors. I can add some test files to show example usage if you think that makes sense.
I found myself implementing a few customized functions (`StringVecHash`, `StringVecDict`, etc.), mostly to avoid memory allocations/string conversions. Most of these would not be necessary once scalar function UDFs are available, but making the data chunk/vectors easy to extend for similar cases would be nice anyway. Note that my fork primarily covers the data types I need; it does not implement all data types.
Features I'd like to see supported later:
- `[Append]Serialize([]byte) ([]byte, error)`
- `Deserialize([]byte) ([]byte, error)`
Perhaps the C API can be extended to support DuckDB's in-memory serialization?

I think this looks good, although I would prefer a different initialization setup, as it is currently appender-centric. Just renaming `init` to `initFromTypes`, for example, would allow for a potential `initFromDatachunk` or similar for UDFs.
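A minimal sketch of that split, with stand-in types (`Type`, `vector`, and both function names besides `init` are illustrative, not driver code):

```go
package main

import "fmt"

// Illustrative stand-ins; the real column type and vector live in the driver.
type Type int

type vector struct{ typ Type }

type DataChunk struct{ vectors []vector }

// initFromTypes initializes the chunk's vectors from a list of column types,
// as the appender-centric init does today.
func (d *DataChunk) initFromTypes(types []Type) {
	d.vectors = make([]vector, len(types))
	for i, t := range types {
		d.vectors[i] = vector{typ: t}
	}
}

// initFromDataChunk initializes from an existing chunk (e.g. one handed to a
// UDF), reusing its layout instead of re-deriving it from a type list.
func (d *DataChunk) initFromDataChunk(src *DataChunk) {
	d.vectors = make([]vector, len(src.vectors))
	copy(d.vectors, src.vectors)
}

func main() {
	var a, b DataChunk
	a.initFromTypes([]Type{1, 2})
	b.initFromDataChunk(&a)
	fmt.Println(len(b.vectors)) // 2
}
```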
Also, there is no public API yet, right? I think that is the most important part. But this is going in the right direction for sure; I have no real comments at the moment, looks good!
There are still some open problems/questions at the moment, I think mostly the public API.
I would propose something like the following as the API:
```go
GetValue(c, r int, val any) error
SetValue(c, r int, val any) error
GetValues(c int) (any, error)    // always returns a slice []T, where T is the column's type
SetValues(c int, vals any) error // vals must be a slice []T, where T is the column's type
GetData(c int) unsafe.Pointer
Columns() []ColumnMetaData       // ColumnMetaData from my PR, could be moved/copied here
```
and other generic functions that sadly can't be methods:
```go
ColumnGetValue[T any](chunk DataChunk, c, r int) (T, error)
ColumnSetValue[T any](chunk DataChunk, c, r int, val T) error
ColumnGetValues[T any](chunk DataChunk, c int) ([]T, error)
ColumnSetValues[T any](chunk DataChunk, c int, vals []T) error
ColumnGetData[T any](chunk DataChunk, c int) ([]T, error) // errors only if the type cannot be cast
```
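For illustration, usage of the generic accessors could look like this self-contained mock. The toy `DataChunk` and the body of `ColumnGetValues` are assumptions; only the proposed signature comes from the list above:

```go
package main

import "fmt"

// Toy stand-in: one []T per column, stored as any.
type DataChunk struct{ cols []any }

// ColumnGetValues sketches the proposed generic accessor: it errors only if
// column c does not hold a []T.
func ColumnGetValues[T any](chunk DataChunk, c int) ([]T, error) {
	vals, ok := chunk.cols[c].([]T)
	if !ok {
		return nil, fmt.Errorf("column %d does not hold %T", c, vals)
	}
	return vals, nil
}

func main() {
	chunk := DataChunk{cols: []any{[]int64{1, 2, 3}, []string{"a", "b", "c"}}}
	ids, _ := ColumnGetValues[int64](chunk, 0)
	names, _ := ColumnGetValues[string](chunk, 1)
	fmt.Println(ids, names) // [1 2 3] [a b c]
}
```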
Do you have any issues with, or additions to, this API? I think it covers most use cases. I am a little sad that methods can't have type parameters at the moment, but it is what it is.
Another question is whether or not we want to support projected columns at the data chunk level. I think there is a benefit to doing this. For table functions, it would be very useful for transparently implementing projection pushdown, which is very nice for the end user.
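As a hedged sketch of what that could mean (all names invented, not driver code): projected columns would be a thin logical-to-physical index mapping on top of the chunk:

```go
package main

import "fmt"

// projectedChunk exposes only the columns a table function was asked for by
// mapping logical (projected) indices to physical ones.
type projectedChunk struct {
	physical []string // stand-in for the full chunk's columns/vectors
	proj     []int    // proj[i] = physical index of logical column i
}

// column resolves a logical index, so a table function only ever fills the
// columns the query actually selected (projection pushdown).
func (p *projectedChunk) column(logical int) string {
	return p.physical[p.proj[logical]]
}

func main() {
	// Physical schema (id, name, payload); the query selected only id, payload.
	p := &projectedChunk{
		physical: []string{"id", "name", "payload"},
		proj:     []int{0, 2},
	}
	fmt.Println(p.column(0), p.column(1)) // id payload
}
```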
> I have a rough extension of `sql.Rows` that works on vectors and can serve as inspiration for usage.
Yes, I am conflicted between the special-casing for each type vs. the performance gain. I am not fully sure how Go performs if type casting happens on a columnar level. The current per-row casts could definitely be sped up by moving to more columnar approaches.
> Just renaming `init` to `initFromTypes`, for example, would allow for a potential `initFromDatachunk` or similar for UDFs.
Great suggestion!
> Also, there is no public API yet, right? I think that is the most important part.
Yes, I was curious which functions to expose, and you gave the answer, haha. I've added some of them and will add the remaining ones in separate commits.
> Do you have any issues with, or additions to, this API? I think it covers most use cases. I am a little sad that methods can't have type parameters at the moment, but it is what it is.
No, looks great. I am very sad about that too, haha. But yes, we work with what we have. :)
> Yes, I am conflicted between the special-casing for each type vs. the performance gain. I am not fully sure how Go performs if type casting happens on a columnar level. The current per-row casts could definitely be sped up by moving to more columnar approaches.
Good point. We can leave out the generic functions for vector operations (generic and/or specialized, this applies to both) until someone looks into this further. I think the bottleneck would still be getting the data pointer from C, but that can be optimized away, since it doesn't (really) change. Once this optimization is done, we can always revisit it and potentially add generic functions.
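A sketch of that optimization (names invented; in the driver, the pointer would come from the C API's `duckdb_vector_get_data`, fetched once per chunk rather than on every access):

```go
package main

import (
	"fmt"
	"unsafe"
)

// vector caches the C data pointer once per chunk instead of crossing cgo
// (e.g. a duckdb_vector_get_data call) on every access. Hypothetical sketch.
type vector struct {
	ptr unsafe.Pointer
}

// initData would run once after the chunk is (re)initialized.
func (v *vector) initData(p unsafe.Pointer) { v.ptr = p }

// data is the hot-path accessor: a plain field read, no cgo call.
func (v *vector) data() unsafe.Pointer { return v.ptr }

func main() {
	x := int64(42)
	var v vector
	v.initData(unsafe.Pointer(&x))
	fmt.Println(*(*int64)(v.data())) // 42
}
```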
What is the progress on this? The merges from main make it difficult to see what changed. I feel like the UDF additions have stalled a bit and are starting to drag (which might be partially my fault).
Hi @JAicewizard, no worries. I can update you on my/our current plan/roadmap for the `DataChunk` API.

1. Add an initial `DataChunk` API, but without the aim to have sufficient support already.
2. Extend the `Appender` to support the same column types as the scanner. I'm mostly done; I just need to write a few more tests and open a PR.
3. Use the `DataChunk` API to create the getter functions.
4. Extend the `DataChunk` API with the missing functionality for UDFs and faster columnar functions.

So, I expect it will still take a few weeks until the (table) UDFs are merged, but ideally, we'll then have a unified way of handling data chunks in place, plus ways to speed up performance.
I'll close this in favor of #240, #237, and #233. Any further (exported) functions can be introduced with the related PRs. Specifically, the next step is to return to and finish #201.
After thinking about some of the discussions in #201, I've decided to draft a PR for creating and scanning data chunks in the go-duckdb driver. Maybe most importantly, we should decide on how the `dataChunk` and `vector` structs look, and what functions we want for them. Ideally, all data chunk and vector operations should go through functions in the new files `data_chunk.go` and `vector.go`.

Maybe I went in a completely wrong direction here, what do you think @ajzo90 @JAicewizard?
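For discussion, here is one rough shape those two files could take. This is an illustrative sketch, not the committed layout; in the driver, the `handle` fields would be cgo types (`duckdb_data_chunk`, `duckdb_vector`), replaced by `uintptr` here so the sketch runs without cgo:

```go
package main

import "fmt"

type vector struct {
	handle uintptr // stand-in for the C vector handle
	// setFn writes one Go value into this vector at a row index; resolving it
	// once per column keeps per-row type switching off the hot path.
	setFn func(v *vector, row int, val any)
}

type dataChunk struct {
	handle  uintptr // stand-in for the C data chunk handle
	columns []vector
}

// SetValue dispatches to the column's pre-resolved write function.
func (d *dataChunk) SetValue(col, row int, val any) {
	d.columns[col].setFn(&d.columns[col], row, val)
}

func main() {
	c := dataChunk{columns: []vector{{
		setFn: func(v *vector, row int, val any) { fmt.Println("set row", row, "to", val) },
	}}}
	c.SetValue(0, 0, int64(7)) // set row 0 to 7
}
```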
Questions

- The main performance costs in the `Appender` are the function calls via the callbacks and the type casting. @JAicewizard made changes to the vector functions in `appender_vector.go` to avoid type-casting overhead. How could this be ported to this PR to improve efficiency without breaking the `Appender`? See comment here. (A sketch of the pattern follows after this list.)
- The use of `any` in the `Appender`.
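The following sketch shows the general pattern of per-type writers versus `any`-based writes. It is an invented, minimal example, not the actual `appender_vector.go` diff:

```go
package main

import "fmt"

// writeInt64 is a dedicated per-type writer: the element type is known
// statically, so each row write is a plain slice store with no cast.
func writeInt64(dst []int64, row int, v int64) { dst[row] = v }

// writeAny is the generic alternative: every value goes through any and a
// per-row type assertion, which is the overhead the typed path avoids.
func writeAny(dst []int64, row int, v any) { dst[row] = v.(int64) }

func main() {
	col := make([]int64, 4)
	for i := 0; i < 4; i++ {
		writeInt64(col, i, int64(i)) // typed hot path, resolved once per column
	}
	writeAny(col, 0, int64(9)) // per-row assertion path
	fmt.Println(col)           // [9 1 2 3]
}
```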
FIXMEs and future work

- Add support for `AppendColumn` or similar to append values in a columnar approach. See @ajzo90's usage here. Here is also a comment w.r.t. adding this.
- Add support for scanning a data chunk (see the sketch below).
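A possible shape for that scanning support, as a hedged sketch over a toy columnar container (nothing here is existing driver code):

```go
package main

import "fmt"

// chunk is a toy columnar container; the driver would read C vectors instead.
type chunk struct {
	cols [][]any
}

// scan walks the chunk row by row and hands each row's values to fn,
// mirroring how database/sql hands rows to Scan.
func (c *chunk) scan(fn func(row int, vals []any)) {
	if len(c.cols) == 0 {
		return
	}
	vals := make([]any, len(c.cols))
	for row := range c.cols[0] {
		for col := range c.cols {
			vals[col] = c.cols[col][row]
		}
		fn(row, vals)
	}
}

func main() {
	c := &chunk{cols: [][]any{{int64(1), int64(2)}, {"a", "b"}}}
	c.scan(func(row int, vals []any) { fmt.Println(row, vals) })
}
```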
Other comments

- This sounds very similar to some of the type inference in the `Appender`.
- I only found this comment. Maybe you could point me to the scanner type?
More future TODOs