API draft to create and scan data chunks

taniabogatsch commented 1 month ago

After thinking about some of the discussions in #201, I've decided to draft a PR for creating and scanning data chunks in the go-duckdb driver. Maybe most importantly, we should decide on how the dataChunk and vector look, and what functions we want for them. Ideally, all data chunk and vector operations should go through functions in the new files data_chunk.go and vector.go.

Maybe I went in a completely wrong direction here, what do you think @ajzo90 @JAicewizard?

Questions

Which functions should be exposed? This is relevant to support table UDFs as discussed here.
The biggest overhead in the Appender are the function calls via the callbacks and the type casting. @JAicewizard made changes to the vector functions in appender_vector.go to avoid type casting overhead. How could this be ported to this PR to improve efficiency without breaking the Appender? See comment here.
- A TODO for me is to look at this again tomorrow and see how we can avoid any in the Appender.

FIXMEs and future work

Add support for AppendColumn or similar to append values in a columnar approach. See @ajzo90's usage here. Here is also a comment w.r.t. adding this.

I was thinking maybe something with a method SetValues[T Any](data []T) which sets the first len(data) values to the specified ones, and clears the others. But I am open for ideas!
Add support for scanning a data chunk.

Other comments

The function takes in an interface, from which it detects the type of the parameters and return values.

This sounds very similar to some of the type inference in the Appender.

In the design of the Chunk and Vector apis it would be wise to also think of a way we might be able to vectorise the Appender API at the same time. (The scanner type for example looks a lot like something that could be used in the appender).

I only found this comment. Maybe you could point me to the scanner type?

// Use as the `Scanner` type for any composite types (maps, lists, structs)
type Composite[T any] struct {
    t T
}

More future TODOs

GetMetaData, see table UDF PR.
Add all templated functions.
Projection pushdown.
Support for columnar setters and getters. I removed them from this PR for scope.

// SetColumn sets the column to val, where val is a slice []T. T is the type of the column.
func (chunk *DataChunk) SetColumn(colIdx int, val any) error {
    // TODO
    return errNotImplemented
}

// GetColumn returns a slice []T containing the column values. T is the type of the column.
func (chunk *DataChunk) GetColumn(colIdx int) (any, error) {
    // TODO
    return nil, errNotImplemented
}

// GetColumnData returns a pointer to the underlying data of a column.
func (chunk *DataChunk) GetColumnData(colIdx int) (unsafe.Pointer, error) {
    // TODO
    return nil, errNotImplemented
}

taniabogatsch commented 1 month ago

One more thought is that we could keep the current slower row-at-a-time chunk creation and only consider performance when adding the columnar chunk creation. For the columnar chunk creation, we'll only have to infer the type once and can provide dedicated functions for specific (common) types.

Even the appender could be extended with a function AppendColumn(s).

ajzo90 commented 1 month ago

Thanks. This is definitely in the right direction. Adding some notes from my perspective:

I have a rough extension of sql.Rows that works on vectors that can serve as inspiration for usage. I don't think the interface is worthy for this library since it should return the DataChunk instead, but it shows how I'm using it to fetch vectors currently. I can add some test files to show example usage if you think it make sense.

I found myself implementing a few customised functions (StringVecHash, StringVecDict etc mostly to avoid memory allocations/string conversions). Most of these would not be necessarily when the scalar function UDF if available, but making the data_chunk/vectors easy to extend for similar cases would be nice anyway. Note that my fork primarily considers data types of my own needs, and it does not implement all data types.

Features I like to be supported later:

Serialization and deserialization for DataChunk and Vector. [Append]Serialize([]byte)([]byte, error) Deserialize([]byte)([]byte, error). Perhaps the C-API can be extended to support duckdbs in-memory serialization?

JAicewizard commented 1 month ago

I think this looks good, although I would prefer to have a different initialisation part, as there is currently appender-centric. Just renaming init to initFromTypes for example. This would allow for a potential initFromDatachunk or similar for UDFs.

Also there is no public API yet right? I think that is the most important part. But this is going in the right direction for sure, I have no real comments at the moment, looks good!

There are still some open problems/questions at the moment, I think mostly the public API.

What I would have something like the following as API:

GetValue(c,r int, val any) error
SetValue(c,r int) (val error)
GetValues(c int) (vals any, error) // vals is always a slice []T where T is the type of the column
SetValues(c, vals any) (val error) // vals should be a slcie []T where T is the type of the column
GetData(c int) unsafe.Pointer
Columns() []ColumnMetaData // ColumnMetaData from my PR, could be moved/copied here

and other generic functions that cant sadly be methods

ColumnGetValue[T](chunk DataChunk, c,r int, val T) error
ColumnSetValue[T](chunk DataChunk, c,r int) (T, error)
ColumnGetValues[T](chunk DataChunk, c int) ([]T, error)
ColumnSetValues[T](chunk DataChunk, c, vals []T) (val error)
ColumnGetData[T](c int) ([]T, error) // Error only if the type cannot be cast

Do you gave any issues/additions with this API? I think it covers most use-cases. I am a little sad about that methods can't have type parameters at this moment, but it is what it is.

Another question would be whether or not we want to support projected columns at the datachunk level. I think there is a benefit to doing this. For table functions this would be very useful to transparently implement projectionpushdown which is very nice for the end-user.

taniabogatsch commented 1 month ago

I have a rough extension of sql.Rows that works on vectors that can serve as inspiration for usage.

Yes, I am conflicted between the special-casing for each type vs. the performance gain. I am not fully sure how Go performs if type casting happens on a columnar level. The current per-row casts could definitely be sped up by moving to more columnar approaches.

Just renaming init to initFromTypes for example. This would allow for a potential initFromDatachunk or similar for UDFs.

Great suggestion!

Also there is no public API yet right? I think that is the most important part.

Yes, I was curious which functions to expose. And you gave the answer haha, I've added some of them, and will add the remaining ones in different commits.

Do you gave any issues/additions with this API? I think it covers most use-cases. I am a little sad about that methods can't have type parameters at this moment, but it is what it is.

No, looks great. I am very sad about that, hahah. But yes, we work with what we have. :)

JAicewizard commented 1 month ago

Yes, I am conflicted between the special-casing for each type vs. the performance gain. I am not fully sure how Go performs if type casting happens on a columnar level. The current per-row casts could definitely be sped up by moving to more columnar approaches.

Good point. We can leave out the generic functions for vector operations until further someone looks into this. I think the bottleneck would still be getting the data pointer from C, but this can be optimized away since it doesn't (really) change. Once this optimization is done we can always look into it and potentially add generic functions.

JAicewizard commented 1 month ago

(generic and/or specialized, applies to both)

JAicewizard commented 1 month ago

What is the progress on this? The merges from main make it difficult to see what changed. I feel like the UDF additions have stalled a bit and are starting to drag (might be partially my fault)

taniabogatsch commented 4 weeks ago

Hi @JAicewizard, No worries. I can update you on my/our current plan/roadmap for the DataChunk API.

[x] https://github.com/marcboeker/go-duckdb/pull/233 to expose the DataChunk. But without the aim to have sufficient support already.
- [ ] Fixes and nits after merging.
[x] Expand the Appender to support the same column types as the scanner. I'm mostly done; I just need to write a few more tests and open a PR.
[x] Moving the scanner into the same DataChunk API to create the getter functions.
[ ] Expanding the DataChunk API with missing functionality for UDFs and faster columnar functions.

So, I expect it will still take a few weeks until the (table) UDFs are merged, but ideally, we'll have a unified way for handling data chunks in place and ways to speed up performance.

taniabogatsch commented 2 weeks ago

I'll close this in favor of #240, #237, and #233. Any further (exported) functions can be introduced with the related PRs. Specifically, the next step is to return to and finish #201.

marcboeker / go-duckdb

API draft to create and scan data chunks #219