marcboeker / go-duckdb

go-duckdb provides a database/sql driver for the DuckDB database engine.
MIT License

API draft to create and scan data chunks #219

Closed taniabogatsch closed 5 months ago

taniabogatsch commented 6 months ago

After thinking about some of the discussions in #201, I've decided to draft a PR for creating and scanning data chunks in the go-duckdb driver. Maybe most importantly, we should decide on how the dataChunk and vector look, and what functions we want for them. Ideally, all data chunk and vector operations should go through functions in the new files data_chunk.go and vector.go.
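To make the shapes concrete, here is a rough sketch of what the two types could look like. All field names are hypothetical placeholders, not a final layout; the C handles are shown as unsafe.Pointer to keep the sketch self-contained.

// Rough sketch only; every field name is a placeholder.
type DataChunk struct {
    chunk   unsafe.Pointer // the underlying C duckdb_data_chunk
    columns []vector       // one vector per column
}

type vector struct {
    ptr unsafe.Pointer // the underlying C duckdb_vector
    // type-specific accessors, resolved once when the chunk is initialised
    getFn func(rowIdx int) any
    setFn func(rowIdx int, val any) error
}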

Maybe I went in a completely wrong direction here, what do you think @ajzo90 @JAicewizard?

Questions

FIXMEs and future work

Other comments

> The function takes in an interface, from which it detects the type of the parameters and return values.

This sounds very similar to some of the type inference in the Appender.

> In the design of the Chunk and Vector APIs, it would be wise to also think of a way we might be able to vectorise the Appender API at the same time. (The scanner type, for example, looks a lot like something that could be used in the appender.)

I only found this comment. Maybe you could point me to the scanner type?

// Use as the `Scanner` type for any composite types (maps, lists, structs)
type Composite[T any] struct {
    t T
}
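If I understand the type correctly, usage looks roughly like this (a sketch assuming the exported Composite type and its Get accessor):

// Rough usage sketch: Composite implements sql.Scanner, so composite DuckDB
// values (maps, lists, structs) can be scanned through database/sql.
var list duckdb.Composite[[]any]
if err := db.QueryRow(`SELECT [1, 2, 3]`).Scan(&list); err != nil {
    // handle the scan error
}
elems := list.Get() // the decoded list elements, as []any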

More future TODOs

// SetColumn sets the column to val, where val is a slice []T. T is the type of the column.
func (chunk *DataChunk) SetColumn(colIdx int, val any) error {
    // TODO
    return errNotImplemented
}

// GetColumn returns a slice []T containing the column values. T is the type of the column.
func (chunk *DataChunk) GetColumn(colIdx int) (any, error) {
    // TODO
    return nil, errNotImplemented
}

// GetColumnData returns a pointer to the underlying data of a column.
func (chunk *DataChunk) GetColumnData(colIdx int) (unsafe.Pointer, error) {
    // TODO
    return nil, errNotImplemented
}
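For a fixed-width type, a columnar SetColumn could reduce to a single bulk copy into the vector's C buffer. A minimal sketch, assuming the chunk holds its vectors in a columns slice and each vector exposes its buffer via a (made-up) dataPtr helper wrapping duckdb_vector_get_data:

// Minimal sketch for a fixed-width column; the field and helper names are
// hypothetical. unsafe.Slice reinterprets the vector's C buffer as a Go
// slice, so copy performs one bulk write instead of one cgo call per row.
func setColumnInt64(chunk *DataChunk, colIdx int, vals []int64) error {
    if colIdx < 0 || colIdx >= len(chunk.columns) {
        return errNotImplemented // placeholder; a real version returns a column-index error
    }
    data := chunk.columns[colIdx].dataPtr() // assumed wrapper around duckdb_vector_get_data
    dst := unsafe.Slice((*int64)(data), len(vals))
    copy(dst, vals)
    return nil
}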
taniabogatsch commented 6 months ago

One more thought: we could keep the current, slower row-at-a-time chunk creation and only address performance when adding columnar chunk creation. For the columnar path, we only have to infer the type once and can provide dedicated functions for specific (common) types.

Even the appender could be extended with a function AppendColumn(s).
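Purely as an illustration (this function does not exist in the driver), such an extension might look like:

// Hypothetical Appender extension: append a whole column at once, so the
// value type is inferred once per column rather than once per row.
func (a *Appender) AppendColumn(colIdx int, vals any) error {
    switch v := vals.(type) {
    case []int64:
        // bulk-copy into the current chunk's vector, as in the
        // setColumnInt64 sketch above
        _ = v
        return nil
    default:
        return errNotImplemented // other types fall back to per-row appends
    }
}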

ajzo90 commented 6 months ago

Thanks. This is definitely in the right direction. Adding some notes from my perspective:

I have a rough extension of sql.Rows that works on vectors, which can serve as inspiration for usage. I don't think the interface belongs in this library as-is, since it should return the DataChunk instead, but it shows how I'm currently fetching vectors. I can add some test files to show example usage if you think it makes sense.
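The rough shape of that extension, for context (hypothetical, not something I'm proposing as this library's API):

// Hypothetical vectorised rows interface: instead of scanning one row at a
// time, the caller pulls whole chunks and reads columns directly.
type VectorRows interface {
    // NextChunk returns the next DataChunk, or an error such as io.EOF
    // once the result set is exhausted.
    NextChunk() (*DataChunk, error)
}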

I found myself implementing a few customised functions (StringVecHash, StringVecDict, etc., mostly to avoid memory allocations/string conversions). Most of these would not be necessary once scalar function UDFs are available, but making the data chunk/vectors easy to extend for similar cases would be nice anyway. Note that my fork primarily covers the data types I need myself and does not implement all data types.

Features I'd like to see supported later:

JAicewizard commented 6 months ago

I think this looks good, although I would prefer a different initialisation part, as it is currently appender-centric. Just renaming init to initFromTypes, for example. This would allow for a potential initFromDatachunk or similar for UDFs.
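A sketch of how that split could look (signatures and bodies are hypothetical):

// Hypothetical split of the initialisation paths. The appender keeps its
// type-driven path; UDFs wrap a chunk that DuckDB already allocated.
func (chunk *DataChunk) initFromTypes(types []any) error {
    // types is a stand-in for DuckDB logical types.
    chunk.columns = make([]vector, len(types))
    // ... allocate a fresh C chunk for the given column types ...
    return nil
}

func (chunk *DataChunk) initFromDatachunk(ptr unsafe.Pointer) error {
    chunk.chunk = ptr // reuse the chunk handed to us by DuckDB
    // ... initialise the column vectors from the existing chunk ...
    return nil
}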

Also there is no public API yet right? I think that is the most important part. But this is going in the right direction for sure, I have no real comments at the moment, looks good!

There are still some open problems/questions at the moment, I think mostly the public API.

I would have something like the following as the API:

GetValue(c, r int) (any, error)
SetValue(c, r int, val any) error
GetValues(c int) (any, error)    // always returns a slice []T, where T is the type of the column
SetValues(c int, vals any) error // vals should be a slice []T, where T is the type of the column
GetData(c int) unsafe.Pointer
Columns() []ColumnMetaData       // ColumnMetaData from my PR, could be moved/copied here

and other generic functions that sadly can't be methods:

ColumnGetValue[T any](chunk DataChunk, c, r int) (T, error)
ColumnSetValue[T any](chunk DataChunk, c, r int, val T) error
ColumnGetValues[T any](chunk DataChunk, c int) ([]T, error)
ColumnSetValues[T any](chunk DataChunk, c int, vals []T) error
ColumnGetData[T any](chunk DataChunk, c int) ([]T, error) // errors only if the type cannot be cast
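To illustrate the difference (a usage sketch assuming the signatures above): the generic form hands the caller a typed slice directly, while the method form requires a type assertion.

// Usage sketch, not a complete program.
// Generic form: the caller gets a typed slice directly.
ints, err := ColumnGetValues[int64](chunk, 0)

// Method form: the caller must assert the column type.
raw, err := chunk.GetValues(0)
asserted, ok := raw.([]int64)
if !ok {
    // column 0 is not an int64 column
}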

Do you have any issues/additions with this API? I think it covers most use cases. I am a little sad that methods can't have type parameters at the moment, but it is what it is.

Another question is whether or not we want to support projected columns at the data chunk level. I think there is a benefit to doing this. For table functions, this would let us transparently implement projection pushdown, which is very nice for the end user.
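One way to picture projected columns (purely illustrative; the names are hypothetical): the chunk carries the pushed-down column selection and translates logical indexes to physical ones, so the projection stays invisible to the caller.

// Hypothetical projected chunk: logical column i lives at physical index
// projection[i].
type projectedChunk struct {
    chunk      *DataChunk
    projection []int
}

func (p *projectedChunk) GetColumn(colIdx int) (any, error) {
    return p.chunk.GetColumn(p.projection[colIdx])
}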

taniabogatsch commented 6 months ago

> I have a rough extension of sql.Rows that works on vectors, which can serve as inspiration for usage.

Yes, I am conflicted about the special-casing for each type vs. the performance gain. I am not fully sure how Go performs if type casting happens at the columnar level. The current per-row casts could definitely be sped up by moving to more columnar approaches.

> Just renaming init to initFromTypes, for example. This would allow for a potential initFromDatachunk or similar for UDFs.

Great suggestion!

> Also there is no public API yet right? I think that is the most important part.

Yes, I was curious which functions to expose, and you gave the answer, haha. I've added some of them and will add the remaining ones in separate commits.

> Do you have any issues/additions with this API? I think it covers most use cases. I am a little sad that methods can't have type parameters at the moment, but it is what it is.

No, looks great. I am very sad about that, hahah. But yes, we work with what we have. :)

JAicewizard commented 6 months ago

> Yes, I am conflicted about the special-casing for each type vs. the performance gain. I am not fully sure how Go performs if type casting happens at the columnar level. The current per-row casts could definitely be sped up by moving to more columnar approaches.

Good point. We can leave out the generic functions for vector operations until someone looks into this further. I think the bottleneck would still be getting the data pointer from C, but that can be optimized away, since it doesn't (really) change. Once that optimization is done, we can always revisit this and potentially add generic functions.
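The optimization hinted at above could look roughly like this, under the assumption that the vector's data pointer is stable for the lifetime of the chunk (names hypothetical):

// Hypothetical cached column view: cross cgo once per column to fetch the
// data pointer, then serve all row accesses from the resulting Go slice.
type int64Column struct {
    data []int64
}

func newInt64Column(ptr unsafe.Pointer, rows int) int64Column {
    // ptr is assumed to come from a single duckdb_vector_get_data call.
    return int64Column{data: unsafe.Slice((*int64)(ptr), rows)}
}

func (c int64Column) get(row int) int64 { return c.data[row] }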

JAicewizard commented 6 months ago

(generic and/or specialized, applies to both)

JAicewizard commented 5 months ago

What is the progress on this? The merges from main make it difficult to see what changed. I feel like the UDF additions have stalled a bit and are starting to drag (might be partially my fault)

taniabogatsch commented 5 months ago

Hi @JAicewizard, no worries. I can update you on my/our current plan/roadmap for the DataChunk API.

So I expect it will still take a few weeks until the (table) UDFs are merged, but by then we should ideally have a unified way of handling data chunks in place, plus ways to speed up performance.

taniabogatsch commented 5 months ago

I'll close this in favor of #240, #237, and #233. Any further (exported) functions can be introduced with the related PRs. Specifically, the next step is to return to and finish #201.