marcboeker / go-duckdb

go-duckdb provides a database/sql driver for the DuckDB database engine.
MIT License
623 stars 97 forks source link

Add Nested Types to the Appender #150

Closed maiadegraaf closed 6 months ago

maiadegraaf commented 7 months ago

I'm currently working on adding nested types to the appended. I haven't finished it yet but I'm opening up a draft for better coordination.

The Big Changes:

@taniabogatsch and I are still ironing out some final changes here

taniabogatsch commented 6 months ago

I just saw that go test still fails like so:

tania@motorbook go-duckdb % go test
--- FAIL: TestAppenderNestedStructMismatch (0.00s)
    appender_nested_test.go:471: 
            Error Trace:    /Users/tania/DuckDB/go-duckdb/appender_nested_test.go:471
            Error:          Received unexpected error:
                            Duckdb has returned an error while appending, all data has been invalidated.
                             Check that the data being appended matches the schema.
                             Error from duckdb: Invalid Input Error: Type mismatch in Append DataChunk and the types required for appender
            Test:           TestAppenderNestedStructMismatch
FAIL
exit status 1
FAIL    github.com/marcboeker/go-duckdb 53.019s
maiadegraaf commented 6 months ago

Thanks for your feedback! I've implemented all the requested changes, so the PR is ready to be merged!

marcboeker commented 6 months ago

@maiadegraaf Thanks for the PR. Just a quick update: I'm currently refactoring this a bit.

marcboeker commented 6 months ago

First iteration of refactorings; more will follow.

taniabogatsch commented 6 months ago

@marcboeker, what motivates removing significant test coverage to have the tests run faster? Especially since these cover the new nested functionality?

marcboeker commented 6 months ago

@marcboeker, what motivates removing significant test coverage to have the tests run faster? Especially since these cover the new nested functionality?

Please review the appender_test.go file, where nested and base cases have been consolidated and refactored (though not yet complete). The only test removed is the nested large test, as I believe the base nested test is sufficient for testing with e.g. 10 rows instead of 10000, without sacrificing the speed of the test suite. Rapid tests are essential for working with them.

Although benchmarks can be helpful, they are not present for the rest of go-duckdb. Therefore, I have removed them. My aim is to create test cases that cover a significant portion of the code, are comprehensible, and can serve as a form of documentation.

There are still missing test cases for the appender, such as UUID.

taniabogatsch commented 6 months ago

Thanks for the explanation.

The only test removed is the nested large test, as I believe the base nested test is sufficient for testing with e.g. 10 rows instead of 10000

The reason for having more than ten rows is that DuckDB's standard vector size is 2048. I see that the primitive types now cover 3k rows, which tests the creation of multiple chunks. I missed that change. Child vectors can exceed the standard vector size for nested types, so testing this might still be interesting.

without sacrificing the speed of the test suite. Rapid tests are essential for working with them.

It is possible to use a regex to run only specific tests, and it is also possible to group slow tests into a common expression to exclude (e.g., _slow_test.go). That way, it is possible to include tests closer to real-usage scenarios while keeping fast tests for rapid development.

This is mostly me trying to understand the project better and better understand the structure for future contributions. Maybe it is worth setting up a section in the contributing guidelines about this?

marcboeker commented 6 months ago

I've done some more refactoring and added tests for the UUID type, time.Time and []byte.

Unfortunately, appending a blob does not work at the moment, hence the TestAppenderBlob is skipped.

@maiadegraaf Do you know what type a DUCKDB_TYPE_BLOB is internally? Is it a primitive type or a list (DUCKDB_TYPE_LIST) of uint8 (which i doubt)?

The current implementation triggers a segmentation violation in the following line:

state = C.duckdb_append_data_chunk(*a.appender, chunk)

The logical type we set is: C.duckdb_create_logical_type(C.DUCKDB_TYPE_BLOB) and the value for the vector row index is set via:

setPrimitive[[]byte](colInfo, rowIdx, val.([]byte))

One weird thing is that the appender is a bit flaky under Ubuntu. Locally on macOS and in the CI under macOS/FreeBSD, everything works reliably. It fails at the same operation as the blob appending:

state = C.duckdb_append_data_chunk(*a.appender, chunk)

taniabogatsch commented 6 months ago

I had a look at the failing test. The reason for the segmentation fault is that we do not set the validity mask correctly for the child vectors of the STRUCT vector. When appending the data, we assume the values are set and try to append a string with an undefined length and an invalid data pointer.

I guess this worked on macos but not on Ubuntu because Ubuntu probably default-initialised that bit to one (valid). In contrast, MacOS probably default-initialised that bit to zero (false)... Or some other bit-level shenanigans.

I have a fix ready, and I'm also working on some other changes. I'll open a PR to @maiadegraaf once I'm done.

I'll also get back to you about the BLOBs.

marcboeker commented 6 months ago

@taniabogatsch Thanks, sounds great. I'm curious to see, how you have fixed it.

taniabogatsch commented 6 months ago

I used the logical type to switch in the SetNull, but I believe that only works for top-level recursion depths. We either have to destroy all child logical types when initializing the types or when closing the Appender. So these types might not be accessible in deeper levels. A cleaner solution here is to revert to the original implementation of this PR, where we return the logical type and end up with exactly one logical type for the top-level chunk initialization. That way, we also don't keep an unused field in the colInfo struct, and we can instead set the DUCKDB_TYPE. I'll push a fix. 🤔

taniabogatsch commented 6 months ago

W.r.t. the BLOB type. DuckDB distinguishes between BLOB and UTINYINT[]. In DuckDB, a BLOB has the same representation as a VARCHAR. Both are a more elaborate char pointer (inlined vs. string heap). It is possible to store Go-string and Go-[]byte in C.CString, and pass the resulting C string to duckdb_vector_assign_string_element. I updated the BLOB implementation accordingly, and I also expanded on the test.

To the best of my knowledge, Go does not distinguish between []byte and []uint8. DuckDB distinguishes between BLOB and UTINYINT[]. @maiadegraaf raised this problematic here. In the current solution, we opt to support BLOB in the Appender, but not UTINYINT[]. I added the respective tests.

cc @marcboeker, what do you think? :)