Closed joshuajohananorthos closed 1 week ago
Thanks for all the details, I am able to reproduce. This Intel-specific behavior is quite strange to me. I even tried to use the native decoder from the Microsoft lib here: https://github.com/denisenkom/go-mssqldb/blob/master/uniqueidentifier.go and it returns the same thing Sling returns, but in uppercase.
Is it possible to check whether SQL Server is running on Intel via a query? If you could find that, it would be useful: I could put a check and logic in sling to have the bytes reversed (using https://github.com/tentone/mssql-uuid).
For the Parquet missing-nulls issue, I'm able to reproduce as well. Looking in the 3rd-party lib code base, I'm not able to find how to specify a nullable column. Whatever `nil` value is passed gets converted to `""`.
Taking a closer look: https://github.com/apache/arrow/tree/main/go/parquet
If you want I can see if I can put together something to fix this. I have not written Go before, but I am a developer so I think I could do something.
I would like to approach it like this: a `parse_ms_uuid` transform, that way it can be configured without having to auto-determine the type. @flarco What do you think?
Thanks for the offer.
For the MSSQL part, that sounds good. Making it optional is the way to go. The link you shared helped with this insight: the form that sling uses is the portable binary GUID format, which can be mapped back to the native form of the GUID via post-processing. We also need to maintain consistency for existing pipelines. So having an additional transform called `parse_ms_uuid` or something similar can produce the native form of the GUID. That will need to be a manually specified transform (and will override the default `parse_uuid` transform that runs today).
If you'd like to write a Go test / function to do this, that would be great. You can use https://github.com/tentone/mssql-uuid as a starting point and just create a Go test.
For the Parquet null part, it seems the way to go would be to use the Arrow format and write it to a file. I was unable to find a way to write nulls; it seems the libs sling uses just convert them to empty strings, not sure why. ChatGPT cooked up the code below to create nulls; I'll need to refactor it some to try this.
```go
package main

import (
	"fmt"
	"log"

	"github.com/apache/arrow/go/v12/parquet"
	"github.com/apache/arrow/go/v12/parquet/schema"
)

func main() {
	// Non-nullable field: REQUIRED repetition
	idNode, err := schema.NewPrimitiveNode("id",
		parquet.Repetitions.Required, parquet.Types.Int32, -1, -1)
	if err != nil {
		log.Fatal(err)
	}

	// Nullable field: OPTIONAL repetition is what allows nulls
	nameNode, err := schema.NewPrimitiveNode("name",
		parquet.Repetitions.Optional, parquet.Types.ByteArray, -1, -1)
	if err != nil {
		log.Fatal(err)
	}

	// Create a GroupNode (the root of the schema) holding both fields
	root, err := schema.NewGroupNode("example_struct",
		parquet.Repetitions.Required, schema.FieldList{idNode, nameNode}, -1)
	if err != nil {
		log.Fatal(err)
	}

	// Build the schema descriptor and print it; the descriptor is what
	// gets passed to a parquet file writer when actually writing data
	sc := schema.NewSchema(root)
	fmt.Println(sc)
}
```
I created a fork and started on this, and I had a few questions before I opened a PR.
To get everything to build I ran `scripts/build.sh`. That was failing because it could not find `github.com/slingdata-io/sling`. I checked the code and it did not seem to be referenced anywhere, so I removed it from `go.mod` and was able to build at that point. My question here is: did I miss anything? I looked into the GitHub Actions YAML files to see how the pipeline builds, and it seems like they follow the same process around `go mod`, and they build.
I also tried to run `scripts/test.sh` and it fails on the sling CLI tests. It looks like I need to set up a Postgres and an Oracle DB to run those tests.
I want to make sure what I am writing builds and passes tests.
Hi, don't worry about passing `scripts/test.sh`. I have a private server with many databases running for tests.
If you could write your new test in `core/dbio/database/database_test.go` and run it individually with `go test -run TestMsUniqueIdentifier`, that should do it. You wouldn't need a database connection; just provide test values manually via a test case, see here for an example.
I am going to close this, as I have created the PR to fix the UUID issue. Did you want me to take a look at the nulls in parquet? If so, I will open a different issue for that.
@joshuajohananorthos sure, if you're up for it, give it a shot! I'm pretty busy these days. Opening a new PR and closing this one sounds good.
This one might have more thorns. You'll have to build the Arrow dataset and dump it as a parquet file (as I mentioned here). My concern is the amount of memory this will take, so please check for that. I'm thinking you'll have to save a parquet row group once it reaches a certain number of rows / bytes. The main file to change is parquet_arrow.go, and you can write your tests in parquet_arrow_test.go.
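The row-group idea above can be sketched as a small buffer that flushes whenever either threshold is crossed. Everything below is illustrative (the type, field names, and `flush` hook are mine, not sling's API); in the real code the flush would hand the batch to the Arrow/Parquet writer as one row group, which is what keeps memory bounded:

```go
package main

import "fmt"

// rowGroupBuffer accumulates rows and flushes them as one batch once
// either the row-count or byte-size limit is hit.
type rowGroupBuffer struct {
	rows     [][]any
	bytes    int
	maxRows  int
	maxBytes int
	flush    func(rows [][]any) // stand-in for writing a Parquet row group
}

func (b *rowGroupBuffer) add(row []any, size int) {
	b.rows = append(b.rows, row)
	b.bytes += size
	if len(b.rows) >= b.maxRows || b.bytes >= b.maxBytes {
		b.flushNow()
	}
}

func (b *rowGroupBuffer) flushNow() {
	if len(b.rows) == 0 {
		return
	}
	b.flush(b.rows)
	b.rows, b.bytes = nil, 0 // release the batch, bounding memory use
}

func main() {
	groups := 0
	buf := &rowGroupBuffer{
		maxRows:  2,
		maxBytes: 1 << 20,
		flush: func(rows [][]any) {
			groups++
			fmt.Printf("row group %d: %d rows\n", groups, len(rows))
		},
	}
	for i := 0; i < 5; i++ {
		buf.add([]any{i, fmt.Sprintf("row-%d", i)}, 16)
	}
	buf.flushNow() // flush the final partial group
	// five rows with maxRows=2 produce three groups: 2, 2, and 1 rows
}
```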
Also, ATM `datastream.go` uses …
You can test a large file with this test: https://github.com/slingdata-io/sling-cli/blob/58465e1318065774907b4119332be8dbbfee33d3/core/dbio/filesys/fs_test.go#L241
Just replace `dataset1M.csv` with some large CSV.
Let me know if you have any questions. You'll have some decent Go exposure by the end of this :).
One more thing, just some feedback, I didn't understand this line: https://github.com/slingdata-io/sling-cli/pull/317/files#diff-431440da462dd7757d28523fd0b174a57ab4e8465aa360d4dcb8567e830ea30aR15
I replaced it with this line: https://github.com/slingdata-io/sling-cli/pull/318/commits/324076d655a0bc14a88a1fda98512b688a07ea67#diff-82c2af3c855b6392e2e293936875cad971581279add73c6c840b0727d78e5ed2R14
Thanks for pointing me towards what to change. I had done some initial research in the codebase, but I will see if I can put this together.
In answer to your other question, a couple of things: the value comes through as `[]uint8`, which is why I used it in the test. I was trying to keep the test as close to how it runs live against a real database as I could. Closing this issue, as #327 will tackle the parquet problem.
I see there is a branch for 1.2.12. If you release that version, I could use it today to query UUIDs out of MS SQL Server.
Yes, planning to release this weekend, just wrapping up some things.
MSSQL null and UniqueIdentifier issues
Description of the issue: Null values are not being preserved from MSSQL to parquet, and UUIDs do not match the values in the database.
Sling version (`sling --version`): 1.2.10
Operating System (`linux`, `mac`, `windows`): linux
Reproduction
I have set up a repo that completely replicates the issue with a MSSQL server running in docker: https://github.com/joshuajohananorthos/slingdata-sqlserver
Here is the `README.md` from that repo that fully explores the issue.
Steps to Reproduce
Start SQL server:
This will create a new database called `TestDB` with a table called `TestTable` and insert 5 rows. Schema for `TestTable`:
Values inserted into `TestTable`:
Run `test.sh`, which will create a connection to the database and run `sling` with 3 different outputs: `json`, `csv`, and `parquet`.
Run `python view_parquet.py` to view the contents of the `parquet` file.
Observations
`json` and `csv` output the `NULL` values correctly, while `parquet` outputs the `NULL` values as empty strings.
Here is the output of `python view_parquet.py`:
Versus the `jsonl` output:
So if SQL Server is running on Intel, the first 8 bytes are stored as little-endian. I am assuming this is the source of the issue.