slingdata-io / sling-cli

Sling is a CLI tool that extracts data from a source storage/database and loads it in a target storage/database.
https://docs.slingdata.io
GNU General Public License v3.0

MSSQL null and UniqueIdentifier issues #304

Closed: joshuajohananorthos closed this issue 1 week ago

joshuajohananorthos commented 1 month ago

MSSQL null and UniqueIdentifier issues

Here is the README.md from that repo that fully explores the issue.

Steps to Reproduce

  1. Start SQL server:

    docker compose up --build

    This will create a new database called TestDB with a table called TestTable and insert 5 rows. Schema for TestTable:

    CREATE TABLE TestTable (
        ID UNIQUEIDENTIFIER PRIMARY KEY DEFAULT NEWID(),
        NullableString NVARCHAR(1024) NULL,
        NonNullableString NVARCHAR(1024) NOT NULL
    );

    Values inserted into TestTable:

    INSERT INTO TestTable (ID, NullableString, NonNullableString)
    VALUES 
    ('3F2504E0-4F89-11D3-9A0C-0305E82C3301', NULL, 'Non-Nullable String 1'),
    ('3F2504E0-4F89-11D3-9A0C-0305E82C3302', 'Nullable String 2', 'Non-Nullable String 2'),
    ('3F2504E0-4F89-11D3-9A0C-0305E82C3303', 'Nullable String 3', 'Non-Nullable String 3'),
    ('3F2504E0-4F89-11D3-9A0C-0305E82C3304', NULL, 'Non-Nullable String 4'),
    ('3F2504E0-4F89-11D3-9A0C-0305E82C3305', 'Nullable String 5', 'Non-Nullable String 5');
  2. Run test.sh, which will create a connection to the database and run sling with 3 different outputs: json, csv, and parquet.

  3. Run python view_parquet.py to view the contents of the parquet file.

Observations

Here is the output of python view_parquet.py:

num_row_groups: 1
metadata: <pyarrow._parquet.FileMetaData object at 0x7466687672c0>
  created_by: github.com/parquet-go/parquet-go version 0.20.0(build )
  num_columns: 3
  num_rows: 5
  num_row_groups: 1
  format_version: 1.0
  serialized_size: 485
compression: SNAPPY
<pyarrow._parquet.ParquetSchema object at 0x74666d66ca40>
required group field_id=-1  {
  required binary field_id=-1 ID (String);
  required binary field_id=-1 NullableString (String);
  required binary field_id=-1 NonNullableString (String);
}

pyarrow.Table
ID: string not null
NullableString: string not null
NonNullableString: string not null
----
ID: [["e004253f-894f-d311-9a0c-0305e82c3301","e004253f-894f-d311-9a0c-0305e82c3302","e004253f-894f-d311-9a0c-0305e82c3303","e004253f-894f-d311-9a0c-0305e82c3304","e004253f-894f-d311-9a0c-0305e82c3305"]]
NullableString: [["","Nullable String 2","Nullable String 3","","Nullable String 5"]]
NonNullableString: [["Non-Nullable String 1","Non-Nullable String 2","Non-Nullable String 3","Non-Nullable String 4","Non-Nullable String 5"]]

Versus the jsonl output:

{"id":"e004253f-894f-d311-9a0c-0305e82c3301","nonnullablestring":"Non-Nullable String 1","nullablestring":null}
{"id":"e004253f-894f-d311-9a0c-0305e82c3302","nonnullablestring":"Non-Nullable String 2","nullablestring":"Nullable String 2"}
{"id":"e004253f-894f-d311-9a0c-0305e82c3303","nonnullablestring":"Non-Nullable String 3","nullablestring":"Nullable String 3"}
{"id":"e004253f-894f-d311-9a0c-0305e82c3304","nonnullablestring":"Non-Nullable String 4","nullablestring":null}
{"id":"e004253f-894f-d311-9a0c-0305e82c3305","nonnullablestring":"Non-Nullable String 5","nullablestring":"Nullable String 5"}
For reference, the byte layout of a Microsoft GUID:

| Bits | Bytes | Name  | Endianness |
|------|-------|-------|------------|
| 32   | 4     | Data1 | Native     |
| 16   | 2     | Data2 | Native     |
| 16   | 2     | Data3 | Native     |
| 64   | 8     | Data4 | Big        |

So if SQL Server is running on Intel, the first 8 bytes are stored little-endian. I am assuming this is the source of the issue.

| Section | SQL Server        | Parsed UUID       |
|---------|-------------------|-------------------|
| Data1   | 3F2504E0          | E004253F          |
| Data2   | 4F89              | 894F              |
| Data3   | 11D3              | D311              |
| Data4   | 9A0C-0305E82C3301 | 9A0C-0305E82C3301 |
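
The byte swap above can be sketched in Go. This is a minimal illustration of the approach used by github.com/tentone/mssql-uuid (the helper name `msUUIDToString` is hypothetical, not from the sling codebase): the first three fields are byte-reversed, while Data4 is copied as-is.

```go
package main

import (
	"encoding/hex"
	"fmt"
)

// msUUIDToString converts the 16 raw bytes returned by SQL Server for a
// UNIQUEIDENTIFIER into canonical string form. SQL Server stores Data1,
// Data2, and Data3 in little-endian order, so those fields must be
// byte-reversed before formatting; the last 8 bytes (Data4) are already
// big-endian.
func msUUIDToString(b []byte) string {
	if len(b) != 16 {
		return ""
	}
	return fmt.Sprintf("%02x%02x%02x%02x-%02x%02x-%02x%02x-%s-%s",
		b[3], b[2], b[1], b[0], // Data1: reverse 4 bytes
		b[5], b[4], // Data2: reverse 2 bytes
		b[7], b[6], // Data3: reverse 2 bytes
		hex.EncodeToString(b[8:10]),  // Data4 high: as-is
		hex.EncodeToString(b[10:16])) // Data4 low: as-is
}

func main() {
	// Raw bytes as SQL Server returns them for
	// '3F2504E0-4F89-11D3-9A0C-0305E82C3301'
	raw := []byte{0xE0, 0x04, 0x25, 0x3F, 0x89, 0x4F, 0xD3, 0x11,
		0x9A, 0x0C, 0x03, 0x05, 0xE8, 0x2C, 0x33, 0x01}
	fmt.Println(msUUIDToString(raw)) // 3f2504e0-4f89-11d3-9a0c-0305e82c3301
}
```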
flarco commented 1 month ago

Thanks for all the details, I am able to reproduce. This Intel-specific behavior is quite strange to me. I even tried the native decoder from the Microsoft lib here: https://github.com/denisenkom/go-mssqldb/blob/master/uniqueidentifier.go, and it returns the same thing Sling returns, just in uppercase.

Is it possible to check via a query whether SQL Server is running on Intel? If you could find that, it would be useful; I could put a check and logic in sling to reverse the bytes (using https://github.com/tentone/mssql-uuid).


For the Parquet missing-nulls issue, I'm able to reproduce as well. Looking in the 3rd-party lib's code base, I'm not able to find how to specify a nullable column; whatever nil value is passed gets converted to "". Taking a closer look at https://github.com/apache/arrow/tree/main/go/parquet

joshuajohananorthos commented 1 month ago

If you want, I can see if I can put together something to fix this. I have not written Go before, but I am a developer, so I think I could do something.

I would like to approach it like this:

  1. Create a test that uses Sling to put the data into an MS SQL Server, then pull it back out as a UniqueIdentifier and as a parquet file, and check that it works as expected.
  2. Find or create a parse function that outputs the UUIDs correctly. I am thinking of naming it something like parse_ms_uuid so that it can be configured without having to auto-detect the type.

@flarco What do you think?

flarco commented 4 weeks ago

Thanks for the offer.

For the MSSQL part, that sounds good. Making it optional is the way to go. The link you shared helped with this insight:

*(screenshot from the linked repo explaining the GUID byte order)*

The form that sling uses is the portable binary GUID format, which can be mapped back to the native form of the GUID via post-processing. We also need to maintain consistency for existing pipelines, so an additional transform called parse_ms_uuid (or similar) can produce the native form of the GUID. It will need to be a manually specified transform, and it will override the default parse_uuid transform that runs today.

If you'd like to write a go test / function to do this, that will be great. You can use https://github.com/tentone/mssql-uuid as a starting point and just create a go test.

For the Parquet null part, it seems the way to go would be to use the Arrow format and write it to a file. I was unable to find a way to write nulls; the libs sling uses just convert nil to an empty string, not sure why. ChatGPT cooked up the code below to create nulls; I'll need to refactor some things to try this.

package main

import (
    "log"
    "os"

    "github.com/apache/arrow/go/v12/parquet"
    "github.com/apache/arrow/go/v12/parquet/schema"
)

func main() {
    // Build the field list. Nullability in Parquet is expressed through
    // the repetition type: REQUIRED = non-nullable, OPTIONAL = nullable.
    fields := schema.FieldList{
        // Non-nullable field
        schema.NewInt32Node("id", parquet.Repetitions.Required, -1),
        // Nullable field
        schema.NewByteArrayNode("name", parquet.Repetitions.Optional, -1),
    }

    // Create a GroupNode representing the root of the schema
    schemaNode, err := schema.NewGroupNode("example_struct", parquet.Repetitions.Required, fields, -1)
    if err != nil {
        log.Fatal(err)
    }

    // Print the schema
    schema.PrintSchema(schemaNode, os.Stdout, 2)
}
joshuajohananorthos commented 3 weeks ago

I created a fork and started on this and I had a few questions before I opened a PR.

To get everything to build, I ran scripts/build.sh. That was failing because it could not find github.com/slingdata-io/sling. I checked the code and it did not seem to be referenced anywhere, so I removed it from go.mod and was able to build at that point. My question here is: did I miss anything? I looked into the GitHub Actions YAML files to see how the pipeline builds, and they seem to follow the same process around go mod, and they build.

I also tried to run scripts/test.sh, and it fails in the sling CLI tests. It looks like I need to set up a Postgres and an Oracle DB to run those tests.

I want to make sure what I am writing builds and passes tests.

flarco commented 3 weeks ago

Hi, don't worry about passing scripts/test.sh. I have a private server with many databases running for tests. If you could write your new test in core/dbio/database/database_test.go and run it individually with go test -run TestMsUniqueIdentifier, that should do it. You wouldn't need a database connection; just provide test values manually via a test case, see here for an example.

joshuajohananorthos commented 2 weeks ago

I am going to close this, as I have created the PR to fix the UUID issue. Did you want me to take a look at the nulls in parquet? If so, I will open a different issue for that.

flarco commented 2 weeks ago

@joshuajohananorthos sure, if you're up for it, give it a shot! I'm pretty busy these days. Opening a new PR and closing this one sounds good.

This one might have more thorns. You'll have to build the Arrow dataset and dump it as a parquet file (as I mentioned here). My concern is the amount of memory this will take, so please check for that. I'm thinking you'll have to save a parquet row group once it reaches a certain number of rows/bytes. The main file to change is parquet_arrow.go, and you can write your tests in parquet_arrow_test.go.

Also, ATM datastream.go use

You can test a large file with this test: https://github.com/slingdata-io/sling-cli/blob/58465e1318065774907b4119332be8dbbfee33d3/core/dbio/filesys/fs_test.go#L241 Just replace dataset1M.csv with some large csv.

Let me know if you have any questions. You'll have some decent Go exposure by the end of this :).

flarco commented 2 weeks ago

One more thing, just some feedback, I didn't understand this line: https://github.com/slingdata-io/sling-cli/pull/317/files#diff-431440da462dd7757d28523fd0b174a57ab4e8465aa360d4dcb8567e830ea30aR15

I replaced it with this line: https://github.com/slingdata-io/sling-cli/pull/318/commits/324076d655a0bc14a88a1fda98512b688a07ea67#diff-82c2af3c855b6392e2e293936875cad971581279add73c6c840b0727d78e5ed2R14

joshuajohananorthos commented 2 weeks ago

Thanks for pointing me towards what to change. I had done some initial research in the codebase, but I will see if I can put this together.

In answer to your other question, a couple of things:

  1. When I was debugging, the type passed into the parse function was []uint8 which is why I used it in the test. I was trying to keep the test as close to how it was running live against a real database as I could.
  2. I forget exactly why, but the cast was causing problems with the test. I remember it failing with a cast to string, but I could be wrong.
joshuajohananorthos commented 1 week ago

Closing this issue, as #327 will tackle the parquet problem.

joshuajohananorthos commented 1 day ago

I see there is a branch for 1.2.12. If you release that version, I could use it today to query UUIDs out of MS SQL Server.

flarco commented 1 day ago

Yes, planning to release this weekend, just wrapping up some things.