meltano / sdk

Write 70% less code by using the SDK to build custom extractors and loaders that adhere to the Singer standard: https://sdk.meltano.com
https://sdk.meltano.com
Apache License 2.0
94 stars 67 forks source link

feat: Support Singer Decimal as a config item #1890

Closed s7clarke10 closed 1 day ago

s7clarke10 commented 1 year ago

Feature scope

Taps (catalog, state, stream maps, tests, etc.)

Description

There have been discussions before about whether it is a best practice to emit Decimal, Integer, and Float datatypes as a string or in their native formats.

I have mentioned this in passing in this JSON dicussion https://github.com/meltano/sdk/issues/1046 . The author of msgspec strongly recommends sending numeric data as a string.

There was an informal standard called Singer Decimal where the tap would emit the Numeric data as a string (and if possible) try and turn it back into a numeric format in the target.

You may ask, why would you do this? The key reasons are :

  1. Rounding issues with float datatypes
  2. Inability to handle large numbers or decimal places
  3. Python not handling the numeric data well
  4. The tap does not have enough information in the database schema to hint the correct format. For example in Oracle it is a common practice to specify a datatype of number without supplying a precision or scale. The only accurate way to determine the correct precision and scale in the target is to sample the data. In this situation the only reasonable way to send this data is as a string, and then use a transform tool to cast the data.

For a number of database related taps we have implement a setting called use_singer_decimal. When this is set to true, it will send the data as a string but provides additional information about the scale and precision (if available) in the singer messages. When the target receives the message, if it has this additional singer decimal data, it can cast the data with the correct datatype. The good thing about this approach is that it is a setting, therefore based on your use case, you can emit numeric data in its native numeric format (default setting), or preferable enable the use_singer_decimal setting to gain more accurate numeric data.

We have found by using this methodology, there is no loss of precision. It would be great to introduce this ability into the SDK as it will ensure accurate numeric data being sent.

Examples of enhanced database taps which support this are:

Example enhanced target with support for decoding singer-decimal data:

More detail for tap-db2 written by @mjsqu .

DECIMALs from DB2 are given format property "singer.decimal" in the schema, with type property "number". The numeric scale & precision are stored as schema[additionalProperties][scale_precision], e.g. "additionalProperties":{"scale_precision":"(15,9)"}.

When the tap writes the RECORD message, "singer.decimal" informs the data value to be output into the message as a string The snowflake target obtains the scale/precision from the schema message to create a column of type NUMBER(scale, precision).

edgarrmondragon commented 1 year ago

So, I think there's two parts to this AFAICT.

Support at the tap

This is arguably already support by the typing helpers via CustomType:

jsonschema = PropertiesList(
  Property("my_number", CustomType({"type": "string", "format": "singer.decimal"})),
)

A nicer wrapper could also be added.

Support at the target

to_sql_type could be adapted to map the singer.decimal format to the appropriate SQL column type.

s7clarke10 commented 1 year ago

Hi @edgarrmondragon,

To show an example of my usage of Singer Decimal. I have a working copying along with the usage of msgspec on the traditional singer-python frameworks used by some non-SDK taps. I am seeing some good performance from doing this. https://github.com/s7clarke10/pipelinewise-singer-python.

For testing this I have created a branch on pipelinewise-tap-oracle to use msgspec.

Of note is the use of a boolean config setting use_singer_decimal in the tap. As a user of a tap, I have a choice whether I wish to have decimal and float data emitted as a string - "use_singer_decimal": true (to avoid precision loss), or in their native formats for backwards compatibility. It seems to work well, and produces the same results when I was using orjson.

If that setting is omitted it will out default to outputting decimals and floats in their lossy native format - important for backwards compatibility.

I should note however, there is logic in a tap around the schema definition to define the correct JSON data types when it is discovered so that the records matching the schema.

Here is an example of output with use_singer_decimal set to true.

(pipelinewise-tap-oracle) [ pipelinewise-tap-oracle]$ tap-oracle -c oracle_config_test.json --catalog oracle_properties_test.json
time=2023-08-17 10:12:04 name=singer level=INFO message=Selected streams: ['STEVE-STEVE_TEST'] 
time=2023-08-17 10:12:04 name=singer level=INFO message=No currently_syncing found
time=2023-08-17 10:12:04 name=singer level=INFO message=Beginning sync of stream(STEVE-STEVE_TEST) with sync method(full)
time=2023-08-17 10:12:04 name=singer level=INFO message=Stream STEVE-STEVE_TEST is using full_table replication
time=2023-08-17 10:12:04 name=singer level=INFO message=Singer Decimal Enabled! Floats and Decimals will be output as strings
time=2023-08-17 10:12:04 name=singer level=INFO message=Selected streams: ['STEVE-STEVE_TEST'] 
time=2023-08-17 10:12:04 name=singer level=INFO message=No currently_syncing found
time=2023-08-17 10:12:04 name=singer level=INFO message=Beginning sync of stream(STEVE-STEVE_TEST) with sync method(full)
time=2023-08-17 10:12:04 name=singer level=INFO message=Stream STEVE-STEVE_TEST is using full_table replication
time=2023-08-17 10:12:04 name=singer level=INFO message=Singer Decimal Enabled! Floats and Decimals will be output as strings
{"type":"SCHEMA","stream":"STEVE-STEVE_TEST","schema":{"properties":{"COL1":{"maxLength":100,"type":["null","string"]},"COL2":{"maxLength":50,"type":["null","string"]},"DECIMAL_FIELD":{"format":"singer.decimal","type":["null","string"],"additionalProperties":{"scale_precision":"(10,2)"}},"INTEGER_FIELD":{"type":["null","integer"]},"FLOAT_FIELD":{"format":"singer.decimal","type":["null","string"]}},"type":"object"},"key_properties":[]}
time=2023-08-17 10:12:04 name=singer level=INFO message=dsn: (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=10.131.2.102)(PORT=1521))(CONNECT_DATA=(SERVICE_NAME=pndww2db.moh.govt.nz)))
{"type":"STATE","value":{"bookmarks":{"STEVE-STEVE_TEST":{"last_replication_method":"FULL_TABLE","version":1692223924745}},"currently_syncing":"STEVE-STEVE_TEST"}}
{"type":"ACTIVATE_VERSION","stream":"STEVE-STEVE_TEST","version":1692223924745}
time=2023-08-17 10:12:04 name=singer level=INFO message=select SELECT  "COL1" , "COL2" , "DECIMAL_FIELD" , "INTEGER_FIELD" , "FLOAT_FIELD" , NULL as ORA_ROWSCN
                                FROM STEVE.STEVE_TEST
{"type":"RECORD","stream":"STEVE-STEVE_TEST","record":{"COL1":"Decimal Test","COL2":"201.35","DECIMAL_FIELD":"201.35","INTEGER_FIELD":null,"FLOAT_FIELD":null},"version":1692223924745,"time_extracted":"2023-08-16T22:12:04.745338Z"}
{"type":"RECORD","stream":"STEVE-STEVE_TEST","record":{"COL1":"Integer Test","COL2":"438437439439374974943","DECIMAL_FIELD":null,"INTEGER_FIELD":438437439439374974943,"FLOAT_FIELD":null},"version":1692223924745,"time_extracted":"2023-08-16T22:12:04.745338Z"}
{"type":"RECORD","stream":"STEVE-STEVE_TEST","record":{"COL1":"Float Test","COL2":"3.14159265359","DECIMAL_FIELD":null,"INTEGER_FIELD":null,"FLOAT_FIELD":"3.14159265359"},"version":1692223924745,"time_extracted":"2023-08-16T22:12:04.745338Z"}
{"type":"RECORD","stream":"STEVE-STEVE_TEST","record":{"COL1":null,"COL2":"Hello","DECIMAL_FIELD":null,"INTEGER_FIELD":null,"FLOAT_FIELD":null},"version":1692223924745,"time_extracted":"2023-08-16T22:12:04.745338Z"}
{"type":"RECORD","stream":"STEVE-STEVE_TEST","record":{"COL1":"World","COL2":null,"DECIMAL_FIELD":null,"INTEGER_FIELD":null,"FLOAT_FIELD":null},"version":1692223924745,"time_extracted":"2023-08-16T22:12:04.745338Z"}
time=2023-08-17 10:12:04 name=singer level=INFO message=METRIC: b'{"type":"counter","metric":"record_count","value":5,"tags":{"schema":"STEVE","table":"STEVE_TEST"}}'
{"type":"ACTIVATE_VERSION","stream":"STEVE-STEVE_TEST","version":1692223924745}
{"type":"STATE","value":{"bookmarks":{"STEVE-STEVE_TEST":{"last_replication_method":"FULL_TABLE","version":1692223924745,"ORA_ROWSCN":null}},"currently_syncing":null}}
stale[bot] commented 3 weeks ago

This has been marked as stale because it is unassigned, and has not had recent activity. It will be closed after 21 days if no further activity occurs. If this should never go stale, please add the evergreen label, or request that it be added.