meltano / sdk

Write 70% less code by using the SDK to build custom extractors and loaders that adhere to the Singer standard: https://sdk.meltano.com
Apache License 2.0

[Feature]: Faster JSON dumps in `format_message` with an alternative JSON SerDe library #1046

Open s7clarke10 opened 2 years ago

s7clarke10 commented 2 years ago

Feature scope

Taps (catalog, state, stream maps, etc.)

Description

Speed of writing messages can add up when you are dumping a large amount of data. One of the things in the original pipelinewise-singer-python was the use of orjson for doing json dumps as it is a lot faster.

The line in question is this one.

https://github.com/meltano/sdk/blob/796ccda470529253b5211b70742c814d2b20ec07/singer_sdk/_singerlib/messages.py#L165

We have a variant of pipelinewise-singer-python which uses orjson in place of json.dumps for improved speed. The code for that section is as follows.

import orjson
import decimal

...

def format_message(message, option=0):
    def default(obj):
        if isinstance(obj, decimal.Decimal):
            return str(obj)
        raise TypeError

    return orjson.dumps(message.asdict(), option=option, default=default)

Sorry the formatting is all out; I couldn't work out how to get the code formatting working. Best to look at the example here: https://github.com/mjsqu/pipelinewise-singer-python/blob/2a839d245b92d1ec6d5801a30f5280fa77b22d0c/singer/messages.py#L295

I note that while orjson is faster, it is not perfect; for example, we need to convert decimal data to a string. There are some other considerations mentioned with the module, but if they are not an issue, we have measured a considerable performance gain using this and verified it with cProfile by looking at the time consumed.
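
The need for a default handler can be illustrated with the stdlib json module alone; this is a minimal sketch (the variable names are not from the SDK), showing the TypeError and the stringifying workaround described above:

```python
import decimal
import json

value = decimal.Decimal("1.23")

# the stdlib encoder rejects Decimal outright
try:
    json.dumps(value)
    raised = False
except TypeError:
    raised = True

# so a handler is needed; here stringifying, as in the snippet above
as_string = json.dumps(value, default=str)
```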

edgarrmondragon commented 2 years ago

One of the things in the original pipelinewise-singer-python was the use of orjson for doing json dumps as it is a lot faster.

That's right, although the SDK never used that since it depended on an older version of pipelinewise-singer-python (see https://github.com/meltano/sdk/pull/738) and there was a bug in orjson that was a deal-breaker for us (see https://github.com/ijl/orjson/issues/251). Seems to have been fixed, but pipelinewise-singer-python is stuck with an older version of orjson πŸ€·β€β™‚οΈ.

But, after #979, we can perhaps explore using orjson directly.


Also, there are multiple ways to tackle performance in Singer pipelines, and we've already received feedback about improvements using BATCH messages, where SerDe of Singer messages is not a bottleneck.

s7clarke10 commented 2 years ago

That's great feedback @edgarrmondragon, also thanks for taking the time to provide the back story re orjson.

mjsqu commented 1 year ago

This has been updated in my fork of pipelinewise-singer-python:

Method is BatchMessage.format_message

def format_message(message, option=0):
    def default(obj):
        if isinstance(obj, decimal.Decimal):
            return int(obj) if float(obj).is_integer() else float(obj)
        raise TypeError

    return orjson.dumps(message.asdict(), option=option, default=default)

https://github.com/mjsqu/pipelinewise-singer-python/commit/e31c16895c8900dad0bd96b98c43f9a5c6137905

mjsqu commented 1 year ago

Testing a half-million (500,000) row table using a branch in my tap-db2 variant (a rewrite using the Meltano SDK), the cProfile results show significant time devoted to writing record messages:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    948/1    0.023    0.000  215.936  215.936 {built-in method builtins.exec}
        1    0.000    0.000  215.936  215.936 cprofile_db2_sdk.py:2(<module>)
        1    0.000    0.000  215.254  215.254 core.py:1128(__call__)
        1    0.000    0.000  215.254  215.254 core.py:987(main)
        1    0.000    0.000  215.253  215.253 core.py:1393(invoke)
        1    0.000    0.000  215.253  215.253 core.py:709(invoke)
        1    0.000    0.000  215.253  215.253 tap_base.py:442(cli)
        1    0.001    0.001  214.917  214.917 tap_base.py:402(sync_all)
        1    0.312    0.312  214.271  214.271 core.py:1159(sync)
   532966    3.020    0.000  213.958    0.000 core.py:1041(_sync_records)
   532965    1.301    0.000  189.948    0.000 core.py:856(_write_record_message)
  1065930    4.345    0.000  165.491    0.000 core.py:823(_generate_record_messages)

The speed of the same table in seconds for Meltano SDK vs tap-db2 1.0.4:

SDK   - 126.02 seconds
1.0.4 -  39.40 seconds

@edgarrmondragon - orjson could be a huge benefit when it comes to database taps, especially as they usually have large volumes, run significant numbers of streams, and make many core._write_record_message calls.
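
A profile like the one above can be reproduced with the stdlib alone; this sketch profiles a hypothetical stand-in record loop rather than tap-db2 itself:

```python
import cProfile
import io
import json
import pstats

def sync_records(n: int) -> None:
    # stand-in for a tap's record loop: serialize one message per row
    for i in range(n):
        json.dumps({"type": "RECORD", "stream": "users", "record": {"id": i}})

profiler = cProfile.Profile()
profiler.enable()
sync_records(10_000)
profiler.disable()

# sort by cumulative time, as in the table above, to see where time goes
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats()
```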

WillDaSilva commented 1 year ago

While we're considering alternatives for JSON reading/writing in Python, I'll call out msgspec.

Its benchmarks show it being significantly more performant than orjson. It's also a smaller library.

I haven't personally used it, but it (along with others) is worth considering before we choose a third-party library to handle JSON writing in the SDK.

edgarrmondragon commented 1 year ago

While we're considering alternatives for JSON reading/writing in Python, I'll call out msgspec.

Its benchmarks show it being significantly more performant than orjson. It's also a smaller library.

I haven't personally used it, but it (along with others) are worth considering before we choose to use a third-party library to handle JSON writing in the SDK.

Latest msgspec only seems to support Python 3.8+, so if we decide for it then it'd be good to follow https://github.com/meltano/sdk/issues/1659 with this.

orjson also supports a couple more platforms, but they are probably not a big deal, and msgspec is explicitly skipping wheels for those.

mjsqu commented 1 year ago

Having updated the sdk in my own fork to use orjson: https://github.com/mjsqu/sdk/tree/feature/orjson_write_messages/singer_sdk

I can see some speed benefits; however, the simplejson module has an advantage in terms of serialization of the decimal.Decimal type: it simply outputs the full value of the decimal.Decimal to the JSON, without wrapping it in quotes:

simplejson

simplejson has an option called use_decimal=True which when used, outputs the full value of the number:

>>> simplejson.dumps(decimal.Decimal("1111111111111.333333333333333333"),use_decimal=True)
'1111111111111.333333333333333333'

Digging down into the module code, this is because the core encoder.py uses str() only when writing out the serialized object, not before (so the decimal.Decimal remains as that type for longer; beginning to confuse myself here tbh 😕):

https://github.com/simplejson/simplejson/blob/master/simplejson/encoder.py#L521-L522

            elif _use_decimal and isinstance(value, Decimal):
                yield buf + str(value)

And str on a decimal.Decimal keeps all the precision:

>>> str(decimal.Decimal("1111111111111.333333333333333333"))
'1111111111111.333333333333333333'

orjson

Without handler

orjson does not handle decimal.Decimal types natively:

>>> orjson.dumps(decimal.Decimal("1111111111111.333333333333333333"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: Type is not JSON serializable: decimal.Decimal

With handler

Implement a default function, as described in the orjson README.md

>>> import orjson,decimal
>>> def default(obj):
...     if isinstance(obj, decimal.Decimal):
...         return str(obj)
...     raise TypeError
...
>>> orjson.dumps(decimal.Decimal("1111111111111.333333333333333333"),default=default)
b'"1111111111111.333333333333333333"'

The result is wrapped in quotes and would end up in JSON as:

"record":{"decm_scale": "1111111111111.333333333333333333"}

At this point, we could try to use float to handle decimal.Decimal, however precision is lost:

>>> import orjson,decimal
>>> def default(obj):
...     if isinstance(obj, decimal.Decimal):
...         return float(obj)
...     raise TypeError
...
>>> orjson.dumps(decimal.Decimal("1111111111111.333333333333333333"),default=default)
b'1111111111111.3333'

msgspec

msgspec appears to handle decimal.Decimal out of the box; however, it still serializes it as a string:

>>> msgspec.json.encode(decimal.Decimal("1111111111111.333333333333333333"))
b'"1111111111111.333333333333333333"'

msgspec allows the use of handlers but, again, you are left with the choice of Python native types, which either stringify or lose precision.

Conclusion

Using msgspec or orjson it isn't possible to exactly mirror the results of simplejson in terms of the RECORD messages:

module             resulting RECORD key:value
simplejson         {"decm_scale": 1111111111111.333333333333333333}
orjson or msgspec  {"decm_scale": "1111111111111.333333333333333333"}

Further reading on SO: https://stackoverflow.com/questions/1960516/python-json-serialize-a-decimal-object

@edgarrmondragon @WillDaSilva - I guess the questions for these two improved modules are:

  1. Can they handle bigger decimal types in some way I'm not seeing?
  2. How important is it that the values come through as numeric values in the JSON messages? I know we have heard from other users in the community that this sort of thing can fail JSONSchema validation.
edgarrmondragon commented 1 year ago

@mjsqu thanks for that amazing deep dive!

I think we're gonna have to live with the loss of precision πŸ˜•. Probably very few taps are de-serializing number values in API responses to decimal.Decimal, or at least we don't do that in the SDK. We're de-serializing as float and thus serializing to valid JSON number values, so at least schema validation has not been a concern.

In order not to break taps that do rely on use_decimal=True serialization, we'd need to conform decimal types to float as part of _conform_primitive_property: https://github.com/meltano/sdk/blob/dfcec5fb4f26329f6f6963626b51003296240e0d/singer_sdk/helpers/_typing.py#L469

With that in mind, we'd sacrifice numeric precision for a minority of taps for a gain in performance.
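
A minimal sketch of that conforming step (the function name here is hypothetical; the real change would live in _conform_primitive_property):

```python
import decimal
import typing as t

def conform_decimal(value: t.Any) -> t.Any:
    # coerce Decimal to float ahead of serialization, accepting the
    # precision loss discussed above for the minority of affected taps
    if isinstance(value, decimal.Decimal):
        return float(value)
    return value
```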

Also, I really like msgspec's union de-serialization and it's close to what I tried to go for in a Rust implementation of the Singer message spec:

import typing as t
import msgspec

class Message(msgspec.Struct, tag_field="type", tag=str.upper):
    pass

class Record(Message):
    stream: str
    record: dict[str, t.Any]

class Schema(Message):
    stream: str
    schema: dict[str, t.Any]

msg = Record("users", {"id": 1})
buf = msgspec.json.encode(msg)
roundtrip = msgspec.json.decode(buf, type=t.Union[Record, Schema])

assert isinstance(roundtrip, Record)
BuzzCutNorman commented 1 year ago

@edgarrmondragon I ran across rapidjson. It isn't as fast as msgspec; it is closer to ujson. The neat thing is it handles decimal.Decimal just like simplejson.

rapidjson

>>> import decimal
>>> import rapidjson
>>> 
>>> KEY = "some_stuff"
>>> VALUE= decimal.Decimal("1111111111111.333333333333333333")
>>> 
>>> record={KEY: VALUE}
>>> print(f"{KEY}: {VALUE} {type(VALUE)}\n")
some_stuff: 1111111111111.333333333333333333 <class 'decimal.Decimal'>

>>> fast_and_precise = rapidjson.NM_NATIVE | rapidjson.NM_DECIMAL | rapidjson.NM_NAN
>>> iso_datetime = rapidjson.DM_ISO8601
>>> rd = rapidjson.dumps(record, number_mode=fast_and_precise, datetime_mode=iso_datetime)
>>> print(f"{rd}\n")
{"some_stuff":1111111111111.333333333333333333}

>>>

You can see rapidjson performance numbers here.

https://python-rapidjson.readthedocs.io/en/latest/benchmarks.html

https://github.com/ijl/orjson#performance

s7clarke10 commented 1 year ago

I would really like to advocate for accurate movement of decimal data. We can't assume we understand the usage of the data, or whether the accuracy of the scale or precision is important. For the sake of trust in the overall Meltano ecosystem of taps and targets, I would advocate accuracy over speed.

It is for this reason that we are using the Singer.Decimal notation, to ensure we don't get rounding in our decimal and float datatypes.

If rapidjson is a mainstream library going forward and it supports decimals correctly, unlike orjson, I would advocate picking that. Alternatively, I wonder if the maintainers of orjson would be open to a PR to fully support decimal data?

s7clarke10 commented 1 year ago

Oh, the other thought is I wonder if msgspec would be open to supporting the Decimal Data without the double quotes?

There seems to be some discussion about this as an Issue on the msgspec repo. https://github.com/jcrist/msgspec/issues/349

BuzzCutNorman commented 1 year ago

If rapidjson is a mainstream library going forward and it supports decimals correctly, unlike orjson, I would advocate picking that.

I really like how python-rapidjson works, but the base rapidjson repo that it wraps hasn't seen a version change since 2016, even though there are recent commits. That makes me feel uneasy. I worry the lack of versioning makes pinning to a particular version in poetry useless, and could lead to disaster if a change to rapidjson introduces an issue for SDK taps and targets.

For the sake of trust in the overall Meltano ecosystem of taps and targets, I would advocate accuracy over speed.

I agree with this 100 percent.

Alternatively I wonder if the maintainers of orjson would be open to a PR to fully support decimal data?

There have been issues raised about adding decimals to orjson. Here is a link I found.

https://github.com/ijl/orjson/issues/172: Feature request for dumps decimal

It looks like there was a PR, but it was not merged because no clear decision was made on whether to serialize as a string or as a C or Rust decimal.

Oh, the other thought is I wonder if msgspec would be open to supporting the Decimal Data without the double quotes?

I am having a hard time understanding how to use msgspec in the SDK. It wants everything to be in a struct that gives it Python type hints for all the elements. It looks like this is needed for both encoding and decoding, which to me means it is not compatible with jsonschema and seems to act more like a replacement.
https://jcristharif.com/msgspec/supported-types.html#decimal

>>> import decimal

>>> x = decimal.Decimal("1.2345")

>>> msg = msgspec.json.encode(x)

>>> msg
b'"1.2345"'

>>> msgspec.json.decode(msg, type=decimal.Decimal)
Decimal('1.2345')

>>> msgspec.json.decode(b'"oops"', type=decimal.Decimal)
Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
msgspec.ValidationError: Invalid decimal string

I am still reading and playing with msgspec and my thoughts on this might change as I learn more.

jcrist commented 1 year ago

Hi, msgspec author here, following the link back here from https://github.com/jcrist/msgspec/issues/440.

It wants everything to be in a struct that gives it Python type hints for all the elements. It looks like this is needed for both encoding and decoding, which to me means it is not compatible with jsonschema and seems to act more like a replacement.

This isn't correct. If no types are specified (the default), msgspec.json.decode works like any other JSON library in python, returning a mix of dict/list/int/float/str/bool/None objects:

In [1]: import msgspec

In [2]: msgspec.json.decode('{"hello": 1, "world": 2}')  # works the same as any other json.loads
Out[2]: {'hello': 1, 'world': 2}

In [3]: msgspec.json.encode(["no", "need", "to", "use", "structs"])  # works the same as any other json.dumps
Out[3]: b'["no","need","to","use","structs"]'

There's no need to use structs for typed decoding if you don't want to.

The highest performance on the decoding side will come if you use msgspec for both decoding & validation, but you can also use msgspec just for serialization, same as you would any other json library in python.

If you have any suggestions for how we could make that clearer in our docs, I'd appreciate them. This is covered both on the main page and in our usage docs, but since there was a misunderstanding here clearly we can do better :).

I can see some speed benefits; however, the simplejson module has an advantage in terms of serialization of the decimal.Decimal type: it simply outputs the full value of the decimal.Decimal to the JSON, without wrapping it in quotes

I'm not familiar with meltano and the needs of this project, but I'll note that most JSON libraries (in any language) don't support access to the raw serialized string value when decoding JSON numbers, which means few libraries will be able to decode a decimal value serialized as a number instead of a string without losing precision. If cross-language compatibility and avoiding precision loss is important for this project, I'd recommend considering moving to serializing decimal values as strings if possible. See also my response here: https://github.com/jcrist/msgspec/issues/440#issuecomment-1588066782.
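
The trade-off can be demonstrated with the stdlib json module: a decimal serialized as a string survives any decoder, while the number form needs parse_float support to avoid rounding (a sketch using the example value from earlier in this thread):

```python
import decimal
import json

# the same value encoded two ways
as_string = '{"decm_scale": "1111111111111.333333333333333333"}'
as_number = '{"decm_scale": 1111111111111.333333333333333333}'

# string form: any JSON parser can recover the exact digits afterwards
v1 = decimal.Decimal(json.loads(as_string)["decm_scale"])

# number form with a default parse: rounds to the nearest float
v2 = json.loads(as_number)["decm_scale"]

# number form parsed losslessly: requires parse_float support in the decoder
v3 = json.loads(as_number, parse_float=decimal.Decimal)["decm_scale"]
```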


Anyway, no pressure to use msgspec over any other tool. If you decide it may be a good fit, I'm always happy to answer questions or provide tips.

BuzzCutNorman commented 1 year ago

I have been playing with the three JSON libraries mentioned in this issue, and thought it might be good to publish my somewhat-working versions of the SDK utilizing them as draft PRs. I promise all three completed pytest locally (ok, except the S3 test).

I have been doing some testing locally with tap-stackoverflow-sampledata and tap-mssql--buzzcutnorman, extracting the StackOverflow postlinks file data, which has 8 million rows in it. To run both taps utilizing the SDK, I created a test Meltano project and installed them, then activated each tap's venv and upgraded the SDK to a local version that utilizes one of the JSON libraries. The whole process looks like this:

meltano install extractor tap-stackoverflow-sampledata --clean
.\.meltano\extractors\tap-stackoverflow-sampledata\venv\Scripts\activate.ps1 
pip install --upgrade -e ..\..\MeltanoContribute\sdk\   
deactivate

I then ran meltano invoke tap-mssql | Out-Null or meltano invoke tap-stackoverflow-sampledata | Out-Null and noted the sync_duration. Below are the results. Changing out the JSON libraries in format_message led to some time gains.

tap                           simplejson  msgspec  orjson  rapidjson
tap-mssql                         960.95   908.82  884.73     946.28
tap-stackoverflow-sampledata     1009.45   847.67  829.91     866.97

Here are some notes I have on all three.

msgspec

msgspec requires Python 3.8 or greater. I bumped the lowest version of Python in the pyproject.toml to 3.8 so I could add it. json.encode generates bytes instead of a str; if you run a .decode() before returning, everything works. One nice touch was json.format with the option indent=0, which returns JSON on one line and adds a single space between items. This matches up with json.dumps. The tap-sqlite tests failed, along with invoking tap-mssql. Both taps returned the following error:

TypeError: Only dicts with str-like or int-like keys are supported

I tracked this down to the fact that SQLAlchemy results return the type <class 'sqlalchemy.sql.elements.quoted_name'> for the column names. This can be corrected in post_process by converting all keys to str. I was able to set the default to str as before with enc_hook=str. I don't have any handling for decimals currently; they are going out as strings, which causes JSONSchema validation errors if you try a meltano run to a target. There is a nice way to append \n to rows; I just haven't gotten that in place yet.
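
For context, the stdlib encoder happily accepts str subclasses as dict keys, which is part of why the quoted_name keys only surfaced as a problem here; QuotedName below is a hypothetical stand-in for SQLAlchemy's type, not the real class:

```python
import json

class QuotedName(str):
    """Hypothetical stand-in for sqlalchemy.sql.elements.quoted_name."""

record = {QuotedName("id"): 1, "name": "ada"}

# the stdlib encoder treats str subclasses as ordinary string keys
out = json.dumps(record)
```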

orjson

Adding orjson was uneventful. It also returns bytes instead of a str, but just like with msgspec, if you do a .decode() before you return, everything works OK. One thing of note is that the JSON returned has no spaces, which is different from the base json or simplejson library. There were a number of tests that check against JSON strings which had to be updated. I was able to set the default to str as before. I do not have any handling of decimals in place, so they are going out as str at the moment, which will lead to the same JSONSchema validation error on the target side as I mentioned in my msgspec write-up. There is a nice way to append \n to the JSON string, which I did put in place. I did run into issues with non-str keys like I did with msgspec, but was able to get around it by adding the json.OPT_NON_STR_KEYS option.

rapidjson

Adding rapidjson was uneventful. The main issue is that Python tooling doesn't seem to understand a thing about it: no autocomplete or info on methods in VSCode. It runs and works, but I miss the visual cues. A str is returned by the .dumps method, which is nice. All the whitespace is removed from the JSON string, just like orjson. There were a number of tests that check against JSON strings which had to be updated. I was able to set the default to str as before. The big win with rapidjson is support for decimals, just like simplejson.

WillDaSilva commented 1 year ago

Thanks for the informative writeup @BuzzCutNorman!

msgspec requires Python 3.8 or greater.

Worth mentioning that the SDK will be dropping support for Python 3.7 soon, since Python 3.7 reaches end-of-life on 2023-06-27.

jcrist commented 1 year ago

A few notes on your msgspec implementation:

One nice touch was the json.format and option indent=0 which returns JSON on one line and adds a single space between items.

While this works, it does have a performance cost. If you're trying to optimize JSON performance you likely don't want whitespace anyway, and comparing msgspec with the extra format call to other libraries without it unnecessarily hampers msgspec's performance.

Tracked this down to the fact SQLAlchemy results return the type <class 'sqlalchemy.sql.elements.quoted_name'> for the column names. This can be corrected in post_process by converting all keys to str.

I'd be happy to push a fix upstream for this, supporting string subclasses as keys in a dict seems like a fine thing to support builtin. I've opened https://github.com/jcrist/msgspec/issues/450 to track this.

I don't have any handling for decimals currently. They are going out as strings which is causing JSONSchema validation errors if you try a meltano run to a target.

As I mentioned above, I highly recommend using strings to represent decimal types, as encoding them as numbers makes it harder for downstream tools to decode without losing precision. That said, we have an open issue for this, and if it's a blocker for meltano's use case I'd be happy to push up a fix (it's pretty simple to support).

BuzzCutNorman commented 1 year ago

I'd be happy to push a fix upstream for this, supporting string subclasses as keys in a dict seems like a fine thing to support builtin. I've opened https://github.com/jcrist/msgspec/issues/450 to track this.

@jcrist Thank you πŸ™ for creating that case I will update it with the info I have.

BuzzCutNorman commented 1 year ago

One nice touch was the json.format and option indent=0 which returns JSON on one line and adds a single space between items.

While this works, it does have a performance cost. If you're trying to optimize JSON performance you likely don't want whitespace anyway, and comparing msgspec with the extra format call to other libraries without it unnecessarily hampers msgspec's performance.

Here are the updated times for msgspec with the json.format step removed.

tap                           simplejson  msgspec  orjson  rapidjson
tap-mssql                         960.95   802.05  884.73     946.28
tap-stackoverflow-sampledata     1009.45   825.81  829.91     866.97

Comparison between the two msgspec times, with and without json.format:

msgspec

tap                           with format  without format  difference
tap-mssql                          908.82          802.05      106.77
tap-stackoverflow-sampledata       847.67          825.81       21.86
BuzzCutNorman commented 1 year ago

I'd be happy to push a fix upstream for this, supporting string subclasses as keys in a dict seems like a fine thing to support builtin. I've opened https://github.com/jcrist/msgspec/issues/450 to track this.

This has been resolved in https://github.com/jcrist/msgspec/pull/454. If you want to try it out before the next release, you can install the main branch from GitHub.

@jcrist Thank you 🙏 for getting this added so quickly. I did some testing this weekend and more this morning, and it is working great. No need to convert all keys to str in the post_process method. I will remove that code from the msgspec PR today.

BuzzCutNorman commented 1 year ago

Some of you might have already known this, but I just learned we don't need to run .decode() on the bytes returned by msgspec or orjson if we write() the results of format_message to sys.stdout.buffer. I didn't see anything in the Singer spec that says the messages have to be str type. I did check, and the listen and _process_lines methods in the SingerReader class are both capable of dealing with bytes, though it seems the lines from file_input get converted to str sometime before they hit json.loads.
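
A minimal sketch of that idea with the stdlib json module (msgspec or orjson would produce the bytes directly, with no .encode() needed):

```python
import json
import sys

message = {"type": "RECORD", "stream": "users", "record": {"id": 1}}

# serialize straight to bytes and append the newline, skipping any
# bytes -> str -> bytes round trip on the way to stdout
line = json.dumps(message).encode("utf-8") + b"\n"

# plain-text streams (e.g. under test capture) may lack .buffer
buffer = getattr(sys.stdout, "buffer", None)
if buffer is not None:
    buffer.write(line)
    buffer.flush()
```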

BuzzCutNorman commented 1 year ago

I was wondering if there is a reason simplejson was not put into the singer_sdk.io_base.SingerReader._process_lines() method to match its use in the singer_sdk._singerlib.format_message() function?

jcrist commented 1 year ago

As I mentioned above, I highly recommend using strings to represent decimal types, as encoding them as numbers makes it harder for downstream tools to decode without losing precision. That said, we have an open issue for this, and if it's a blocker for meltano's use case I'd be happy to push up a fix (it's pretty simple to support).

This issue has now been resolved (PR here). I still recommend encoding decimal values as strings for cross-tool compatibility, but the upcoming msgspec release will include support for encoding and decoding decimal.Decimal values to/from JSON numbers instead of just strings.

BuzzCutNorman commented 1 year ago

I just updated the SDK PR #1784 with msgspec to use version 0.17.0 which has all the fixes and features mentioned above. The format_message function is using enc_hook=str (which converts all unknown datatypes to string) and decimal_format="number" (which returns beautifully formatted decimal numbers). Please give it a look or try and let me know what you think.

A big THANKS πŸ™to @jcrist

BuzzCutNorman commented 1 year ago

Does anyone have any objections to me closing the SDK PRs for orjson and rapidjson?

edgarrmondragon commented 1 year ago

I was wondering If there is a reason simplejson was not put into the singer_sdk.io_base.SingerReader._process_lines() method to match its use in the singer_sdk._singerlib.format_message() function?

Just seeing this. The answer is simply oversight πŸ˜….

Does anyone have any objections to me closing the SDK PRs for orjson and rapidjson?

@BuzzCutNorman None at all, I think folks (including myself) in this discussion feel more inclined to further investigate msgspec as our serde library.

BuzzCutNorman commented 1 year ago

@edgarrmondragon Cool, I will delete those draft PRs tomorrow. Thanks, for answering the question about simplejson and singer_sdk.io_base.SingerReader._process_lines(). Should I add msgspec to singer_sdk.io_base.SingerReader._process_lines() in the SDK PR with msgspec?

s7clarke10 commented 1 year ago

@edgarrmondragon Cool, I will delete those draft PRs tomorrow. Thanks, for answering the question about simplejson and singer_sdk.io_base.SingerReader._process_lines(). Should I add msgspec to singer_sdk.io_base.SingerReader._process_lines() in the SDK PR with msgspec?

Thank you for your work on this @BuzzCutNorman

edgarrmondragon commented 1 year ago

Should I add msgspec to singer_sdk.io_base.SingerReader._process_lines() in the SDK PR with msgspec?

@BuzzCutNorman Yeah, I'm very curious if there are gains in deserialization too. Although I don't think it's a blocker for the serialization side, so feel free to add it or punt it as you see fit 😁.

BuzzCutNorman commented 1 year ago

@jcrist I have been working on adding msgspec into the SDK SingerReader.deserialize_json method. It was easy to add as a replacement. The only functionality I have not been able to reproduce is the parse_float option of json.loads. Currently, json.loads in the deserialize_json method looks like this:

            return json.loads(  # type: ignore[no-any-return]
                line,
                parse_float=decimal.Decimal,
            )

I set up a dec_hook function to attempt changing float to decimal.Decimal. It didn't work, but to be honest I was not expecting it to, since the msgspec documentation sounded like dec_hook is only called for non-supported data types. I mention it now for full disclosure of what I have tried, and also in case this should have worked and I just had my code wrong.

def dec_hook(type: t.Type, obj: t.Any) -> t.Any:
    # `type` here is the value of the custom type annotation being decoded
    if type is float:
        return decimal.Decimal(obj)
    try:
        return str(obj)
    except Exception:
        # Raise a NotImplementedError for other types
        raise NotImplementedError(f"Objects of type {type} are not supported")

I looked at convert, but it didn't seem like it was meant for this scenario. I was wondering if you had any suggestions on how I might reproduce the parse_float functionality using msgspec.

Here is an example:

>>> import decimal, json, msgspec
>>> encoder = msgspec.json.Encoder(enc_hook=str, decimal_format="number")
>>> decoder = msgspec.json.Decoder(dec_hook=str)
>>> x = decimal.Decimal('1.3')    
>>> extract = {"id": 1, "column1": x}
>>> line = encoder.encode(extract)
>>> print(line)
b'{"id":1,"column1":1.3}'
>>> json.loads(line, parse_float=decimal.Decimal)
{'id': 1, 'column1': Decimal('1.3')}
>>> decoder.decode(line)
{'id': 1, 'column1': 1.3}

Using Structs, I can generate a dict that is identical to the parse_float example. I am using msgspec.defstruct since the data format is unknown until we get a jsonschema message. We would need to find a way to convert the jsonschema into an iterable to pass to msgspec.defstruct, to create a Struct to use during the decode process. Here is an extension of the example:

>>> decoder_ring = msgspec.defstruct('myStream',[("id", int), ("column1", decimal.Decimal)])            
>>> msgspec.structs.asdict(msgspec.json.decode(line, type=decoder_ring))
{'id': 1, 'column1': Decimal('1.3')}
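
A sketch of one way that jsonschema-to-defstruct conversion could start, assuming only scalar property types (the mapping and helper below are hypothetical, not SDK code):

```python
import decimal
import typing as t

# hypothetical mapping from jsonschema scalar types to Python types,
# with "number" deliberately parsed as Decimal to keep precision
JSONSCHEMA_TO_PYTHON = {
    "integer": int,
    "number": decimal.Decimal,
    "string": str,
    "boolean": bool,
}

def fields_from_schema(properties: dict) -> list:
    """Turn a SCHEMA message's properties into (name, type) pairs for defstruct."""
    fields = []
    for name, spec in properties.items():
        json_type = spec.get("type", "string")
        if isinstance(json_type, list):  # e.g. ["number", "null"]
            json_type = next(x for x in json_type if x != "null")
        fields.append((name, JSONSCHEMA_TO_PYTHON.get(json_type, t.Any)))
    return fields
```

The resulting list could then be fed to msgspec.defstruct to build the per-stream decoder described above.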
BuzzCutNorman commented 1 year ago

I was wondering if we should change the title of this issue from "[Feature]: Faster JSON dumps in format_message with orjson " to "[Feature]: Faster JSON dumps in format_message with msgspec"

jcrist commented 1 year ago

The only functionality I have not been able to reproduce is the parse_float option of json.loads

There currently isn't a clean way of doing this, but it's a feature I'd be happy to add. I've opened https://github.com/jcrist/msgspec/issues/507 to track this. I'd expect this to be a fairly quick feature to add, should be able to get it out with the upcoming release next week.

Using Structs I can generate a dict that is identical to the parse_float example. I am using msgspec.defstruct since the data format is unknown until we get a jsonschema message.

If the decoding side knows the expected schema (either statically, or dynamically as you've mentioned here), then parsing into a msgspec.Struct object will be the fastest way to decode assuming the consumers can work with the struct directly. Otherwise you might want to decode into a TypedDict (which is a dict at runtime). Note that this also does type validation of the inputs, which would then let downstream consumers avoid doing unnecessary type checking.

If knowing the schema beforehand (or converting the schema into python type annotations) is tricky, then using the parse_float option as described above should be doable and would match your current behavior. This will only be a little bit slower than parsing into a struct or typed dict, but won't do any type validation.

BuzzCutNorman commented 1 year ago

That is funny, I was just writing an update to say I got a kind-of-working method for converting SingerMessageType.SCHEMA messages, which are jsonschema formatted, into a Struct, which during the process types all "type": "number" fields as Decimal via the Struct.

jhogendorn commented 1 year ago

I thought I'd just drop this in here for people's awareness, since it hasn't been raised so far.

https://github.com/simdjson/simdjson

As far as I'm aware, simdjson is by far the most performant json library.

BuzzCutNorman commented 1 year ago

Looking at simdjson, there are three projects that make it usable in Python. They are:

libpy_simdjson The project has not been updated since October 2020.

cysimdjson The project has been recently updated and seems to be very active. Given the note on performance, I don't know if it will be a good fit. I really appreciate that they added this statement right up front:

Trade-offs


The speed of cysimdjson is based on these assumptions:

  1. The output of the parser is read-only, you cannot modify it
  2. The output of the parser is not Python dictionary, but lazily evaluated dictionary-like object
  3. If you convert the parser output into a Python dictionary, you will lose the speed

If your design is not aligned with these assumptions, cysimdjson is not a good choice.

pysimdjson This is a recently updated project that has good documentation. I did find that it currently does not allow for decimals, only floats. This is an issue that has been open since August of 2020. The main simdjson project needs to make a change before this can be added to pysimdjson.

s7clarke10 commented 4 months ago

Hi, I have fully implemented msgspec support (including singer_decimal) in the original pipelinewise-singer-python framework. I found moving from the orjson used by pipelinewise to msgspec sped up my ingestion by 13.3%, shaving approximately one minute off a 7:30 load time for some large tables from Oracle to Snowflake.

https://github.com/s7clarke10/pipelinewise-singer-python/releases/tag/4.0.0

I intend to start updating a number of taps that I maintain, moving them from the older singer-python framework to this latest patched version.

edgarrmondragon commented 4 months ago

That's great! And FWIW I think there may be even bigger gains to be found when using msgspec in targets, but I've yet to confirm :)