s7clarke10 opened this issue 2 years ago
One of the things in the original pipelinewise-singer-python was the use of orjson for doing json dumps as it is a lot faster.
That's right, although the SDK never used that, since it depended on an older version of pipelinewise-singer-python (see https://github.com/meltano/sdk/pull/738), and there was a bug in orjson that was a deal-breaker for us (see https://github.com/ijl/orjson/issues/251). It seems to have been fixed, but pipelinewise-singer-python is stuck with an older version of orjson.
But, after #979, we can perhaps explore using orjson directly.
Also, there's multiple ways to tackle performance in Singer pipelines, and we've already received feedback about improvements using BATCH
messages where SerDe of Singer messages is not a bottleneck.
That's great feedback @edgarrmondragon, also thanks for taking the time to provide the back story re orjson.
This has been updated in my fork of pipelinewise-singer-python. The method is BatchMessage.format_message:
def format_message(message, option=0):
    def default(obj):
        if isinstance(obj, decimal.Decimal):
            return int(obj) if float(obj).is_integer() else float(obj)
        raise TypeError
    return orjson.dumps(message.asdict(), option=option, default=default)
https://github.com/mjsqu/pipelinewise-singer-python/commit/e31c16895c8900dad0bd96b98c43f9a5c6137905
Testing a half-million (500,000) row table using a branch in my tap-db2 variant (a rewrite using the Meltano SDK), the cProfile results show significant time devoted to writing record messages:
ncalls tottime percall cumtime percall filename:lineno(function)
948/1 0.023 0.000 215.936 215.936 {built-in method builtins.exec}
1 0.000 0.000 215.936 215.936 cprofile_db2_sdk.py:2(<module>)
1 0.000 0.000 215.254 215.254 core.py:1128(__call__)
1 0.000 0.000 215.254 215.254 core.py:987(main)
1 0.000 0.000 215.253 215.253 core.py:1393(invoke)
1 0.000 0.000 215.253 215.253 core.py:709(invoke)
1 0.000 0.000 215.253 215.253 tap_base.py:442(cli)
1 0.001 0.001 214.917 214.917 tap_base.py:402(sync_all)
1 0.312 0.312 214.271 214.271 core.py:1159(sync)
532966 3.020 0.000 213.958 0.000 core.py:1041(_sync_records)
532965 1.301 0.000 189.948 0.000 core.py:856(_write_record_message)
1065930 4.345 0.000 165.491 0.000 core.py:823(_generate_record_messages)
The sync time in seconds for the same table, Meltano SDK vs tap-db2 1.0.4:
SDK: 126.02 seconds; 1.0.4: 39.40 seconds
@edgarrmondragon - orjson could be a huge benefit when it comes to database taps, especially as they usually have large volumes and run significant numbers of streams, producing many core._write_record_message calls.
While we're considering alternatives for JSON reading/writing in Python, I'll call out msgspec. Its benchmarks show it being significantly more performant than orjson. It's also a smaller library.
I haven't personally used it, but it (along with others) is worth considering before we choose to use a third-party library to handle JSON writing in the SDK.
Latest msgspec only seems to support Python 3.8+, so if we decide on it, then it'd be good to follow https://github.com/meltano/sdk/issues/1659 with this.
orjson also supports a couple more platforms, but they are probably not a big deal, and msgspec is explicitly skipping wheels for those.
Having updated the SDK in my own fork to use orjson:
https://github.com/mjsqu/sdk/tree/feature/orjson_write_messages/singer_sdk
I can see some speed benefits; however, the simplejson module has an advantage in terms of serialization of the decimal.Decimal type: it simply outputs the full value of the decimal.Decimal to the JSON, without wrapping it in quotes.
simplejson has an option called use_decimal=True which, when used, outputs the full value of the number:
>>> simplejson.dumps(decimal.Decimal("1111111111111.333333333333333333"),use_decimal=True)
'1111111111111.333333333333333333'
Digging down into the module code, this is because the core encoder.py uses str() when writing out the serialized object, not before, so the decimal.Decimal remains that type for longer (beginning to confuse myself here, tbh):
https://github.com/simplejson/simplejson/blob/master/simplejson/encoder.py#L521-L522
elif _use_decimal and isinstance(value, Decimal):
yield buf + str(value)
And str on a decimal.Decimal keeps all the precision:
>>> str(decimal.Decimal("1111111111111.333333333333333333"))
'1111111111111.333333333333333333'
orjson does not handle decimal.Decimal types natively:
>>> orjson.dumps(decimal.Decimal("1111111111111.333333333333333333"))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: Type is not JSON serializable: decimal.Decimal
Implementing a default function, as described in the orjson README.md:
>>> import orjson,decimal
>>> def default(obj):
... if isinstance(obj, decimal.Decimal):
... return str(obj)
... raise TypeError
...
>>> orjson.dumps(decimal.Decimal("1111111111111.333333333333333333"),default=default)
b'"1111111111111.333333333333333333"'
The result is wrapped in quotes and would end up in JSON as:
"record":{"decm_scale": "1111111111111.333333333333333333"}
At this point, we could try to use float to handle decimal.Decimal; however, precision is lost:
>>> import orjson,decimal
>>> def default(obj):
... if isinstance(obj, decimal.Decimal):
... return float(obj)
... raise TypeError
...
>>> orjson.dumps(decimal.Decimal("1111111111111.333333333333333333"),default=default)
b'1111111111111.3333'
msgspec appears to handle decimal.Decimal out-of-the-box; however, it still serializes it as a string:
>>> msgspec.json.encode(decimal.Decimal("1111111111111.333333333333333333"))
b'"1111111111111.333333333333333333"'
msgspec allows the use of handlers, but again you are left with the choice of Python native types, which either stringify or lose precision.
Using msgspec or orjson, it isn't possible to exactly mirror the results of simplejson in terms of the RECORD messages:
| module | resulting RECORD key:value |
| --- | --- |
| simplejson | `{"decm_scale": 1111111111111.333333333333333333}` |
| orjson or msgspec | `{"decm_scale": "1111111111111.333333333333333333"}` |
Further reading on SO: https://stackoverflow.com/questions/1960516/python-json-serialize-a-decimal-object
@edgarrmondragon @WillDaSilva - I guess the questions for these two improved modules are:
@mjsqu thanks for that amazing deep dive!
I think we're gonna have to live with the loss of precision. Probably very few taps are de-serializing number values in API responses to decimal.Decimal, or at least we don't do that in the SDK. We're de-serializing as float and thus serializing to valid JSON number values, so at least schema validation has not been a concern.
In order not to break taps that do rely on use_decimal=True serialization, we'd need to conform decimal types to float as part of _conform_primitive_property: https://github.com/meltano/sdk/blob/dfcec5fb4f26329f6f6963626b51003296240e0d/singer_sdk/helpers/_typing.py#L469
With that in mind, we'd sacrifice numeric precision for a minority of taps for a gain in performance.
Also, I really like msgspec's union de-serialization, and it's close to what I tried to go for in a Rust implementation of the Singer message spec:
import typing as t

import msgspec

class Message(msgspec.Struct, tag_field="type", tag=str.upper):
    pass

class Record(Message):
    stream: str
    record: dict[str, t.Any]

class Schema(Message):
    stream: str
    schema: dict[str, t.Any]

msg = Record("users", {"id": 1})
buf = msgspec.json.encode(msg)
roundtrip = msgspec.json.decode(buf, type=t.Union[Record, Schema])
assert isinstance(roundtrip, Record)
@edgarrmondragon I ran across rapidjson. It isn't as fast as msgspec, but it is close to ujson. The neat thing is it handles decimal.Decimal just like simplejson.
>>> import decimal
>>> import rapidjson
>>>
>>> KEY = "some_stuff"
>>> VALUE= decimal.Decimal("1111111111111.333333333333333333")
>>>
>>> record={KEY: VALUE}
>>> print(f"{KEY}: {VALUE} {type(VALUE)}\n")
some_stuff: 1111111111111.333333333333333333 <class 'decimal.Decimal'>
>>> fast_and_precise = rapidjson.NM_NATIVE | rapidjson.NM_DECIMAL | rapidjson.NM_NAN
>>> iso_datetime = rapidjson.DM_ISO8601
>>> rd = rapidjson.dumps(record, number_mode=fast_and_precise, datetime_mode=iso_datetime)
>>> print(f"{rd}\n")
{"some_stuff":1111111111111.333333333333333333}
>>>
You can see rapidjson performance numbers here: https://python-rapidjson.readthedocs.io/en/latest/benchmarks.html
I would really like to advocate for accurate movement of decimal data. We can't assume we understand the usage of the data, or whether the accuracy of the scale or precision is important. For the trust of the overall Meltano ecosystem, taps, and targets, I would advocate accuracy over speed.
It is for this reason that we are using the Singer.Decimal notation, to ensure we don't get rounding in our decimal and float datatypes.
If rapidjson is a mainstream library going forward and it supports decimals correctly, unlike orjson, I would advocate picking that. Alternatively, I wonder if the maintainers of orjson would be open to a PR to fully support decimal data?
Oh, the other thought: I wonder if msgspec would be open to supporting decimal data without the double quotes? There seems to be some discussion about this as an issue on the msgspec repo: https://github.com/jcrist/msgspec/issues/349
If rapidjson is a mainstream library going forward and it supports decimals correctly, unlike orjson, I would advocate picking that.
I really like how python-rapidjson works, but the base rapidjson repo that it wraps hasn't seen a version change since 2016, even though there are recent commits. That makes me feel uneasy. I worry the lack of versioning makes pinning to a particular version in poetry useless, and could lead to disaster if a change to rapidjson introduces an issue for SDK taps and targets.
For the trust of the overall Meltano ecosystem, taps, and targets, I would advocate accuracy over speed.
I agree with this 100 percent.
Alternatively I wonder if the maintainers of orjson would be open to a PR to fully support decimal data?
There have been issues raised about adding decimals to orjson. Here is a link I found:
https://github.com/ijl/orjson/issues/172: feature request for dumping decimals
It looks like there was a PR, but it was not merged because no clear decision was made by the authors of the PR on whether to serialize as a string or as a C or Rust decimal.
Oh, the other thought is I wonder if msgspec would be open to supporting the Decimal Data without the double quotes?
I am having a hard time understanding how to use msgspec in the SDK. It wants everything to be in a struct that gives it Python type hints for all the elements. It looks like this is needed for both encoding and decoding, which to me means it is not compatible with jsonschema and seems to act more like a replacement.
https://jcristharif.com/msgspec/supported-types.html#decimal
>>> import decimal
>>> x = decimal.Decimal("1.2345")
>>> msg = msgspec.json.encode(x)
>>> msg
b'"1.2345"'
>>> msgspec.json.decode(msg, type=decimal.Decimal)
Decimal('1.2345')
>>> msgspec.json.decode(b'"oops"', type=decimal.Decimal)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
msgspec.ValidationError: Invalid decimal string
I am still reading and playing with msgspec, and my thoughts on this might change as I learn more.
Hi, msgspec author here, following the link back here from https://github.com/jcrist/msgspec/issues/440.
It wants everything to be in a struct that gives it python type hints for all the elements. it looks like this is needed for both encoding and decoding which to me means it is not compatible with jsonschema and seems to act more like a replacement.
This isn't correct. If no types are specified (the default), msgspec.json.decode works like any other JSON library in Python, returning a mix of dict/list/int/float/str/bool/None objects:
In [1]: import msgspec
In [2]: msgspec.json.decode('{"hello": 1, "world": 2}') # works the same as any other json.loads
Out[2]: {'hello': 1, 'world': 2}
In [3]: msgspec.json.encode(["no", "need", "to", "use", "structs"]) # works the same as any other json.dumps
Out[3]: b'["no","need","to","use","structs"]'
There's no need to use structs for typed decoding if you don't want to.
The highest performance on the decoding side will come if you use msgspec for both decoding & validation, but you can also use msgspec just for serialization, same as you would any other json library in Python.
If you have any suggestions for how we could make that clearer in our docs, I'd appreciate them. This is covered both on the main page and in our usage docs, but since there was a misunderstanding here clearly we can do better :).
I can see some speed benefits, however the simplejson module has an advantage in terms of serialization of the decimal.Decimal type, it simply outputs the full value of the decimal.Decimal to the json, without wrapping in quotes
I'm not familiar with meltano and the needs of this project, but I'll note that most JSON libraries (in any language) don't support access to the raw serialized string value when decoding JSON numbers, which means few libraries will be able to decode a decimal value serialized as a number instead of a string without losing precision. If cross-language compatibility and avoiding precision loss are important for this project, I'd recommend considering moving to serializing decimal values as strings if possible. See also my response here: https://github.com/jcrist/msgspec/issues/440#issuecomment-1588066782.
Anyway, no pressure to use msgspec over any other tool. If you decide it may be a good fit, I'm always happy to answer questions or provide tips.
I have been playing with the three JSON libraries mentioned in this issue, and thought it might be good to publish my somewhat-working versions of the SDK utilizing them as draft PRs. I promise all three completed pytest locally (OK, except the S3 test).
I have been doing some testing locally with tap-stackoverflow-sampledata and tap-mssql--buzzcutnorman, extracting the StackOverflow postlinks file data, which has 8 million rows in it. To run both taps utilizing the SDK, I created a test Meltano project and installed them. Then I activated the tap's venv and upgraded the SDK to a local version that utilizes one of the JSON libraries. The whole process looks like this:
meltano install extractor tap-stackoverflow-sampledata --clean
.\.meltano\extractors\tap-stackoverflow-sampledata\venv\Scripts\activate.ps1
pip install --upgrade -e ..\..\MeltanoContribute\sdk\
deactivate
I then run meltano invoke tap-mssql | Out-Null or meltano invoke tap-stackoverflow-sampledata | Out-Null and note the sync_duration. Below are the results. Changing out the JSON libraries in format_message led to some time gains.
| tap | simplejson | msgspec | orjson | rapidjson |
| --- | --- | --- | --- | --- |
| tap-mssql | 960.95 | 908.82 | 884.73 | 946.28 |
| tap-stackoverflow-sampledata | 1009.45 | 847.67 | 829.91 | 866.97 |
Here are some notes I have on all three.
Msgspec requires Python 3.8 or greater. I bumped the lowest version of Python in the pyproject.toml to 3.8 so I could add it.
The json.encode generates bytes instead of a str. If you run a .decode() before returning, everything works. One nice touch was json.format with the option indent=0, which returns JSON on one line and adds a single space between items; this matches up with json.dumps. The tap-sqlite tests failed, along with invoking tap-mssql. Both taps returned the following error:
TypeError: Only dicts with str-like or int-like keys are supported
I tracked this down to the fact that SQLAlchemy results return the type <class 'sqlalchemy.sql.elements.quoted_name'> for the column names. This can be corrected in post_process by converting all keys to str. I was able to set the default to str as before with enc_hook=str. I don't have any handling for decimals currently; they are going out as strings, which causes JSON Schema validation errors if you try a meltano run to a target. There is a nice way to append \n to rows; I just haven't gotten that in place yet.
Adding orjson was uneventful. It also returns bytes instead of a str, but just like with msgspec, if you do a .decode() before you return, everything works OK. One thing of note is the JSON returned has no spaces, which is different from the base json or simplejson library. There were a number of tests that check against JSON strings which had to be updated. I was able to set the default to str as before. I do not have any handling of decimals in place, so they are going out as str at the moment, which will lead to the same JSON Schema validation error on the target side that I mentioned in my msgspec write-up. There is a nice way to append \n to the JSON string, which I did put in place. I did run into issues with non-str keys like I did with msgspec, but was able to get around it by adding the json.OPT_NON_STR_KEYS option.
Adding rapidjson was uneventful. The main issue is Python doesn't seem to understand a thing about it: no autocomplete or info on methods in VSCode. It runs and works, but I miss the visual cues. A str is returned by the .dumps method, which is nice. All the whitespace has been removed from the JSON string, just like orjson. There were a number of tests that check against JSON strings which had to be updated. I was able to set the default to str as before. The big win with rapidjson is support for decimals, just like simplejson.
Thanks for the informative writeup @BuzzCutNorman!
Msgspec requires Python 3.8 or greater.
Worth mentioning that the SDK will be dropping support for Python 3.7 soon, since Python 3.7 reaches end-of-life on 2023-06-27:
A few notes on your msgspec implementation:
One nice touch was json.format with the option indent=0, which returns JSON on one line and adds a single space between items.
While this works, it does have a performance cost. If you're trying to optimize JSON performance you likely don't want whitespace anyway, and comparing msgspec with the extra format call to other libraries without it unnecessarily hampers msgspec's performance.
Tracked this down to the fact SQLAlchemy results return the type <class 'sqlalchemy.sql.elements.quoted_name'> for the column names. This can be corrected in post_process by converting all keys to str.
I'd be happy to push a fix upstream for this; supporting string subclasses as keys in a dict seems like a fine thing to support built in. I've opened https://github.com/jcrist/msgspec/issues/450 to track this.
I don't have any handling for decimals currently. They are going out as strings which is causing JSONSchema validation errors if you try a meltano run to a target.
As I mentioned above, I highly recommend using strings to represent decimal types, as encoding them as numbers makes it harder for downstream tools to decode without losing precision. That said, we have an open issue for this, and if it's a blocker for meltano's use case I'd be happy to push up a fix (it's pretty simple to support).
I'd be happy to push a fix upstream for this, supporting string subclasses as keys in a dict seems like a fine thing to support builtin. I've opened https://github.com/jcrist/msgspec/issues/450 to track this.
@jcrist Thank you for creating that case; I will update it with the info I have.
One nice touch was the json.format and option indent=0 which returns JSON on one line and adds a single space between items.
While this works, it does have a performance cost. If you're trying to optimize JSON performance you likely don't want whitespace anyway, and comparing msgspec with the extra format call to other libraries without it unnecessarily hampers msgspec's performance.
Here are the updated times for msgspec with the json.format step removed:
| tap | simplejson | msgspec | orjson | rapidjson |
| --- | --- | --- | --- | --- |
| tap-mssql | 960.95 | 802.05 | 884.73 | 946.28 |
| tap-stackoverflow-sampledata | 1009.45 | 825.81 | 829.91 | 866.97 |
Comparison between the two msgspec times, with and without json.format:

| tap | with format | without format | difference |
| --- | --- | --- | --- |
| tap-mssql | 908.82 | 802.05 | 106.77 |
| tap-stackoverflow-sampledata | 847.67 | 825.81 | 21.86 |
I'd be happy to push a fix upstream for this, supporting string subclasses as keys in a dict seems like a fine thing to support builtin. I've opened https://github.com/jcrist/msgspec/issues/450 to track this.
This has been resolved in https://github.com/jcrist/msgspec/pull/454. If you want to try it out before the next release, you can install the main branch from GitHub.
@jcrist Thank you for getting this added so quickly. I did some testing this weekend and more this morning, and it is working great. No need to convert all keys to str in the post_process method. I will remove that code from the msgspec PR today.
Some of you might have already known this, but I just learned we don't need to run .decode() on the bytes returned by msgspec or orjson if we write() the results of format_message to sys.stdout.buffer. I didn't see anything in the Singer spec that says the messages have to be of str type. I did check, and the listen and _process_lines methods in the SingerReader class are both capable of dealing with bytes, though it seems the lines from file_input get converted to str sometime before they hit json.loads.
I was wondering if there is a reason simplejson was not put into the singer_sdk.io_base.SingerReader._process_lines() method, to match its use in the singer_sdk._singerlib.format_message() function?
As I mentioned above, I highly recommend using strings to represent decimal types, as encoding them as numbers makes it harder for downstream tools to decode without losing precision. That said, we have an open issue for this, and if it's a blocker for meltano's use case I'd be happy to push up a fix (it's pretty simple to support).
This issue has now been resolved (PR here). I still recommend encoding decimal values as strings for cross-tool compatibility, but the upcoming msgspec release will include support for encoding and decoding decimal.Decimal values to/from JSON numbers instead of just strings.
I just updated the SDK PR #1784 with msgspec to use version 0.17.0, which has all the fixes and features mentioned above. The format_message function is using enc_hook=str (which converts all unknown datatypes to string) and decimal_format="number" (which returns beautifully formatted decimal numbers). Please give it a look or a try and let me know what you think.
A big THANKS to @jcrist
Does anyone have any objections to me closing the SDK PRs for orjson and rapidjson?
Does anyone have any objections to me closing the SDK PRs for orjson and rapidjson?

I was wondering if there is a reason simplejson was not put into the singer_sdk.io_base.SingerReader._process_lines() method, to match its use in the singer_sdk._singerlib.format_message() function?
Just seeing this. The answer is simply oversight.
Does anyone have any objections to me closing the SDK PRs for orjson and rapidjson?
@BuzzCutNorman None at all. I think folks (including myself) in this discussion feel more inclined to further investigate msgspec as our serde library.
@edgarrmondragon Cool, I will delete those draft PRs tomorrow. Thanks for answering the question about simplejson and singer_sdk.io_base.SingerReader._process_lines(). Should I add msgspec to singer_sdk.io_base.SingerReader._process_lines() in the SDK PR with msgspec?
Should I add msgspec to singer_sdk.io_base.SingerReader._process_lines() in the SDK PR with msgspec?
Thank you for your work on this @BuzzCutNorman
Should I add msgspec to singer_sdk.io_base.SingerReader._process_lines() in the SDK PR with msgspec?
@BuzzCutNorman Yeah, I'm very curious if there are gains in deserialization too. Although I don't think it's a blocker for the serialization side, so feel free to add it or punt it as you see fit.
@jcrist I have been working on adding msgspec into the SDK SingerReader.deserialize_json method. It was easy to add in as a replacement. The only functionality I have not been able to reproduce is the parse_float option of json.loads. Currently, json.loads in the deserialize_json method looks like this:
return json.loads(  # type: ignore[no-any-return]
    line,
    parse_float=decimal.Decimal,
)
I set up a dec_hook function to attempt changing float to decimal.Decimal. It didn't work, but to be honest I was not expecting it to, since the msgspec documentation sounded like dec_hook is only called for non-supported data types. I mention it now for full disclosure of what I have tried, and also in case this should have worked and I just had my code wrong.
def dec_hook(type: t.Type, obj: t.Any) -> t.Any:
    # `type` here is the value of the custom type annotation being decoded
    if type is float:
        return decimal.Decimal(obj)
    else:
        try:
            return str(obj)
        except:
            # Raise a NotImplementedError for other types
            raise NotImplementedError(f"Objects of type {type} are not supported")
I looked at convert, but it didn't seem to me like it was meant for this scenario. I was wondering if you had any suggestions on how I might reproduce the parse_float functionality using msgspec.
Here is an example:
>>> import decimal, json, msgspec
>>> encoder = msgspec.json.Encoder(enc_hook=str, decimal_format="number")
>>> decoder = msgspec.json.Decoder(dec_hook=str)
>>> x = decimal.Decimal('1.3')
>>> extract = {"id": 1, "column1": x}
>>> line = encoder.encode(extract)
>>> print(line)
b'{"id":1,"column1":1.3}'
>>> json.loads(line, parse_float=decimal.Decimal)
{'id': 1, 'column1': Decimal('1.3')}
>>> decoder.decode(line)
{'id': 1, 'column1': 1.3}
Using Structs, I can generate a dict that is identical to the parse_float example. I am using msgspec.defstruct since the data format is unknown until we get a jsonschema message. We would need to find a way to convert the jsonschema into an iterable item to pass to msgspec.defstruct, to create a Struct to use during the decode process. Here is an extension of the example:
>>> decoder_ring = msgspec.defstruct('myStream',[("id", int), ("column1", decimal.Decimal)])
>>> msgspec.structs.asdict(msgspec.json.decode(line, type=decoder_ring))
{'id': 1, 'column1': Decimal('1.3')}
I was wondering if we should change the title of this issue from "[Feature]: Faster JSON dumps in format_message with orjson" to "[Feature]: Faster JSON dumps in format_message with msgspec".
The only functionality I have not been able to reproduce is the parse_float option of json.loads
There currently isn't a clean way of doing this, but it's a feature I'd be happy to add. I've opened https://github.com/jcrist/msgspec/issues/507 to track this. I'd expect this to be a fairly quick feature to add, should be able to get it out with the upcoming release next week.
Using Structs, I can generate a dict that is identical to the parse_float example. I am using msgspec.defstruct since the data format is unknown until we get a jsonschema message.
If the decoding side knows the expected schema (either statically, or dynamically as you've mentioned here), then parsing into a msgspec.Struct object will be the fastest way to decode, assuming the consumers can work with the struct directly. Otherwise you might want to decode into a TypedDict (which is a dict at runtime). Note that this also does type validation of the inputs, which would then let downstream consumers avoid doing unnecessary type checking.
If knowing the schema beforehand (or converting the schema into Python type annotations) is tricky, then using the parse_float option as described above should be doable and would match your current behavior. This will only be a little bit slower than parsing into a struct or typed dict, but won't do any type validation.
That is funny, I was just writing an update to say I got a kind-of-working method that converts SingerMessageType.SCHEMA messages, which are jsonschema formatted, to a Struct, typing all "type":"number" fields as Decimal via the Struct during the process.
I thought I'd just drop this in here for people's awareness, since it hasn't been raised so far: https://github.com/simdjson/simdjson
As far as I'm aware, simdjson is by far the most performant json library.
Looking at simdjson, there are three projects that make it usable in Python. They are:
- libpy_simdjson: the project has not been updated since October 2020.
- cysimdjson: the project has been recently updated and seems to be very active. Given the note on performance, I don't know if it will be a good fit. I really appreciate that they added this statement right up front:
Trade-offs
The speed of cysimdjson is based on these assumptions:
- The output of the parser is read-only, you cannot modify it
- The output of the parser is not Python dictionary, but lazily evaluated dictionary-like object
- If you convert the parser output into a Python dictionary, you will lose the speed
If your design is not aligned with these assumptions, cysimdjson is not a good choice.
- pysimdjson: this is a recently updated project that has good documentation. I did find that it currently does not allow for decimals, only floats. This issue has been open since August 2020; the main simdjson project needs to make a change before this can be added by pysimdjson.
Hi, I have fully implemented msgspec support (including singer_decimal), updated and supported in the original pipelinewise-singer-python framework. I found moving from orjson (used by pipelinewise) to msgspec sped up my ingestion by 13.3%, shaving approximately one minute off a 7:30 load time of some large tables from Oracle to Snowflake.
https://github.com/s7clarke10/pipelinewise-singer-python/releases/tag/4.0.0
I intend to start updating a number of taps which I maintain which use the older singer-python framework to this latest patched version.
That's great! And FWIW I think there may be even bigger gains to be found when using msgspec in targets, but I've yet to confirm :)
Feature scope
Taps (catalog, state, stream maps, etc.)
Description
Speed of writing messages can add up when you are dumping a large amount of data. One of the things in the original pipelinewise-singer-python was the use of orjson for doing JSON dumps, as it is a lot faster.
The particular line is this:
https://github.com/meltano/sdk/blob/796ccda470529253b5211b70742c814d2b20ec07/singer_sdk/_singerlib/messages.py#L165
We have a variant of pipelinewise-singer-python which uses orjson in favour of json.dumps for improved speed. The code for that section is as follows:
Sorry the formatting is all out, I couldn't work out how to get the code formatting working. Best look at the example below. https://github.com/mjsqu/pipelinewise-singer-python/blob/2a839d245b92d1ec6d5801a30f5280fa77b22d0c/singer/messages.py#L295
I note that while orjson is faster, it is not perfect; for example, we need to convert decimal data to a string. There are some other considerations mentioned with the module, but if they are not an issue: we have measured a considerable performance gain using this, and verified it with cProfile by looking at the time consumed.