risingwavelabs / risingwave


A 69-bit integer value in JSON could cause data loss when using kafka source #13047

Open ly9chee opened 12 months ago

ly9chee commented 12 months ago

Describe the bug

Hi there, recently I found some parse error logs in the compute node like the one below:

2023-10-24T15:48:19.469303701Z ERROR actor{otel.name="Actor 2740" actor_id=2740 prev_epoch=5303626091593728 curr_epoch=5303626134585344}:executor{otel.name="DmlExecutor AB400000001 (actor 2740)" actor_id=2740}:executor{otel.name="SourceExecutor AB400002716 (actor 2740)" actor_id=2740}:source_parser{actor_id=2740 source_id=2085}:parse_one{split_id="0" offset="8"}: risingwave_connector::parser: failed to parse message, skipping error=Protocol error: InvalidNumber at character 776 ('0')

The JSON message referenced in the log (partition 0, offset 8 of the Kafka topic) is skipped by the Kafka source, which ultimately results in data loss.

After some investigation of the source code, I realized this is a simd-json parse error caused by a field holding a large integer value, so I wrote a minimal Rust snippet to reproduce the error:

fn main() {
    // 327267367867168100000 needs 69 bits, so it fits in neither i64 nor u64.
    let mut d = br#"{"some": 327267367867168100000}"#.to_vec();
    let v: simd_json::BorrowedValue = simd_json::to_borrowed_value(&mut d).unwrap();
    println!("{}", v);
}

output:

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Error { index: 28, character: Some('0'), error: InvalidNumber }

In our Kafka upstream, the JSON data may contain large integer values inside nested objects; it is very hard to detect those values and correct them before ingestion.

{
    "fields":[
        {
            "field_id": "field1",
            "float_value": 327267367867168100000
        }
    ]
}

So, is this expected behavior? Or should this field be replaced with NULL instead of failing with a parse error?

Error message/log

No response

To Reproduce

No response

Expected behavior

No response

How did you deploy RisingWave?

Kubernetes

The version of RisingWave

PostgreSQL 9.5-RisingWave-1.3.0 (c4c31bdc5e8763ae65ec23293e8c07bdfd4ab4df)

Additional context

No response

xiangjinwu commented 12 months ago

https://datatracker.ietf.org/doc/html/rfc8259#section-6

This specification allows implementations to set limits on the range and precision of numbers accepted. Since software that implements IEEE 754 binary64 (double precision) numbers [IEEE754] is generally available and widely used, good interoperability can be achieved by implementations that expect no more precision or range than these provide, in the sense that implementations will approximate JSON numbers within the expected precision. A JSON number such as 1E400 or 3.141592653589793238462643383279 may indicate potential interoperability problems, since it suggests that the software that created it expects receiving software to have greater capabilities for numeric magnitude and precision than is widely available.

Note that when such software is used, numbers that are integers and are in the range [-(2**53)+1, (2**53)-1] are interoperable in the sense that implementations will agree exactly on their numeric values.
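A minimal Rust sketch of that 2^53 boundary, using the value from this issue:

fn main() {
    // 2^53 is the largest magnitude at which IEEE 754 binary64 can still
    // represent every consecutive integer exactly.
    let max_safe = 2f64.powi(53); // 9007199254740992
    assert_eq!(max_safe + 1.0, max_safe); // 2^53 + 1 rounds back down

    // The 69-bit value from this issue cannot round-trip through f64:
    let original: u128 = 327267367867168100000;
    let approx = original as f64;
    assert_ne!(approx as u128, original); // precision is lost
}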

Also note that proto3 maps 64-bit integers to JSON strings rather than JSON numbers: https://protobuf.dev/programming-guides/proto3/#:~:text=int64%2C%20fixed64%2C%20uint64,string
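For example, under proto3's JSON mapping an int64 field holding the value from this issue would be serialized with quotes (an illustrative sketch):

{"float_value": "327267367867168100000"}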

In the long term we do not want to be the bottleneck that limits JSON numbers to 2^53. But for good interoperability, it is better not to assume that every tool in the data pipeline can support more than that.

Or should this field be replaced with NULL instead of failing with a parse error?

Currently, a parse error during ingestion (source) causes the whole row to be skipped, while an expression error during computation (stream), including casting from a string to some type, fills that field with NULL. This may feel inconsistent, and thanks for bringing it up.

neverchanje commented 7 months ago

As the user suggests, simd-json 0.13.5 now allows us to convert very large integers into f64, with a potential loss of precision.

big-int-as-float: The big-int-as-float feature flag treats very large integers that won't fit into u64 as f64 floats. This prevents parsing errors if the JSON you are parsing contains very large numbers. Keep in mind that f64 loses some precision when representing very large numbers.
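A minimal sketch of what that looks like, assuming the feature is enabled in Cargo.toml (not verified against this exact setup):

// Sketch: assumes Cargo.toml declares
//   simd-json = { version = "0.13.5", features = ["big-int-as-float"] }
fn main() {
    // Same payload as the reproduction above. With big-int-as-float
    // enabled, the 69-bit integer parses as an f64 approximation
    // instead of failing with InvalidNumber.
    let mut d = br#"{"some": 327267367867168100000}"#.to_vec();
    let v = simd_json::to_borrowed_value(&mut d).expect("should parse with the feature enabled");
    println!("{}", v);
}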

But I still don't think that is a good way to handle such numbers. Converting them to a decimal type, e.g. https://crates.io/crates/bigdecimal, may be better.
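A minimal sketch of that alternative, assuming the bigdecimal crate as a dependency (how it would hook into the parser is left out):

use bigdecimal::BigDecimal;
use std::str::FromStr;

fn main() {
    // Parsing the raw digit string as an arbitrary-precision decimal
    // preserves the exact value instead of rounding to the nearest f64.
    let exact = BigDecimal::from_str("327267367867168100000").unwrap();
    println!("{}", exact); // prints the value unchanged, no precision loss
}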