naomijub / edn-rs

[DEPRECATED]: Crate to parse and emit EDN
https://crates.io/crates/edn-rs
MIT License
81 stars 14 forks source link

Numbers don't match with specification. #106

Closed Grinkers closed 10 months ago

Grinkers commented 1 year ago

Right now the numbers are

Int(isize),
UInt(usize),
Double(Double),

where:
isize/usize are architecture dependent (hardware pointer size)
Double(i64, u128)

https://github.com/edn-format/edn#integers https://github.com/edn-format/edn#floating-point-numbers

According to the spec, integers should be i64 and floating points should be f64. I don't think we can de arbitrary precision with N or M without software implementations. I think perhaps things should be i64 and f64 in the normal case. With #105 maybe we can match against N to i128? I don't think we can practically match the N/M spec without pulling in dependencies, however the current isize/usize is certainly off-spec compared to the expected types, depending on the target architecture.

Changing this would be a breaking pub API change. Has there been any precedence? Perhaps a 0.18.0?

naomijub commented 1 year ago

I don't mind using i64 and u64 for integers, and something like i128/u128 for BigInt. However, there is certainly a problem using f64 for decimal and big decimal as it breaks precision, ordering and hashing. It would require a library like ordered float to wrap that. @evaporei opinions?

evaporei commented 1 year ago

I have the same opinion for integers.

As for floating point, I'm ok with bringing ordered float back again into the codebase.

Grinkers commented 1 year ago

Is there a reason for a separate u64 on top of i64? I suppose it's not an issue because clojure will read 2^64-1 as a BigInt. When converting from a u64 to a string it should probably add the N character at the end if it is over 2^32. I think this sort of issue is probably more painful than just having an i64 and having the user code have a try_into() (there would likely need an if/match to determine Int vs UInt anyway)

I hadn't considered FP ordering and hashing. I'm not sure clojure handles arbitrary precision here either (cljs almost surely doesn't).

naomijub commented 1 year ago

Is there a reason for a separate u64 on top of i64? I suppose it's not an issue because clojure will read 2^64-1 as a BigInt. When converting from a u64 to a string it should probably add the N character at the end if it is over 2^32. I think this sort of issue is probably more painful than just having an i64 and having the user code have a try_into() (there would likely need an if/match to determine Int vs UInt anyway)

I hadn't considered FP ordering and hashing. I'm not sure clojure handles arbitrary precision here either (cljs almost surely doesn't).

Basically, I had an issue that the number was larger than i64::MAX. Why not just use all resources available and allow the Rust version to be compatible with larger numbers like i128?

Grinkers commented 1 year ago

Actually, clojure.edn/read-string automatically converts things over 2^32 to BigInt without the N. So we don't need to add it to the serialization, but it still might not be a bad idea. My worry was a u64 being serialized without error and clojure not being able to parse it.

I would be happy with usize/isize to u64/i64, consistent results on all architectures.

naomijub commented 11 months ago

Have you experimented with a branch for this?

Grinkers commented 11 months ago

I do not. I can start one though. I haven't thought about this in a little bit, but I think the TODOs and/or things to evaluate are

  1. pointer size to u64/i64 (breaking change)
  2. #[non_exhaustive] for Edn enum (breaking change)
  3. parse BigInt to u128/i128 (things like this wouldn't be breaking with non_exhaustive in the future)
  4. discuss/decide on FP (almost certainly a breaking change)
  5. error handle type casting, as right now it does c-style type casting, which will silently ignore bits and give very wrong answers. (behavior change) for example
    
    fn some_u8() -> Result<u8, EdnError> {
    let edn_str = "9002";
    // returns Ok(42)
    edn_rs::from_str(edn_str)
    }

fn some_i8() -> Result<i8, EdnError> { let edn_str = "214"; // returns Ok(-42) edn_rs::from_str(edn_str) }


I don't know what to do about floating point numbers. I think the current implementation is very dangerous though. Certainly allowing silent c-style casting of f64 to f32 should probably be removed. I would think just using f64 everywhere (as suggested by the edn spec) would be easiest. Doing this also ensures we're conforming to the IEEE-754 spec (by inheriting from rust's std). I believe this is also what is ensured in both clj and cljs by the jvm/js

Here's an example where we certainly do not conform (and have a fallible impl). Other edge cases would be rounding implementations, inf, -inf, and NaN

fn some_f64() -> Result<f64, EdnError> { let edn_str = "999999999999999999999.0"; // panicked at 'called Result::unwrap() on an Err value: ParseIntError { kind: PosOverflow }', src/edn/mod.rs:202:39 edn_rs::from_str(edn_str) }

naomijub commented 11 months ago

I do not oppose any of them, I just didn't understand number 4, what is FP?

Maybe the casting could be done with option as well, so you could have a Result casting and an option casting, which seems more rusty like the checked operations

naomijub commented 11 months ago

There might have some work on end-derive crate

Grinkers commented 11 months ago

I do not oppose any of them, I just didn't understand number 4, what is FP? floating-point. Sorry for not being more clear.

Trying to make Double conform to IEEE-754 is quickly going to end up just reinventing td::f64 and ordered_float (external crate). Using only f64 everywhere also helps with equality rules and seems to be what edn expects anyway. End users can handle the type conversions themselves (it would probably a domain specific with edge cases out of scope of a parsing library).

With number 2 on my list, we could add ArbitraryPrecision later down the road without breaking changes. Maybe there will be a nice crate or even a std impl at some point. I can't imagine an obvious use case to serialize arbitrary precision data like this, but it keeps the option available.

naomijub commented 11 months ago

Sounds like a solid plan. We did considered using ordered-float, but it was a heavy dependency at the time. Not sure how it looks now

Grinkers commented 11 months ago

ordered-float is pretty small, at least with default-features = false (the only extra dependency is num-traits).

I think I'll bring in ordered-float and see if it can work nicely without default-features. Then introduce our own default-features with the first addition being a feature sets. So to an end user they'll get correct sets by default, but somebody who cares about std only (and doesn't use sets) can opt-out.

Grinkers commented 10 months ago

113 and #114 resolve this issue.

Now might be a good time for a release.