naomijub / edn-rs

[DEPRECATED]: Crate to parse and emit EDN
https://crates.io/crates/edn-rs
MIT License
81 stars 14 forks source link

Can't parse symbols according to spec. More powerful "parse_edn" pattern matching? #105

Closed Grinkers closed 1 year ago

Grinkers commented 1 year ago

According to https://github.com/edn-format/edn#symbols the following is valid (and confirmed with clojure.edn/read-string).

(clojure.edn/read-string "-foobar")
-foobar
(type (clojure.edn/read-string "-foobar"))
clojure.lang.Symbol

Currently parse_edn is only matching on the first character, which is insufficient for this case. -foobar gets treated as a Number. https://github.com/naomijub/edn-rs/blob/8353fc4cbfb1938849ffed0a4a53951b95706f24/src/deserialize/parse.rs#L41

I believe it should be possible to be passing around a &str instead of &mut std::iter::Enumerate<std::str::Chars>, which would allow for str pattern matching. I don't know if this is a slippery slope eventually recreating regex. Before a big change, a discussion doesn't seem like a bad idea. There might be an even easier way or additional considerations.

Rational:

  1. New edge case
  2. It could help clean up #103
  3. If this was done, all the very expensive and heap allocating clone()s on the Enumerate could be removed.
  4. None of this is externally pub, so no other projects would break. I believe it would technically be possible to even have the return not be an owned String but str slices from the original &str too. This would make it so there's no heap allocations, but would also break existing code. Perhaps worth considering before 1.0.0.
Grinkers commented 1 year ago

On this same topic, here's another case where more powerful pattern matching is going to be required

#[test]
fn quoted_symbol() {
    let mut edn = "('(symbol))".chars().enumerate();
    let res = parse(edn.next(), &mut edn).unwrap();

    assert_eq!(
        res,
        Edn::List(List::new(vec![
            Edn::Symbol("'".to_string()),
            Edn::List(List::new(vec![Edn::Symbol("symbol".to_string()),]))
        ]))
    );
}

thread 'deserialize::parse::test::quoted_symbol' panicked at 'assertion failed: `(left == right)`
  left: `List(List([Symbol("'(symbol")]))`,
 right: `List(List([Symbol("'"), List(List([Symbol("symbol")]))]))`',```

expected result

(clojure.edn/read-string "('(symbol))")
(' (symbol))
Grinkers commented 1 year ago

The core issue of symbols not being parsed according to spec hasn't been resolved with the other resolved issues/PR. In particular something like (apply + '(1 2 3)) gets parsed as List([Symbol("apply"), Symbol("+"), Symbol("'(1"), UInt(2), UInt(3)]))

Do you have any thoughts about not using Enumerate for better pattern patching and removal of all the clone()s in parse.rs? Taking a look at it, even though parse is pub, the module is not https://github.com/naomijub/edn-rs/blob/8353fc4cbfb1938849ffed0a4a53951b95706f24/src/lib.rs#L43 so we could change things internally without breaking anything. Maybe I'm missing something, but it would be nice to look into the next characters without having to allocate a new string/iterator. For example, looking for 'n' or 'N' for #106.

Grinkers commented 1 year ago

I was actually able to do this without refactoring the Enumerate, but I still think it could be a good idea.

Grinkers commented 1 year ago

I don't know how much performance is a concern (I assume some, because benchmarks are advertised), but on top of making matching a lot cleaner, there's huge performance and memory usage gains by not using Enumerate.

https://github.com/naomijub/edn-rs/blob/bec6ea2d2b2613b9189366e30ee1fa43b78103c8/examples/edn_from_str.rs right now this example uses 52 heap allocations (6098 bytes in total).

changing it to

let edn_str = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.";

ends up being 231 allocations over 18,995 bytes. Each clone has to deep copy the entire String.

@naomijub was there a reason for Enumerate? Compared to json, I think EDN probably has a lot more concerns, but it's possible to do no_std (0 allocations). I don't think you can do EDN without escaping so no allocations is likely not possible. Using the same interface of returning owned Strings, I think the Lorem example would be only 70 allocations (~500 bytes?) using references internally. Looking at the competition, something like #[serde(borrow)], could even potentially go down to 1 allocation for the Vec.

This is beyond the recent mostly trivial PRs. If you're not against it, I'll take a poke at it later.