swaroopch / edn_format

EDN reader and writer implementation in Python, using PLY (lex, yacc)
https://swaroopch.com/2012/12/24/edn-format-python/
Other
131 stars 31 forks source link

HEX Numeric notation not supported #62

Closed hipitihop closed 4 years ago

hipitihop commented 4 years ago

Hex notation e.g. 0xFFDC73 causes traceback:

Sample EDN:

{"AMBER" {:id "AMBER" :name "Amber 80%" :color 0xFFDC73}}

pip details:

"edn-format": {
            "hashes": [
                "sha256:aa3cc3041497b7e22386f96c44658e589597d9d26df39c4a77bbac0c8091ea88"
            ],
            "index": "pypi",
            "version": "==0.6.3"
        }

  return edn_format.loads(edn_str)
File "/usr/local/lib/python3.6/site-packages/edn_format/edn_parse.py", line 161, in parse
  return p.parse(text, lexer=lex())
File "/usr/local/lib/python3.6/site-packages/ply/yacc.py", line 333, in parse
   return self.parseopt_notrack(input, lexer, debug, tracking, tokenfunc)
File "/usr/local/lib/python3.6/site-packages/ply/yacc.py", line 1120, in parseopt_notrack
   p.callable(pslice)
File "/usr/local/lib/python3.6/site-packages/edn_format/edn_parse.py", line 78, in p_map
   raise EDNDecodeError('Even number of terms required for map')
edn_format.exceptions.EDNDecodeError: Even number of terms required for map
bfontaine commented 4 years ago

Those aren’t valid EDN: https://github.com/edn-format/edn#integers

c00kiemon5ter commented 4 years ago

Yes, HEX notation is not defined by the spec - only "plain" integers. There are edn parsers that work with hex, though. Being liberal like that is good, but unless the communicating parties have agreed to that, they cannot claim that they talk edn. One should properly support hex through an extension.

Also see, https://github.com/edn-format/edn/issues/47

bfontaine commented 4 years ago

Yeah, this is why I dislike the EDN format from a parser implementation perspective: there’s no unique spec for it.

Anyway, do you want to try and make a pull-request about this? Integers are parsed here: https://github.com/swaroopch/edn_format/blob/0d4c2136663b57bef913b321313f2651ca6649a3/edn_format/edn_lex.py#L217-L222 You can add tests here: https://github.com/swaroopch/edn_format/blob/0d4c2136663b57bef913b321313f2651ca6649a3/tests.py#L93-L95

c00kiemon5ter commented 4 years ago

Yeah, this is why I dislike the EDN format from a parser implementation perspective: there’s no unique spec for it.

There is one unique spec. It is defined here: https://github.com/edn-format/edn

What implementations allow for, doesn't make the spec less of a spec - it makes the implementation more liberal. This is fine technically, but it does confuse reasoning about what is "actually" supported. This is solved very easily by reading the spec (which is short and simple). Technically, this could be solved with "strict" (or "lax") flag on the parser.

If we want to make edn_format more liberal, to accept integers in hex format, then this should be a separate token type (to easily use it in "lax" mode, or ignore it in "strict" mode), ie t_INTEGER_HEX. This will also make it easier to reason about the regex.

btw, edn_format already supports t_RATIO which is not defined in the spec.

bfontaine commented 4 years ago

By "spec" I meant something strict like a BNF grammar. The EDN document starts with:

Currently this specification is casual, as we gather feedback from implementors. A more rigorous e.g. BNF will follow.

We’re left with either that document or Clojure’s reference implementation. However, Clojure doesn’t respect that spec: it supports things that aren’t spec’d (like ratios https://github.com/edn-format/edn/issues/64 or NaN/Inf https://github.com/edn-format/edn/issues/2) and rejects things that are (https://github.com/edn-format/edn/issues/68).

With no update to that document since 2014, parser implementers are left alone.

bfontaine commented 4 years ago

@swaroopch It’s your call, your’re the owner 😃 What do you want to do here? Should edn_format support that notation?

The flag solution may be interesting, but it significantly complicates the code because it means dynamically changing the grammar.

I think it makes sense to support t_RATIO because it allows to represent some data that has no other representation (think 1/3). It’s a native type in Clojure and as such is dumped in its own representation when one uses pr-str. On the other hand, hexadecimal notation never occurs in dumped data because it’s just a different representation for an integer:

(pr-str 1/3) ; => "1/3"
(pr-str 0xFF) ; => "255"
swaroopch commented 4 years ago

@hipitihop @c00kiemon5ter @bfontaine Thanks for the great discussion on this topic 👍🏼

My inputs are: