Closed thiagokokada closed 4 years ago
I discovered a problem when using strings and any other data structure that is not a map (since my use case is mostly maps I didn't find this issue until now). If someone has a good solution for this I am all ears, since the solution I found is mostly duplicating the udump()
logic.
Fixed, the solution is quite ugly though.
I think this PR is ready to review.
Since we have mutual recursion going on here, I'm curious if you know the performance impact?
Sorry, I don't. However I think it should be fine for small EDNs (that is my use case).
I can do some performance tests if you want.
Since we have mutual recursion going on here, I'm curious if you know the performance impact?
Sorry, I don't. However I think it should be fine for small EDNs (that is my use case).
I can do some performance tests if you want.
This is an optional step. Would appreciate if you can take this additional request :)
I have a strange bug on this EDN:
>>> edn = open("bench.edn").read()
>>> from edn_format import loads, dumps
>>> dumps(loads(edn)) == edn
True
>>> dumps(loads(dumps(loads(edn)))) == edn
True
>>> dumps(loads(dumps(loads(edn), indent=2))) == edn
False
dump
ing it after loading it works fine but when I add indent=2
all I get is a vector of keys.
Without indent=2
(same file): https://gist.githubusercontent.com/bfontaine/c3463de9f1ef3a802e8fdce84bfb3adb/raw/6214e56f4772483dd5b326b31c2ec6301816a65a/without-indent.edn
With indent=2
: https://gist.githubusercontent.com/bfontaine/c3463de9f1ef3a802e8fdce84bfb3adb/raw/6214e56f4772483dd5b326b31c2ec6301816a65a/with-indent.edn
I compared the performances of dumping this file 10 × 1000 times on both the master
branch and this one (without indent
):
# master
5596006 function calls (5026006 primitive calls) in 4.095 seconds
5596006 function calls (5026006 primitive calls) in 3.922 seconds
5596006 function calls (5026006 primitive calls) in 3.918 seconds
5596006 function calls (5026006 primitive calls) in 4.055 seconds
5596006 function calls (5026006 primitive calls) in 4.188 seconds
5596006 function calls (5026006 primitive calls) in 4.303 seconds
5596006 function calls (5026006 primitive calls) in 4.044 seconds
5596006 function calls (5026006 primitive calls) in 4.058 seconds
5596006 function calls (5026006 primitive calls) in 3.940 seconds
5596006 function calls (5026006 primitive calls) in 3.997 seconds
# this branch
6071006 function calls (5501006 primitive calls) in 4.907 seconds
6071006 function calls (5501006 primitive calls) in 4.670 seconds
6071006 function calls (5501006 primitive calls) in 5.659 seconds
6071006 function calls (5501006 primitive calls) in 5.580 seconds
6071006 function calls (5501006 primitive calls) in 5.186 seconds
6071006 function calls (5501006 primitive calls) in 4.431 seconds
6071006 function calls (5501006 primitive calls) in 4.289 seconds
6071006 function calls (5501006 primitive calls) in 4.438 seconds
6071006 function calls (5501006 primitive calls) in 4.344 seconds
6071006 function calls (5501006 primitive calls) in 4.361 seconds
On another, larger file (200 dumps):
# master
6601806 function calls (5506206 primitive calls) in 4.298 seconds
6601806 function calls (5506206 primitive calls) in 4.049 seconds
6601806 function calls (5506206 primitive calls) in 4.004 seconds
6601806 function calls (5506206 primitive calls) in 3.977 seconds
6601806 function calls (5506206 primitive calls) in 4.483 seconds
6601806 function calls (5506206 primitive calls) in 4.453 seconds
6601806 function calls (5506206 primitive calls) in 3.814 seconds
6601806 function calls (5506206 primitive calls) in 3.768 seconds
6601806 function calls (5506206 primitive calls) in 3.846 seconds
6601806 function calls (5506206 primitive calls) in 5.382 seconds
# this branch
7682806 function calls (6587206 primitive calls) in 5.802 seconds
7682806 function calls (6587206 primitive calls) in 4.989 seconds
7682806 function calls (6587206 primitive calls) in 4.673 seconds
7682806 function calls (6587206 primitive calls) in 4.730 seconds
7682806 function calls (6587206 primitive calls) in 4.650 seconds
7682806 function calls (6587206 primitive calls) in 4.689 seconds
7682806 function calls (6587206 primitive calls) in 4.816 seconds
7682806 function calls (6587206 primitive calls) in 4.785 seconds
7682806 function calls (6587206 primitive calls) in 5.059 seconds
7682806 function calls (6587206 primitive calls) in 4.717 seconds
All this (small) overhead is due to kwargs
. If you add indent_step=0
to the arguments the overhead completely disappears:
diff --git a/edn_format/edn_dump.py b/edn_format/edn_dump.py
index cbd8669..1828ec5 100644
--- a/edn_format/edn_dump.py
+++ b/edn_format/edn_dump.py
@@ -92,7 +92,7 @@ def udump(obj,
sort_keys=False,
sort_sets=False,
indent=None,
- **kwargs):
+ indent_step=0):
kwargs = {
"string_encoding": string_encoding,
@@ -100,7 +100,7 @@ def udump(obj,
"sort_keys": sort_keys,
"sort_sets": sort_sets,
"indent": indent,
- "indent_step": kwargs.get("indent_step", 0)
+ "indent_step": indent_step,
}
if obj is None:
# on this branch, with the change above
6601806 function calls (5506206 primitive calls) in 4.249 seconds
(etc)
My bench script, if you want to play with it: https://gist.githubusercontent.com/bfontaine/c3463de9f1ef3a802e8fdce84bfb3adb/raw/297050dc4b790fd5a2b0328397e67907afdcde57/bench.py
I have a strange bug on this EDN:
Well, I will try to investigate however I don't know if I can fix. Even I don't really like this double recursion, and I will admit that is kind difficult to understand what is happening.
All this (small) overhead is due to kwargs. If you add indent_step=0 to the arguments the overhead completely disappears:
Well, back to a function argument I guess. Since the _indent_objs
is now private this maybe fine, however now indent_step
is an argument to udump
(this may be fine though, since the actual API is dumps
AFAIK).
This is what it transformed the EDN:
[
:_id :index :guid :isActive :balance :picture :age :eyeColor :name :company :email :phone :address :about :registered :latitude :longitude :tags :range :friends :greeting :favoriteFruit
:_id :index :guid :isActive :balance :picture :age :eyeColor :name :company :email :phone :address :about :registered :latitude :longitude :tags :range :friends :greeting :favoriteFruit
:_id :index :guid :isActive :balance :picture :age :eyeColor :name :company :email :phone :address :about :registered :latitude :longitude :tags :range :friends :greeting :favoriteFruit
:_id :index :guid :isActive :balance :picture :age :eyeColor :name :company :email :phone :address :about :registered :latitude :longitude :tags :range :friends :greeting :favoriteFruit
:_id :index :guid :isActive :balance :picture :age :eyeColor :name :company :email :phone :address :about :registered :latitude :longitude :tags :range :friends :greeting :favoriteFruit
:_id :index :guid :isActive :balance :picture :age :eyeColor :name :company :email :phone :address :about :registered :latitude :longitude :tags :range :friends :greeting :favoriteFruit
]
So instead of parsing the array with maps it simply get the keys instead. I will try to investigate what is happening here.
Looks like it’s the seq
call in if isinstance(obj, INDENTABLE_STRUCTURES)
: it’s meant for iterables, so when called on a dict it iterates over its keys rather than the pair of keys. You should add an indented version to the loads(dumps(…))
tests to catch that sort of thing: maybe modify check_roundtrip
so it checks both loads(dumps(…))
and loads(dumps(…, indent=2))
.
I also think it’s important to have code that’s easy to understand so one can easily extend it and/or fix it if there’s a bug. I can’t promess anything, but I should be able to look at this before the end of the week.
I have a strange bug on this EDN:
Fixed. I also add the bench.edn
file and created a test to ensure that this works as it should.
However, I don't like my solution (since it adds even another recursion), so if @bfontaine wants to try if there is a better way to fix it you're welcome.
In commit dcb75e3, I add support for indent=0
case. Before, indent=0
and indent=None
where the same, however now indent=0
pretty prints without any indentation.
This is the same behavior as json.dumps
.
While reviewing I found another issue:
import edn_format
d = [[[1,[2,3],[[4,[5,[6],7]],8],9,[[10]],11],12],[13]]
print(edn_format.dumps(d, indent=2))
[
[
1
2 3
[
4
5 [
6
] 7
] 8
9
[
10
]
11
] 12
13
]
Also:
>>> print(edn_format.dumps([1,2,3],indent=2))
[
1
2
3
]
>>> print(edn_format.dumps([[1,2,3]],indent=2))
[
1 2 3
]
Edit: also:
>>> print(edn_format.dumps([{"a": 42}], keyword_keys=True))
[{:a 42}]
>>> print(edn_format.dumps([{"a": 42}], keyword_keys=True, indent=2))
[
{
"a" 42
}
]
Looking into this I think it’d be a bit simpler if we first tidy up the way we pass around the dump config. Right now all these sort_keys
/sort_sets
/keyword_keys
/etc are passed around in kwargs
every time we call another function. It’s probably the time to stop using plain functions and use an EDNEncoder
object, mirroring json.JSONEncoder
:
# pseudo-code so you understand what I mean
class EDNEncoder(object):
def __init__(self, sort_keys=False, sort_sets=False):
self.sort_keys = sort_keys
self.sort_sets = sort_sets
def encode(self, obj):
if is_collection(obj):
# because all options are on the encoder we don't need to pass them
# as arguments
return '[{}]'.format(" ".join(self._encode_seq(obj)))
else:
return str(obj)
def _encode_seq(self, objs):
return [self.encode(o) for o in objs]
Then:
def dump(obj, **kwargs):
return EDNEncoder(**kwargs).encode(obj)
It would make it easier to add more internal attributes such as indent_step
.
@swaroopch Thoughts? If this is fine for you I can move forward and refactor the code using a class as above.
>>> print(edn_format.dumps([{"a": 42}], keyword_keys=True, indent=2)) [ { "a" 42 } ]
This one is correct, the key and value inside a map shouldn't have a linebreak.
In all other examples, yeah, they're wrong.
>>> print(edn_format.dumps([1,2,3],indent=2)) [ 1 2 3 ] >>> print(edn_format.dumps([[1,2,3]],indent=2)) [ 1 2 3 ]
BTW, this same bug should also happen with tuples since the logic is the same.
>>> print(edn_format.dumps([{"a": 42}], keyword_keys=True, indent=2)) [ { "a" 42 } ]
This one is correct, the key and value inside a map shouldn't have a linebreak.
In all other examples, yeah, they're wrong.
No I mean the map key is not keyword
ized. It works if we dump a map but not when it’s inside a collection.
No I mean the map key is not keywordized. It works if we dump a map but not when it’s inside a collection.
Ah ok, makes sense.
Well, the code kind works for my use case (I have mostly maps that seems to work well in the current iteration) and the algorithm is kind simple (just print the opening symbol at current indent level, indent all entries inside it and print closing symbol at previous current level), however what is making difficult to fix all those issues is this three functions that need coordination to do it (seq
, _udump_indent_collection
and udump
).
I will probably wait for @bfontaine refactor to try again. Or if someone else wants to try and fix it, be my guest.
Substitute PR: https://github.com/swaroopch/edn_format/pull/70
This adds a pretty printer similar to
json.dumps(indent=<int>)
. However, it does not follow Clojure formatting guidelines, instead formatting in a way more common to users from other languages like Python.So it will convert this:
To this:
Instead of this:
This should be already better than the current status quo (that is, no pretty printer at all).
Should fix issue https://github.com/swaroopch/edn_format/issues/39 (unless the author of the issue wanted a more Clojure-like pretty printer).
And sorry about the messy code, I didn't found a way to implement it cleaner.