Closed JoakimSoderberg closed 4 years ago
+1
in my experience - it slowed down a lot some time since early January
> yq -V
yq version 3.2.1
same for
> yq -V
yq version 3.3.0
I am imagining an O(N^2) algorithm somewhere inappropriate
after downgrade to yq 2.4.1 --> zippy fast again (ie: 30s to parse yaml file in 3.x became sub second response time with 2.4.1)
this helped to downgrade -> https://stackoverflow.com/questions/53837861/brew-list-and-install-specific-versions-of-formula
Ah yeah, this will be when I moved to 3.0. As part of handling yaml anchors on conversion to JSON I made it iterate through the yaml structure. Let me play around to see if I can do that better...or even skip it
Ok what I can do is make it not handle merge anchors by default when converting to JSON - however if there are merge anchors in the yaml file, then you will need to pass in --explodeAnchors/-X
with the -j
flag. Otherwise, the command will fail (and suggest to you to retry with --explodeAnchors
). Thoughts?
Ok what I can do is make it not handle merge anchors by default when converting to JSON - however if there are merge anchors in the yaml file, then you will need to pass in
--explodeAnchors/-X
with the-j
flag. Otherwise, the command will fail (and suggest to you to retry with--explodeAnchors
). Thoughts?
Hey Mike -- will the "--explodeAnchors' still be slow? if so, does not get my vote. I have written plenty of one-pass parsers, and what it has boiled down to in the past is using some memory to cache the anchors (macros) as they are found, then doing an efficient look-up (hashed by name) at the point of reference.
or, if by merge anchors
you mean the special key <<
, then sort by key name both the LHS and RHS of the merge and then do a zipper O(1) across the key lists?
Yep will still be slow, that sounds pretty neat, would def take a PR for it :)
@mikefarah I am loving to take a look and do a PR -- before beginning, is there some architectural change reason that 2.4.x works quickly and 3.x is slow?
Yeah so 2.4.x didn't handle anchors, it was because the underlying yaml parser did not expose them.
In 3.x I updated the underlying parser which let me access the merge anchor information.
Now when outputting to json - I needed to process the anchors into a regular object to I can encode to JSON correctly, this wasn't done in 2.4.x.
In 3.x, it is done here: https://github.com/mikefarah/yq/blob/master/cmd/utils.go#L184 before being passed to the encoder (https://github.com/mikefarah/yq/blob/master/pkg/yqlib/encoder.go#L61)
Further context, the way explodeAnchors works, (it's quite naive :)) is to deeply recurse all matching nodes and merge all the leaf nodes into a new empty data structure
+1:
A yaml input file of mine is 827 lines long and makes heavy, multi-level use of anchors and aliases; exploding it into a 10294-line output file takes 10-12 minutes. In some sitations, I can speed up development by exploding it into a json file and loading that using jq
(which takes <1s), but obviously, whenever the original yaml file changes, I am stalled. Only workaround for now is to comment out the aliases that have not changed.
I do understand that anchor / alias support was implemented in a naive way in order to support them at all - and @mikefarah: I do sincerely appreciate the great work 👍 - anchors and aliases enable me to do fantastic stuff with yaml - but whatever makes this faster would bring a significant benefit to my work.
BTW: I know nothing about Go, but from I've heard, it's really good at parallelizing tasks ("co-routines" and what not ?!?); looking at top
while the exploding is running shows yq
using ~180% CPU (e.g. CPU usage: 25% user, 5% sys, 70% idle
) on a mid-2015 MacBook Pro (Intel Core i7 Quad Core); wouldn't this ideally max out my CPU ? (at - I don't know - 400% / 95% user
or so ?!?)
@PaulCharlton: Is there anything I can do to support you with the PR you mention ? I'm able and more than happy to test dev versions and report benchmarks - somewhat self-serving, would get me a chance to get GOing (jeez, my puns have seen better days...)
Update: My yaml input file keeps growing (e.g. 958 lines, including comments) and making smarter re-use of multi-level sections used at multiple locations (e.g. 63972 lines in the json output file). YAY to yaml anchors and aliases! 🎉
However, the time yq requires to explode the yaml file and/or convert it to json is getting prohibitive; from my log output: Time to load yaml file: 1h 19m 53s
.
In order to keep making progress in my project, I will have to stop using yq for the time being and work around this issue by converting the yaml input file to a json output file up front; takes <1s with a fairly simple Python script.
Alright I think I have a good way forward with this - working on it, should be available within a few days if I get enough time. Does anyone have a sample large doc to test it with? I've made my own but not sure how representative it is...
I re-ran the file I mentioned earlier, takes less than a second. That's one heck of a speed-up! @mikefarah : Great work mate! 👍
Fantastic, glad to hear that :)
Thanks for the feedback
Describe the bug When converting a 5000 line YAML to JSON using
yq
:It takes 2 minutes:
Similarly, a 500 line file takes around 5s, which is in itself super slow.
Input Yaml
5000 lines of YAML, not pasting that here.
Command The command you ran:
Actual behavior
Super slow
Expected behavior
It should take under a second. That's what a simple Python program does it in.
Additional context Add any other context about the problem here.