mikefarah / yq

yq is a portable command-line YAML, JSON, XML, CSV, TOML and properties processor
https://mikefarah.gitbook.io/yq/
MIT License
12.3k stars 602 forks source link

Converting YAML to JSON is super slow #422

Closed JoakimSoderberg closed 4 years ago

JoakimSoderberg commented 4 years ago

Describe the bug When converting a 5000 line YAML to JSON using yq:

time docker run \                                                      
        --rm \
        -u $(id -ru):$(id -rg) \
        -v $(pwd):/shared \
        -w /shared \
        mikefarah/yq yq r --tojson swagger.json

It takes 2 minutes:

...
docker run --rm -u $(id -ru):$(id -rg) -v $(pwd):/shared -w /shared  yq r    0,07s user 0,03s system 0% cpu 2:18,42 total

Similarly, a 500 line file takes around 5s, which is in itself super slow.

Input Yaml

5000 lines of YAML, not pasting that here.

Command The command you ran:

yq r --tojson swagger.yml

Actual behavior

Super slow

Expected behavior

It should take under a second. That's what a simple Python program does it in.

Additional context Add any other context about the problem here.

PaulCharlton commented 4 years ago

+1

PaulCharlton commented 4 years ago

in my experience - it slowed down a lot some time since early January


> yq -V
yq version 3.2.1
PaulCharlton commented 4 years ago

same for

> yq -V
yq version 3.3.0

I am imagining an O(N^2) algorithm somewhere inappropriate

PaulCharlton commented 4 years ago

after downgrade to yq 2.4.1 --> zippy fast again (ie: 30s to parse yaml file in 3.x became sub second response time with 2.4.1)

this helped to downgrade -> https://stackoverflow.com/questions/53837861/brew-list-and-install-specific-versions-of-formula

mikefarah commented 4 years ago

Ah yeah, this will be when I moved to 3.0. As part of handling yaml anchors on conversion to JSON I made it iterate through the yaml structure. Let me play around to see if I can do that better...or even skip it

mikefarah commented 4 years ago

Ok what I can do is make it not handle merge anchors by default when converting to JSON - however if there are merge anchors in the yaml file, then you will need to pass in --explodeAnchors/-X with the -j flag. Otherwise, the command will fail (and suggest to you to retry with --explodeAnchors). Thoughts?

PaulCharlton commented 4 years ago

Ok what I can do is make it not handle merge anchors by default when converting to JSON - however if there are merge anchors in the yaml file, then you will need to pass in --explodeAnchors/-X with the -j flag. Otherwise, the command will fail (and suggest to you to retry with --explodeAnchors). Thoughts?

Hey Mike -- will the "--explodeAnchors' still be slow? if so, does not get my vote. I have written plenty of one-pass parsers, and what it has boiled down to in the past is using some memory to cache the anchors (macros) as they are found, then doing an efficient look-up (hashed by name) at the point of reference.

PaulCharlton commented 4 years ago

or, if by merge anchors you mean the special key <<, then sort by key name both the LHS and RHS of the merge and then do a zipper O(1) across the key lists?

mikefarah commented 4 years ago

Yep will still be slow, that sounds pretty neat, would def take a PR for it :)

PaulCharlton commented 4 years ago

@mikefarah I am loving to take a look and do a PR -- before beginning, is there some architectural change reason that 2.4.x works quickly and 3.x is slow?

mikefarah commented 4 years ago

Yeah so 2.4.x didn't handle anchors, it was because the underlying yaml parser did not expose them.

In 3.x I updated the underlying parser which let me access the merge anchor information.

Now when outputting to json - I needed to process the anchors into a regular object to I can encode to JSON correctly, this wasn't done in 2.4.x.

In 3.x, it is done here: https://github.com/mikefarah/yq/blob/master/cmd/utils.go#L184 before being passed to the encoder (https://github.com/mikefarah/yq/blob/master/pkg/yqlib/encoder.go#L61)

mikefarah commented 4 years ago

Further context, the way explodeAnchors works, (it's quite naive :)) is to deeply recurse all matching nodes and merge all the leaf nodes into a new empty data structure

dd-ssc commented 4 years ago

+1:

A yaml input file of mine is 827 lines long and makes heavy, multi-level use of anchors and aliases; exploding it into a 10294-line output file takes 10-12 minutes. In some sitations, I can speed up development by exploding it into a json file and loading that using jq (which takes <1s), but obviously, whenever the original yaml file changes, I am stalled. Only workaround for now is to comment out the aliases that have not changed.

I do understand that anchor / alias support was implemented in a naive way in order to support them at all - and @mikefarah: I do sincerely appreciate the great work 👍 - anchors and aliases enable me to do fantastic stuff with yaml - but whatever makes this faster would bring a significant benefit to my work.

BTW: I know nothing about Go, but from I've heard, it's really good at parallelizing tasks ("co-routines" and what not ?!?); looking at top while the exploding is running shows yq using ~180% CPU (e.g. CPU usage: 25% user, 5% sys, 70% idle) on a mid-2015 MacBook Pro (Intel Core i7 Quad Core); wouldn't this ideally max out my CPU ? (at - I don't know - 400% / 95% user or so ?!?)

@PaulCharlton: Is there anything I can do to support you with the PR you mention ? I'm able and more than happy to test dev versions and report benchmarks - somewhat self-serving, would get me a chance to get GOing (jeez, my puns have seen better days...)

dd-ssc commented 4 years ago

Update: My yaml input file keeps growing (e.g. 958 lines, including comments) and making smarter re-use of multi-level sections used at multiple locations (e.g. 63972 lines in the json output file). YAY to yaml anchors and aliases! 🎉

However, the time yq requires to explode the yaml file and/or convert it to json is getting prohibitive; from my log output: Time to load yaml file: 1h 19m 53s.

In order to keep making progress in my project, I will have to stop using yq for the time being and work around this issue by converting the yaml input file to a json output file up front; takes <1s with a fairly simple Python script.

mikefarah commented 4 years ago

Alright I think I have a good way forward with this - working on it, should be available within a few days if I get enough time. Does anyone have a sample large doc to test it with? I've made my own but not sure how representative it is...

dd-ssc commented 4 years ago

I re-ran the file I mentioned earlier, takes less than a second. That's one heck of a speed-up! @mikefarah : Great work mate! 👍

mikefarah commented 4 years ago

Fantastic, glad to hear that :)

Thanks for the feedback