ibmruntimes / yieldable-json

Asynchronous JSON parser and stringify APIs that make use of generator patterns

Big JSON tests #9

Closed hrdwdmrbl closed 5 years ago

hrdwdmrbl commented 5 years ago

So I wrote a couple of tests using very large JSON. I am looking for ways to deal with very large JSON, up to 2 GB in size. I thought this library might be perfect for our needs, but in my testing I'm running into trouble. Can you take a look at my test and see if I'm misusing the library?

  1. Make sure that we don't get event-loop starvation. I measure starvation in three ways: a) how many times a 0 ms setInterval fires, b) the longest gap between successive setInterval callbacks, and c) how long each tick takes (using blocked-at).
  2. Make sure that yieldable parsing doesn't take much longer (no more than 10x) than regular parsing.
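The starvation measurement described above can be sketched with Node's standard timers (the names here are illustrative, not the actual test code; blocked-at is replaced by a plain timestamp diff):

```javascript
// Sketch: a 0 ms interval acts as a probe. If a synchronous chunk of
// work blocks the event loop, the gap between probe callbacks grows.
let tickCount = 0;
let longestGap = 0;
let last = Date.now();

const probe = setInterval(() => {
  const now = Date.now();
  longestGap = Math.max(longestGap, now - last); // (b) longest gap
  last = now;
  tickCount++;                                   // (a) probe fire count
}, 0);

// Stand-in for a synchronous parse of a large payload.
const start = Date.now();
while (Date.now() - start < 50) { /* busy-wait ~50 ms */ }

setTimeout(() => {
  clearInterval(probe);
  console.log(`ticks=${tickCount} longestGap=${longestGap}ms`);
}, 100);
```

If the parser yields properly, `longestGap` stays small; a long synchronous stretch shows up directly as a large gap.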
hrdwdmrbl commented 5 years ago
test/test-parse-big.js ................................ 0/4 14s
  not ok Yieldable parse took 10x longer than regular parse 13341 vs 260
    --- wanted
    +++ found
    -true
    +false
    compare: ===
    at:
      line: 64
      column: 9
      file: test/test-parse-big.js
      function: yj.parseAsync
    stack: |
      yj.parseAsync (test/test-parse-big.js:64:9)
      Immediate.setImmediate (yieldable-parser.js:335:20)
    source: >
      tap.equal(yieldParseElapsedTime / 10 < nonYieldParseElapsedTime, true,
      `Yieldable parse took 10x longer than regular parse ${yieldParseElapsedTime}
      vs ${nonYieldParseElapsedTime}`);

  not ok Not enough yielding 1
    --- wanted
    +++ found
    -true
    +false
    compare: ===
    at:
      line: 65
      column: 9
      file: test/test-parse-big.js
      function: yj.parseAsync
    stack: |
      yj.parseAsync (test/test-parse-big.js:65:9)
      Immediate.setImmediate (yieldable-parser.js:335:20)
    source: |
      tap.equal(yieldCount > 10, true, `Not enough yielding ${yieldCount}`);

  not ok Loop starved too long 13110
    --- wanted
    +++ found
    -true
    +false
    compare: ===
    at:
      line: 66
      column: 9
      file: test/test-parse-big.js
      function: yj.parseAsync
    stack: |
      yj.parseAsync (test/test-parse-big.js:66:9)
      Immediate.setImmediate (yieldable-parser.js:335:20)
    source: >
      tap.equal(longestLoopStarvation < 100, true, `Loop starved too long
      ${longestLoopStarvation}`);

  not ok Tick took too long 299
    --- wanted
    +++ found
    -true
    +false
    compare: ===
    at:
      line: 69
      column: 13
      file: test/test-parse-big.js
      type: Timeout
      function: setTimeout
      method: _onTimeout
    stack: |
      Timeout.setTimeout (test/test-parse-big.js:69:13)
    source: >
      tap.equal(longestTickTime < 100, true, `Tick took too long
      ${longestTickTime}`);

total ................................................. 0/4

  0 passing (14s)
  4 failing
gireeshpunathil commented 5 years ago

@hrdwdmrbl - thanks for the PR. I will have a look!

gireeshpunathil commented 5 years ago

@hrdwdmrbl - I had a detailed look at the test case. While there is nothing wrong with the usage, it hits the boundary of the tool's capability in terms of how much data it parses in one event-loop cycle.

If you look at https://github.com/ibmruntimes/yieldable-json/blob/e19f4825e5bf5c17a2d73497d7d14f0774c42c52/yieldable-parser.js#L167-L170 , my logic is to run 512 * intensity iterations in one shot before yielding to the event loop. In your case, intensity is the default (1), so that is 512 units of work.
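The batching behavior described here can be illustrated with a generic sketch (this is not the library's actual code; `processChunked` and `WORK_UNIT` are made-up names):

```javascript
// Generic chunked-yield pattern: process WORK_UNIT * intensity items
// synchronously, then yield to the event loop via setImmediate.
function processChunked(items, intensity, onItem, done) {
  const WORK_UNIT = 512;                      // default batch size
  const batch = WORK_UNIT * intensity;
  let i = 0;
  (function next() {
    const end = Math.min(i + batch, items.length);
    for (; i < end; i++) onItem(items[i]);    // synchronous burst
    if (i < items.length) setImmediate(next); // yield, then resume
    else done();
  })();
}
```

With exactly 512 items and intensity 1, the inner loop finishes in a single burst and `setImmediate` is never scheduled, which matches the behavior observed in the failing test.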

Now if you look at your JSON layout, it is an object that has 2^2 (4) arrays, each containing 2^7 (128) elements, amounting to 512 units of work. So when this is parsed, yieldable-parser does not yield even once, as there are exactly 512 units of work!

If your test case is a true representative of a real workload that you have, I would recommend reducing the yield counter (the 512 constant) to a smaller value that suits you.

Conversely, if this test case turns out to represent the more common use case, I could reduce the default, and users could scale it back up via the intensity multiplier as needed.

My rationale for keeping 512 is that it caters to common workloads, so most people can work with the default configuration.

gireeshpunathil commented 5 years ago

Now, coming to this PR, I would recommend the following:

Hope this helps!

hrdwdmrbl commented 5 years ago

Yes, that does help. I see now that it was because of the way I built my big JSON; good to understand it at a deeper level. It was not a production-like blob, I just wanted something really big. I'll make changes and retry, thanks!

hrdwdmrbl commented 5 years ago

Okay, so I created a more balanced (and probably more realistic) big JSON, and it succeeded:

Object size 6186728112 bytes
Total time 3531ms

yieldCount 514 times
longestLoopStarvation 16ms
longestTickTime 23ms
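For reference, a balanced object of that kind can be generated along these lines (a hypothetical sketch; the actual generator used above is not shown in the thread):

```javascript
// Hypothetical generator: mixes nesting and arrays so the parser's work
// units are spread across many structures instead of one flat batch.
function makeBalanced(depth, breadth) {
  if (depth === 0) return 'leaf';                        // string leaf
  const obj = {};
  for (let i = 0; i < breadth; i++) {
    obj['k' + i] = (i % 2 === 0)
      ? makeBalanced(depth - 1, breadth)                 // nested object
      : Array.from({ length: breadth }, (_, j) => j);    // small array
  }
  return obj;
}

const big = makeBalanced(4, 8);
console.log(JSON.stringify(big).length, 'bytes of JSON text');
```

Increasing `depth` and `breadth` grows the payload while keeping work units spread out, so the parser crosses its batch boundary many times and yields frequently.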
gireeshpunathil commented 5 years ago

thanks for verifying and getting back, @hrdwdmrbl. Do you think improving the documentation around this specific aspect of the tool (intensity, time before returning to the loop) would be a good idea? I have it documented, but it looks like it is not readily consumable. Please let me know!

hrdwdmrbl commented 5 years ago

@gireeshpunathil The documentation explains the intensity option, but it doesn't explain how it works internally the way you just did.

It might make sense to instead expose an option that directly controls the number of iterations to perform: if the option is specified, the provided number is used instead of 512 * intensity.
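A sketch of how such an option could resolve the batch size (hypothetical: yieldable-json has no such option today, and `batchSize`/`resolveBatch` are invented names):

```javascript
// Hypothetical option resolution: an explicit iteration count, if given,
// overrides the 512 * intensity computation.
function resolveBatch(opts) {
  const WORK_UNIT = 512;
  if (opts && Number.isInteger(opts.batchSize) && opts.batchSize > 0) {
    return opts.batchSize;            // explicit iteration count wins
  }
  const intensity = (opts && opts.intensity) || 1;
  return WORK_UNIT * intensity;       // current default behavior
}
```

This keeps backward compatibility: callers who only set intensity see no change, while callers with unusual payload shapes can pin the batch directly.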

hrdwdmrbl commented 5 years ago

Ah, @gireeshpunathil, with real-world JSON I'm getting blockages of up to 87 seconds. :( I'm working on getting a copy of one of those real-world JSON files so that I can do further testing.

gireeshpunathil commented 5 years ago

@hrdwdmrbl - thanks. I look forward to reviewing the real-world JSON data.