oven-sh / bun

Incredibly fast JavaScript runtime, bundler, test runner, and package manager – all in one
https://bun.sh

String and JSON size limits #2570

Open pzmarzly opened 1 year ago

pzmarzly commented 1 year ago

What is the problem this feature would solve?

Both Node.js and Deno have an explicit string length limit of 512 MiB (0x1fffffe8, i.e. 536,870,888 characters, as the errors below show). Trying to load a bigger string leads to an error:

$ node load.mjs
node:buffer:787
    return this.utf8Slice(0, this.length);
                ^

Error: Cannot create a string longer than 0x1fffffe8 characters
    at Buffer.toString (node:buffer:787:17)
    at JSON.parse (<anonymous>)
    at file:///Users/pzmarzly/tmp/big-json/load.mjs:3:17
    at ModuleJob.run (node:internal/modules/esm/module_job:194:25) {
  code: 'ERR_STRING_TOO_LONG'
}

Node.js v19.3.0

$ deno run load.mjs
┌ ⚠️  Deno requests read access to "big.json".
✅ Granted read access to "big.json".
error: Uncaught TypeError: buffer exceeds maximum length
let data = JSON.parse(readFileSync("big.json"));
                ^
    at TextDecoder.decode (ext:deno_web/08_text_encoding.js:135:22)
    at _utf8Slice (ext:deno_node/internal/buffer.mjs:835:18)
    at Uint8Array.utf8Slice (ext:deno_node/internal/buffer.mjs:729:10)
    at Uint8Array.toString (ext:deno_node/internal/buffer.mjs:418:17)
    at JSON.parse (<anonymous>)
    at file:///Users/pzmarzly/tmp/big-json/load.mjs:3:17

This limitation comes from V8 internals. It makes it annoying to use JS for data analysis, where you often want to load a big JSON file briefly and reduce it to a sanely sized value. I then reach for Python, which doesn't have a similar limit, or use libraries like big-json, which sadly are slow and inconvenient to use.
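For reference, the big-json route looks roughly like this; the API shape is taken from big-json's README, so treat it as an assumption rather than something verified in this issue:

```js
// Sketch: stream-parse big.json with big-json instead of building one giant
// string first. API per big-json's README; verify against the installed version.
import { createReadStream } from "node:fs";
import json from "big-json"; // CommonJS package, default-imported under Node ESM

const parseStream = json.createParseStream();

// big-json emits the fully assembled object in a single "data" event.
parseStream.on("data", (pojo) => {
  // Reduce the huge object to a sanely sized value right away.
  console.log(Array.isArray(pojo) ? pojo.length : Object.keys(pojo).length);
});

createReadStream("big.json").pipe(parseStream);
```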

From what I've noticed, bun doesn't have any limit for readFile{,Sync} (it can load multi-GB strings just fine), but its JSON.parse silently truncates input at around 2 GiB.

$ bun load-2steps.mjs
...
3 | let text = readFileSync("big.json");
4 | console.log(text.length); // logs correct size
5 | let data = JSON.parse(text);
               ^
SyntaxError: JSON Parse error: Unexpected EOF

$ # But the file is a valid JSON:
$ cat big.json | jq . > /dev/null
$ echo $?
0
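For a reproduction, big.json has to be generated in chunks, since building it as a single string would hit the same limits. A minimal sketch (the script name, record shape, and ~2.5 GiB target are arbitrary choices for illustration, not from the issue):

```js
// make-big-json.mjs (hypothetical name): stream out a JSON array of >2 GiB
// so the generator itself never holds a multi-GiB string in memory.
import { createWriteStream } from "node:fs";

const out = createWriteStream("big.json");
const row = JSON.stringify({ a: 123, b: "x".repeat(1024) });
const target = 2.5 * 1024 ** 3; // past the ~2 GiB point where truncation was seen

out.write("[" + row);
let written = row.length + 1;
while (written < target) {
  const chunk = "," + row;
  written += chunk.length;
  // Respect backpressure so memory use stays flat while writing.
  if (!out.write(chunk)) {
    await new Promise((resolve) => out.once("drain", resolve));
  }
}
out.write("]");
out.end();
```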

What is the feature you are proposing to solve the problem?

What alternatives have you considered?

No response

paperdave commented 1 year ago

I don't remember if this was Bun or not, but I swear I hit a string length limit at 6 GB. Could be wrong. But good catch on JSON.parse.

Jarred-Sumner commented 1 year ago

The problem is here:

https://github.com/oven-sh/WebKit/blob/main/Source/JavaScriptCore/runtime/LiteralParser.h#L98

length is an unsigned 32-bit integer, which tops out at 4,294,967,295 bytes. JavaScriptCore (the engine) stores string lengths as unsigned 32-bit integers.

readFileSync returns a Buffer, which is a Uint8Array that stores its length as an unsigned 64-bit integer, enough to address more memory than any computer has today.

The current way this works is very inefficient. JSON.parse(obj) internally calls obj.toString("utf-8"). That converts the Uint8Array to a latin1 or UTF-16 string (which clones the entire array) and then attempts to run the JSON parser on it.
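In other words (a minimal illustration of the coercion described above, not Bun-specific code):

```js
// JSON.parse is specified to call ToString on its argument, so passing a
// Buffer first materializes a full string copy of the bytes:
const buf = Buffer.from('{"a": 1}');

const a = JSON.parse(buf);            // implicit: coerces buf to a string
const b = JSON.parse(buf.toString()); // explicit: toString() clones every byte
console.log(a.a === b.a); // true; both paths pay for the extra copy
```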

Instead, we should implement a direct UTF-8 -> JS version of JSON.parse that skips allocating the temporary string. This would also enable larger JSON sizes. @lemire's simdjson would be the perfect tool for this, and if we did this on the Bun object (like Bun.JSON), we could also have JSON.parseMany or possibly JSONStream with support for ndjson or JSON Lines.
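As a stopgap that works today, newline-delimited JSON can be consumed one record at a time, so no single string ever approaches the engine limit. A minimal sketch, assuming a big.ndjson file with one JSON document per line (the filename is hypothetical):

```js
import { createReadStream } from "node:fs";
import { createInterface } from "node:readline";

// Stream the file line by line; each line is parsed as its own small string.
const lines = createInterface({ input: createReadStream("big.ndjson") });

let count = 0;
for await (const line of lines) {
  if (line.length === 0) continue; // skip blank lines
  const record = JSON.parse(line); // small, independent parse per record
  count += 1;
}
console.log(`parsed ${count} records`);
```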

ThatOneBro commented 1 year ago

Yeah JSONStream sounds good. I was just thinking about that actually. +1 to that

Jarred-Sumner commented 1 year ago

Looks like using SIMDJSON is faster when the input is all primitives, but the cost of creating identifiers and converting strings means it's slower than native JSON.parse for objects with keys or strings longer than one character.

I tried both the on-demand implementation and the non-on-demand implementation. This uses the same atom string cache optimization that JSON.parse uses.

benchmark time (avg) (min … max) p75 p99 p995
------------------------------------------------------------------------------ -----------------------------
• small object
------------------------------------------------------------------------------ -----------------------------
JSON.parse 656.38 ns/iter (621.56 ns … 788.83 ns) 665.85 ns 788.83 ns 788.83 ns
JSON.parse (SIMDJSON on-demand buffer) 743.61 ns/iter (720.6 ns … 833.83 ns) 745.12 ns 833.83 ns 833.83 ns

summary for small object
JSON.parse
1.13x faster than JSON.parse (SIMDJSON on-demand buffer)

• Array(4096) of true
------------------------------------------------------------------------------ -----------------------------
JSON.parse 45.42 µs/iter (42.79 µs … 1.91 ms) 44.79 µs 52.71 µs 56.54 µs
JSON.parse (SIMDJSON on-demand buffer) 38.65 µs/iter (35.33 µs … 1.44 ms) 38.58 µs 45.38 µs 50.17 µs

summary for Array(4096) of true
JSON.parse (SIMDJSON on-demand buffer)
1.18x faster than JSON.parse

• Array(4096) of 1234.567
------------------------------------------------------------------------------ -----------------------------
JSON.parse 100.79 µs/iter (96.42 µs … 962.79 µs) 100.08 µs 111.5 µs 115.38 µs
JSON.parse (SIMDJSON on-demand buffer) 62.12 µs/iter (58.13 µs … 751.96 µs) 62.75 µs 71.21 µs 75.96 µs

summary for Array(4096) of 1234.567
JSON.parse (SIMDJSON on-demand buffer)
1.62x faster than JSON.parse

• Array(4096) of 'hello'
------------------------------------------------------------------------------ -----------------------------
JSON.parse 142.44 µs/iter (132.75 µs … 1.38 ms) 141.33 µs 159.42 µs 169.54 µs
JSON.parse (SIMDJSON on-demand buffer) 196.67 µs/iter (130.54 µs … 1.9 ms) 203.5 µs 234.5 µs 407.46 µs

summary for Array(4096) of 'hello'
JSON.parse
1.38x faster than JSON.parse (SIMDJSON on-demand buffer)

• Array(4096) of 'hello'.repeat(1024)
------------------------------------------------------------------------------ -----------------------------
JSON.parse 9.8 ms/iter (9.07 ms … 11.26 ms) 10.19 ms 11.26 ms 11.26 ms
JSON.parse (SIMDJSON on-demand buffer) 6.39 ms/iter (5.9 ms … 9 ms) 6.74 ms 9 ms 9 ms

summary for Array(4096) of 'hello'.repeat(1024)
JSON.parse (SIMDJSON on-demand buffer)
1.53x faster than JSON.parse

• Array(4096) of {a: 123, b: 456}
------------------------------------------------------------------------------ -----------------------------
JSON.parse 310.68 µs/iter (297.96 µs … 1.14 ms) 308.25 µs 386.33 µs 752.25 µs
JSON.parse (SIMDJSON on-demand buffer) 413.16 µs/iter (398.67 µs … 1.13 ms) 411.88 µs 474.38 µs 717.29 µs

summary for Array(4096) of {a: 123, b: 456}
JSON.parse
1.33x faster than JSON.parse (SIMDJSON on-demand buffer)
Benchmark:

```js
import { bench, group, run } from "mitata";

function load(obj) {
  const asStr = JSON.stringify(obj);
  const buffer = Buffer.from(asStr);

  bench("JSON.parse", () => {
    return JSON.parse(asStr);
  });

  bench("JSON.parse (SIMDJSON on-demand buffer)", () => {
    return buffer.json();
  });
}

group("small object", () => {
  var obj = {
    a: 1,
    b: 2,
    c: null,
    false: false,
    true: true,
    null: null,
    foo: "bar",
    arr: [1, 2, 3],
    h: {
      a: 1,
    },
    i: {
      a: 1,
    },
    j: {},
    // 100 more keys
    k: {},
  };
  load(obj);
});

group("Array(4096) of true", () => {
  var obj = Array(4096);
  obj.length = 4096;
  obj.fill(true);
  load(obj);
});

group("Array(4096) of 1234.567", () => {
  var obj = Array(4096);
  obj.length = 4096;
  obj.fill(1234.567);
  load(obj);
});

group("Array(4096) of 'hello'", () => {
  var obj = Array(4096);
  obj.length = 4096;
  obj.fill("hello");
  load(obj);
});

group("Array(4096) of 'hello'.repeat(1024)", () => {
  var obj = Array(4096);
  obj.length = 4096;
  obj.fill("hello".repeat(1024));
  load(obj);
});

group("Array(4096) of {a: 123, b: 456}", () => {
  var obj = Array(4096);
  obj.length = 4096;
  obj.fill({ a: 123, b: 456 });
  load(obj);
});

run();
```

Code: https://github.com/oven-sh/bun/commit/84a9fac3158c0b90151c4a154c94d89ac90aa11c

lemire commented 1 year ago

Ping me if you think I can help.

valstu commented 3 months ago

This would be a great feature; manipulating big files (especially JSON) with Node.js can get quite tricky.