pola-rs / nodejs-polars

nodejs front-end of polars
https://pola-rs.github.io/nodejs-polars/
MIT License

pl.readJSON Fails on JSON with Newline Characters Despite "lines" Format Setting #260

Status: Closed (denisgermano closed this issue 2 months ago)

denisgermano commented 2 months ago

Have you tried latest version of polars?

What version of polars are you using?

nodejs-polars@0.15.0

What operating system are you using polars on?

macOS 14.6.1 (M2 Max)

What node version are you using?

Node v22.7.0

Describe your bug.

pl.readJSON fails to load NDJSON data when any JSON string value contains a newline character (\n). The failure occurs even with the format option set to "lines", as the documentation describes.

What are the steps to reproduce the behavior?

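const pl = require("nodejs-polars");

// NDJSON input. The \n in the second record sits inside a template literal,
// so JavaScript turns it into a real line feed before polars sees the text.
const jsonData = `
{"id":"2489651051","type":"PushEvent"}
{"id":"2489651045","type":"Create\nEvent"}
{"id":"2489651053","type":"PushEvent"}
`;
const df2 = pl.readJSON(jsonData, { format: "lines" });
console.log("FROM READ", df2);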
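
A minimal sketch for inspecting what the reproduction string actually contains, with no polars call involved. It relies only on standard JavaScript: a template literal processes \n as an escape sequence, while String.raw leaves it as a backslash followed by "n". The names cooked, raw, and summarize are hypothetical and exist only for this illustration. The sketch does not establish which property of the input triggers "Syntax at character 0"; it only shows that the input starts with an empty line and that one record is split across two physical lines.

```javascript
// Same text as the reproduction: inside a template literal, "\n" becomes a real line feed.
const cooked = `
{"id":"2489651051","type":"PushEvent"}
{"id":"2489651045","type":"Create\nEvent"}
{"id":"2489651053","type":"PushEvent"}
`;

// String.raw keeps the backslash and the "n" as two separate characters.
const raw = String.raw`
{"id":"2489651051","type":"PushEvent"}
{"id":"2489651045","type":"Create\nEvent"}
{"id":"2489651053","type":"PushEvent"}
`;

// Count the non-empty physical lines and how many of them are valid JSON on their own.
function summarize(text) {
  const lines = text.split("\n").filter((line) => line.length > 0);
  const valid = lines.filter((line) => {
    try {
      JSON.parse(line);
      return true;
    } catch {
      return false;
    }
  });
  return {
    lines: lines.length,
    valid: valid.length,
    startsWithNewline: text.startsWith("\n"),
  };
}

console.log(summarize(cooked)); // { lines: 4, valid: 2, startsWithNewline: true }
console.log(summarize(raw));    // { lines: 3, valid: 3, startsWithNewline: true }
```

With the template literal, the second record becomes two physical lines, neither of which is valid JSON by itself, because JSON does not allow an unescaped line feed inside a string.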

What is the actual behavior?

A syntax error is raised while parsing the NDJSON:

/Users/denis.germano/node_modules/nodejs-polars/bin/io.js:137
            return (0, dataframe_1._DataFrame)(method(Buffer.from(pathOrBody, "utf-8"), options));
                                               ^

Error: Syntax at character 0
    at Object.readJSON (/Users/denis.germano/node_modules/nodejs-polars/bin/io.js:137:48)
    at Object.<anonymous> (/Users/denis.germano/Downloads/example_polars/poc-wip.js:19:14)
    at Module._compile (node:internal/modules/cjs/loader:1546:14)
    at Module._extensions..js (node:internal/modules/cjs/loader:1691:10)
    at Module.load (node:internal/modules/cjs/loader:1317:32)
    at Module._load (node:internal/modules/cjs/loader:1127:12)
    at TracingChannel.traceSync (node:diagnostics_channel:315:14)
    at wrapModuleLoad (node:internal/modules/cjs/loader:217:24)
    at Function.executeUserEntryPoint [as runMain] (node:internal/modules/run_main:166:5)
    at node:internal/main/run_main_module:30:49 {
  code: 'GenericFailure'
}

Node.js v22.7.0

What is the expected behavior?

The data should parse correctly, as it does when read from a stream:

const pl = require("nodejs-polars");
const Stream = require('stream');

// Readable with a no-op read(); the records are pushed manually below.
const readStream = new Stream.Readable({ read() { } });
// JSON.stringify escapes the line feed inside "type", so each record stays on one line.
readStream.push(`${JSON.stringify({ "id": "2489651051", "type": "PushEvent" })} \n`);
readStream.push(`${JSON.stringify({ "id": "2489651045", "type": "Create\nEvent" })} \n`);
readStream.push(`${JSON.stringify({ "id": "2489651053", "type": "PushEvent" })} \n`);
readStream.push(null); // signal the end of the stream

pl.readJSONStream(readStream, { format: "lines" }).then(
    df1 => console.log("FROM STREAM", df1)
);

Results

FROM STREAM shape: (3, 2)
┌────────────┬───────────┐
│ id         ┆ type      │
│ ---        ┆ ---       │
│ str        ┆ str       │
╞════════════╪═══════════╡
│ 2489651051 ┆ PushEvent │
│ 2489651045 ┆ Create    │
│            ┆ Event     │
│ 2489651053 ┆ PushEvent │
└────────────┴───────────┘

What do you think polars should have done?

Escape the inner \n.
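
Until the parser behavior is settled, one possible caller-side approach is to build the NDJSON text with JSON.stringify, as the stream example above does, so that inner line feeds are already escaped before the text reaches the reader. The helper toNdjson below is hypothetical and not part of nodejs-polars. Passing its output to pl.readJSON with format "lines" is an untested assumption, so that call is left as a comment.

```javascript
// Hypothetical helper (not a nodejs-polars API): serialize one record per line.
// JSON.stringify escapes control characters, so a line feed inside a value
// is written as the two characters "\" and "n" and never splits a record.
function toNdjson(records) {
  return records.map((record) => JSON.stringify(record)).join("\n") + "\n";
}

const records = [
  { id: "2489651051", type: "PushEvent" },
  { id: "2489651045", type: "Create\nEvent" },
  { id: "2489651053", type: "PushEvent" },
];

const ndjson = toNdjson(records);
const physicalLines = ndjson.trimEnd().split("\n");

console.log(physicalLines.length); // 3
// The original value, including its line feed, survives a round trip.
console.log(JSON.parse(physicalLines[1]).type === "Create\nEvent"); // true

// Assumed usage, not verified here:
// const pl = require("nodejs-polars");
// const df = pl.readJSON(ndjson, { format: "lines" });
```

The design choice is to escape at serialization time rather than patching the text afterwards, since a blanket replacement of line feeds would also remove the record separators.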

Bidek56 commented 2 months ago

This is an issue with the core Rust engine. I get the same error in py-polars:

import polars as pl
from io import StringIO
json_str = '[{"foo":"foo\nfoo","bar":6},{"foo":2,"bar":7},{"foo":3,"bar":"8\nfoo"}]'

pl.read_json(StringIO(json_str))
    pydf = PyDataFrame.read_json(
           ^^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.ComputeError: Syntax at character 0

Please raise this issue with the core team and close this ticket. I wish I could transfer this ticket to the core team, but I do not have permission. Thanks.

denisgermano commented 2 months ago

Thanks, @Bidek56. Issue on the core Rust polars repository: https://github.com/pola-rs/polars/issues/18535