vincev / dply-rs

A data manipulation tool for parquet, csv, and json data.
Apache License 2.0

Error: Unknown function: json #47

Closed fdncred closed 1 year ago

fdncred commented 1 year ago

I was trying to follow along with your readme and process some json data but I can't seem to get it to work. Am I doing something wrong?

> dply -c 'json("./buildtimes.json") | show()'
Error: Unknonw function: json

Also, when I go in the repl, I don't see a json( function like the parquet( function.

vincev commented 1 year ago

Hi @fdncred, I haven't released the version with the json function yet. It should work if you cargo install from source.

fdncred commented 1 year ago

ah, ok. i'll try that. thanks! i'm trying to get something kind of working with nushell. we'll see how it goes.

vincev commented 1 year ago

okay let me know how it goes, I am going to release 0.2.0 to crates.io.

fdncred commented 1 year ago

Kind of striking out. Test 1 - Didn't really think this would work because it would have to evaluate nushell's open in the json() function.

dply -c 'json(open ~\.local\share\nushell\startup-times.nuon | where build == release) | select(commit) | show()'

Test 2 - expected it to work but maybe i'm doing something wrong

dply -c 'json("buildtimes.json") | select(commit) | show()'
Error: Arrow error: Json error: Not valid JSON: EOF while parsing a list at line 1 column 1

Caused by:
    Json error: Not valid JSON: EOF while parsing a list at line 1 column 1

I had to rename the file to .txt to get GitHub to allow the upload: buildtimes.txt

This worked as parquet but I can't get perf to be a duration. Not sure how that works exactly.

❯ dply -c 'parquet("buildtimes.parquet") |
❯❯❯ group_by(commit) |
❯❯❯ summarize(min_date = min(date),
❯❯❯ max_date = max(date),
❯❯❯ cmt_count = n(),
❯❯❯ perf_ns = mean(time)) |
❯❯❯ arrange(min_date, max_date) |
❯❯❯ show()'
shape: (29, 5) elapsed: 0.009s
┌──────────────────────────────────────────┬───────────────────────────────┬───────────────────────────────┬────────────────┬──────────────────┐
│ commit                                   ┆ min_date                      ┆ max_date                      ┆ cmt_count      ┆ perf_ns          │
│ ---                                      ┆ ---                           ┆ ---                           ┆ ---            ┆ ---              │
│ str                                      ┆ datetime[ns]                  ┆ datetime[ns]                  ┆ i64            ┆ f64              │
╞══════════════════════════════════════════╪═══════════════════════════════╪═══════════════════════════════╪════════════════╪══════════════════╡
│ 2bb0c1c618f961843b49432fb7a21304b41493af ┆ 2023-07-03T17:03:08.605833100 ┆ 2023-07-03T17:45:46.859419300 ┆ 2              ┆ 120679900.0      │
│ 406b606398bf18c98063fbe998a4d27f75067eef ┆ 2023-07-05T12:41:06.186069800 ┆ 2023-07-07T13:31:08.454208    ┆ 40             ┆ 129605145.0      │
│ 8e38596bc9494357f01f166076e8d563f28016f3 ┆ 2023-07-07T13:35:20.692982100 ┆ 2023-07-10T15:42:42.746287500 ┆ 13             ┆ 132144692.307692 │
│ cf36f052c46b6efe57500e3acb7f52d2d0cb8d2e ┆ 2023-07-12T15:10:55.382326800 ┆ 2023-07-12T19:23:46.223713400 ┆ 2              ┆ 184785800.0      │
│ b2043135ed956ead0d3b5d5df49ea9d929dc7120 ┆ 2023-07-12T20:56:26.198039100 ┆ 2023-07-13T20:14:08.484523    ┆ 2              ┆ 136068100.0      │
│ 4804e6a151ca0f212c3f4b097b4d805a69535149 ┆ 2023-07-14T16:39:30.111034    ┆ 2023-07-14T20:25:04.629634800 ┆ 7              ┆ 144062385.714286 │
│ 48271d8c3e1f83723f005ae1809ebd5026783f8a ┆ 2023-07-17T13:09:13.074710300 ┆ 2023-07-17T19:20:06.502104    ┆ 3              ┆ 149020600.0      │
│ a5a79a7d95822bc143090612e1813f3b06befbf4 ┆ 2023-07-18T16:44:35.140979400 ┆ 2023-07-20T16:01:20.985936100 ┆ 10             ┆ 159522440.0      │
│ 9db0d6bd34a99805c6da296688aa186778be5a86 ┆ 2023-07-24T12:57:55.020534400 ┆ 2023-07-24T13:15:05.868456200 ┆ 3              ┆ 138853133.333333 │
│ 208071916209af5a4159b131e438aa6cab524532 ┆ 2023-07-25T12:22:59.358805    ┆ 2023-07-25T17:46:46.435686600 ┆ 5              ┆ 231826620.0      │
│ a33b5fe6ce97b5e9fe8a774c13e783ed65c1b591 ┆ 2023-07-25T20:39:13.457784500 ┆ 2023-07-27T14:53:35.224492100 ┆ 7              ┆ 200283471.428571 │
│ f8d325dbfef5fec7ee109e37c624236998de8843 ┆ 2023-07-27T15:05:15.536044300 ┆ 2023-07-27T15:26:56.575707    ┆ 4              ┆ 114143575.0      │
│ 6aa30132aae188639a78ba8fd7feddc952d5792e ┆ 2023-07-27T15:56:29.689214800 ┆ 2023-07-27T18:06:09.548466600 ┆ 8              ┆ 119858087.5      │
│ 8403fff34500d30439545519c88c7d942c717e3e ┆ 2023-07-27T20:02:18.214354    ┆ 2023-07-28T14:01:34.190749100 ┆ 3              ┆ 131968033.333333 │
│ 94bec720791f716b44cb23db363f53e2fa7acce3 ┆ 2023-07-31T13:00:49.600253500 ┆ 2023-08-01T13:41:42.960650600 ┆ 21             ┆ 123044576.190476 │
│ f6033ac5af75073dddce2400304448dbbadd0318 ┆ 2023-08-01T14:35:39.008614900 ┆ 2023-08-01T18:00:31.016108300 ┆ 2              ┆ 155370400.0      │
│ 778a00efa10735e7eb368aea1ddfeb6af3d3720a ┆ 2023-08-01T20:38:56.294492600 ┆ 2023-08-02T15:11:37.531105600 ┆ 4              ┆ 152309825.0      │
│ ec4941c8ac45f94ab408753b173ce991ce0fafd3 ┆ 2023-08-02T16:05:31.403641700 ┆ 2023-08-02T16:05:31.403641700 ┆ 1              ┆ 121661700.0      │

vincev commented 1 year ago

I see, the json function expects ndjson (newline-delimited JSON, one object per line), which I need to document. You can convert your file using jq:

cat buildtimes.json | jq -c '.[]' > buildtimesnd.json
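
For anyone without jq, the same conversion can be sketched in Python. This is a minimal sketch, assuming the input file holds a single top-level JSON array, as buildtimes.json appears to; the function name `json_array_to_ndjson` is hypothetical and is not part of dply.

```python
import json


def json_array_to_ndjson(src_path, dst_path):
    """Write each element of a top-level JSON array as one compact line.

    Hypothetical helper for illustration; returns the number of records written.
    """
    with open(src_path, encoding="utf-8") as src:
        records = json.load(src)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")
    with open(dst_path, "w", encoding="utf-8") as dst:
        for record in records:
            # Compact separators keep each record on a single line,
            # similar to what jq -c produces.
            line = json.dumps(record, separators=(",", ":"), ensure_ascii=False)
            dst.write(line + "\n")
    return len(records)
```

Calling `json_array_to_ndjson("buildtimes.json", "buildtimesnd.json")` should produce a file that a line-oriented JSON reader can consume one record at a time.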

I got this result using the converted file buildtimesnd.txt:

〉json("./buildtimesnd.json") | glimpse()
::: 
┌────────────┬────────┬──────────────────────────────────────────────┐
│ Rows: 179  ┆ Type   ┆ Values                                       │
│ Cols: 7    ┆        ┆                                              │
╞════════════╪════════╪══════════════════════════════════════════════╡
│ allocator  ┆ str    ┆ mimalloc, mimalloc, mimalloc, mimalloc,...   │
│ build      ┆ str    ┆ release, release, release, release, relea... │
│ build_time ┆ str    ┆ 2023-07-03 10:40:42 -05:00, 2023-07-03...    │
│ commit     ┆ str    ┆ 2bb0c1c618f961843b49432fb7a21304b41493af,... │
│ date       ┆ str    ┆ 2023-07-03 12:03:08.605833100 -05:00,...     │
│ time       ┆ i64    ┆ 132610400, 108749400, 169129700, 13610890... │
│ version    ┆ str    ┆ 0.82.1, 0.82.1, 0.82.1, 0.82.1, 0.82.1,...   │
└────────────┴────────┴──────────────────────────────────────────────┘

fdncred commented 1 year ago

I get different results: [image]

fdncred commented 1 year ago

apparently, the filename has to be named with the .json extension, not .txt or .jsonl or .ndjson or any other extension.

❯ dply -c 'json("buildtimesnd.json") | glimpse()'
┌────────────┬────────┬───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ Rows: 179  ┆ Type   ┆ Values                                                                                                                                            │
│ Cols: 7    ┆        ┆                                                                                                                                                   │
╞════════════╪════════╪═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
│ allocator  ┆ str    ┆ mimalloc, mimalloc, mimalloc, mimalloc, mimalloc, mimalloc, mimalloc, mimalloc, mimalloc, mimalloc, mimalloc, mimalloc, mimalloc, mimalloc,...    │
│ build      ┆ str    ┆ release, release, release, release, release, release, release, release, release, release, release, release, release, release, release, release... │
│ build_time ┆ str    ┆ 2023-07-03 10:40:42 -05:00, 2023-07-03 10:40:42 -05:00, 2023-07-05 07:36:05 -05:00, 2023-07-05 07:36:05 -05:00, 2023-07-05 07:36:05 -05:00,...    │
│ commit     ┆ str    ┆ 2bb0c1c618f961843b49432fb7a21304b41493af, 2bb0c1c618f961843b49432fb7a21304b41493af, 406b606398bf18c98063fbe998a4d27f75067eef,...                  │
│ date       ┆ str    ┆ 2023-07-03 12:03:08.605833100 -05:00, 2023-07-03 12:45:46.859419300 -05:00, 2023-07-05 07:41:06.186069800 -05:00, 2023-07-05 12:31:59.73713950... │
│ time       ┆ i64    ┆ 132610400, 108749400, 169129700, 136108900, 110939500, 106339100, 221954000, 125643500, 132976300, 131243200, 136889900, 110602800, 110799700,... │
│ version    ┆ str    ┆ 0.82.1, 0.82.1, 0.82.1, 0.82.1, 0.82.1, 0.82.1, 0.82.1, 0.82.1, 0.82.1, 0.82.1, 0.82.1, 0.82.1, 0.82.1, 0.82.1, 0.82.1, 0.82.1, 0.82.1, 0.82.1... │
└────────────┴────────┴───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

vincev commented 1 year ago

ah that's interesting, it looks like it only picks up .json files so that it can read multiple partitioned files under a folder:

$ ls data
buildtimesnd.txt   buildtimesnd1.json buildtimesnd2.json

then passing the data folder only reads .json files:

dply -c 'config(max_table_width=50); json("data") | glimpse()'
┌────────────┬────────┬──────────────────────────┐
│ Rows: 358  ┆ Type   ┆ Values                   │
│ Cols: 7    ┆        ┆                          │
╞════════════╪════════╪══════════════════════════╡
│ allocator  ┆ str    ┆ mimalloc, mimalloc,...   │
│ build      ┆ str    ┆ release, release,...     │
│ build_time ┆ str    ┆ 2023-07-03 10:40:42...   │
│ commit     ┆ str    ┆ 2bb0c1c618f961843b494... │
│ date       ┆ str    ┆ 2023-07-03...            │
│ time       ┆ i64    ┆ 132610400, 108749400,... │
│ version    ┆ str    ┆ 0.82.1, 0.82.1, 0.82.... │
└────────────┴────────┴──────────────────────────┘

I'll see if it is possible to override this behavior when the user passes an extension, so that the file is read as ndjson.

vincev commented 1 year ago

Thank you @fdncred, I changed the behavior in #53 so that the default extension is only used when loading from a folder without specifying an extension:

dply -c 'config(max_table_width=50); json("buildtimes.txt") | glimpse()'
┌────────────┬────────┬──────────────────────────┐
│ Rows: 179  ┆ Type   ┆ Values                   │
│ Cols: 7    ┆        ┆                          │
╞════════════╪════════╪══════════════════════════╡
│ allocator  ┆ str    ┆ mimalloc, mimalloc,...   │
│ build      ┆ str    ┆ release, release,...     │
│ build_time ┆ str    ┆ 2023-07-03 10:40:42...   │
│ commit     ┆ str    ┆ 2bb0c1c618f961843b494... │
│ date       ┆ str    ┆ 2023-07-03...            │
│ time       ┆ i64    ┆ 132610400, 108749400,... │
│ version    ┆ str    ┆ 0.82.1, 0.82.1, 0.82.... │
└────────────┴────────┴──────────────────────────┘
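
The rule described in this comment can be sketched as follows. This is only an illustration in Python and not dply's actual implementation (dply is written in Rust); the function name `resolve_input_files` and the `default_ext` parameter are hypothetical.

```python
from pathlib import Path


def resolve_input_files(path, default_ext=".json"):
    """Return the files a reader would load for `path` (illustrative only).

    A folder is filtered by the default extension; an explicit file
    path is used as given, whatever its extension.
    """
    p = Path(path)
    if p.is_dir():
        # Folder: keep only files carrying the default extension.
        return sorted(
            f for f in p.iterdir() if f.is_file() and f.suffix == default_ext
        )
    # Single file: no extension check at all.
    return [p]
```

Under these assumptions, the data folder listed earlier would yield buildtimesnd1.json and buildtimesnd2.json and skip buildtimesnd.txt, while passing "buildtimes.txt" directly returns that file unchanged.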

fdncred commented 1 year ago

Thanks for the follow-up. I haven't tried it out yet, but I looked at the PR and it seemed reasonable. Appreciate the work!