vincev / dply-rs

A data manipulation tool for parquet, csv, and json data.
Apache License 2.0
42 stars 3 forks source link

question: can you cast data types #50

Closed fdncred closed 1 year ago

fdncred commented 1 year ago

In #47, I learned how to open a json file, as long as it's jsonl/ndjson and named with a .json file extension. Now I have a int64 field that needs to be interpreted as a timestamp(nanoseconds). Is there a way to cast the field to some type of a duration or timestamp datatype?

vincev commented 1 year ago

As it is now there is only a dt function to convert date strings to datetime:

dply -c 'json("./buildtimesnd.json") |
mutate(date=dt(date))|
group_by(commit) |
summarize(min_date = min(date), max_date = max(date)) |
mutate(date_diffs=max_date-min_date)|
head(3)'
shape: (3, 4) elapsed: 0.005s
┌──────────────────────────────────────────┬─────────────────────────┬─────────────────────────┬──────────────────┐
│ commit                                   ┆ min_date                ┆ max_date                ┆ date_diffs       │
│ ---                                      ┆ ---                     ┆ ---                     ┆ ---              │
│ str                                      ┆ datetime[ms]            ┆ datetime[ms]            ┆ interval[dt]     │
╞══════════════════════════════════════════╪═════════════════════════╪═════════════════════════╪══════════════════╡
│ 406b606398bf18c98063fbe998a4d27f75067eef ┆ 2023-07-05T12:41:06.186 ┆ 2023-07-07T13:31:08.454 ┆ 2D 0h 50m 2.268s │
│ 8e38596bc9494357f01f166076e8d563f28016f3 ┆ 2023-07-07T13:35:20.692 ┆ 2023-07-10T15:42:42.746 ┆ 3D 2h 7m 22.054s │
│ 6f5bd62a974c4b8660783b792b18c7597a805c1e ┆ 2023-08-09T17:43:54.095 ┆ 2023-08-09T19:50:25.802 ┆ 2h 6m 31.707s    │
└──────────────────────────────────────────┴─────────────────────────┴─────────────────────────┴──────────────────┘

and I have just noticed that it converts them to UTC.

To do that we need to add a function like timedelta('nanos', time), looks very useful I'll add that.

fdncred commented 1 year ago

That sounds cool, except for the converting to UTC part. I'm particularly interested in the time column. What I set out to do today was to see if I can do what I do in nushell in dply. Here's what my output looks like, from this buildtimes file, in nushell.

❯ open ~\.local\share\nushell\startup-times.nuon | where build == release | 
group-by commit | 
transpose commit data | 
upsert perf {|r| if ($r.data | length) > 3 {$r.data.time | sort | range 0..(-3) | math avg} else {$r.data.time | math avg}} | 
upsert start_date {|s| $s.data.date | sort | math min | format date '%Y-%m-%d'} | 
upsert end_date {|e| $e.data.date | sort | math max | format date '%Y-%m-%d'} | 
upsert num_startups {|n| $n.data | length} | 
sort-by start_date
╭#─┬─────────────────commit─────────────────┬─────data──────┬──────perf───────┬start_date┬─end_date─┬num_startups╮
│0 │2bb0c1c618f961843b49432fb7a21304b41493af│[table 2 rows] │120ms 679µs 900ns│2023-07-03│2023-07-03│           2│
│1 │406b606398bf18c98063fbe998a4d27f75067eef│[table 40 rows]│126ms 134µs 792ns│2023-07-05│2023-07-07│          40│
│2 │8e38596bc9494357f01f166076e8d563f28016f3│[table 13 rows]│128ms 839µs 700ns│2023-07-07│2023-07-10│          13│
│3 │cf36f052c46b6efe57500e3acb7f52d2d0cb8d2e│[table 2 rows] │184ms 785µs 800ns│2023-07-12│2023-07-12│           2│
│4 │b2043135ed956ead0d3b5d5df49ea9d929dc7120│[table 2 rows] │ 136ms 68µs 100ns│2023-07-12│2023-07-13│           2│
│5 │4804e6a151ca0f212c3f4b097b4d805a69535149│[table 7 rows] │ 130ms 33µs 160ns│2023-07-14│2023-07-14│           7│
│6 │48271d8c3e1f83723f005ae1809ebd5026783f8a│[table 3 rows] │ 149ms 20µs 600ns│2023-07-17│2023-07-17│           3│
│7 │a5a79a7d95822bc143090612e1813f3b06befbf4│[table 10 rows]│139ms 845µs 700ns│2023-07-18│2023-07-20│          10│
│8 │9db0d6bd34a99805c6da296688aa186778be5a86│[table 3 rows] │138ms 853µs 133ns│2023-07-24│2023-07-24│           3│
│9 │208071916209af5a4159b131e438aa6cab524532│[table 5 rows] │123ms 224µs 733ns│2023-07-25│2023-07-25│           5│
│10│a33b5fe6ce97b5e9fe8a774c13e783ed65c1b591│[table 7 rows] │125ms 925µs 300ns│2023-07-25│2023-07-27│           7│
│11│f8d325dbfef5fec7ee109e37c624236998de8843│[table 4 rows] │108ms 199µs 800ns│2023-07-27│2023-07-27│           4│
│12│6aa30132aae188639a78ba8fd7feddc952d5792e│[table 8 rows] │112ms 912µs 100ns│2023-07-27│2023-07-27│           8│
│13│8403fff34500d30439545519c88c7d942c717e3e│[table 3 rows] │ 131ms 968µs 33ns│2023-07-27│2023-07-28│           3│
│14│94bec720791f716b44cb23db363f53e2fa7acce3│[table 21 rows]│118ms 872µs 289ns│2023-07-31│2023-08-01│          21│
│15│f6033ac5af75073dddce2400304448dbbadd0318│[table 2 rows] │155ms 370µs 400ns│2023-08-01│2023-08-01│           2│
│16│778a00efa10735e7eb368aea1ddfeb6af3d3720a│[table 4 rows] │126ms 261µs 100ns│2023-08-01│2023-08-02│           4│
│17│ec4941c8ac45f94ab408753b173ce991ce0fafd3│[table 1 row]  │121ms 661µs 700ns│2023-08-02│2023-08-02│           1│
│18│840b4b854b31e102cafd7667590626c4cfd0e19e│[table 7 rows] │139ms 761µs 300ns│2023-08-03│2023-08-04│           7│
│19│a9582e1c62f7e7eaec18c0302dbd699b2f164630│[table 3 rows] │176ms 377µs 433ns│2023-08-07│2023-08-07│           3│
│20│b1974fae399885ba93dda2b2a510ab585bda9f1a│[table 1 row]  │118ms 110µs 100ns│2023-08-07│2023-08-07│           1│
│21│56ed1eb8071658282ffe55b91d6dfc8ef958844c│[table 1 row]  │      152ms 756µs│2023-08-08│2023-08-08│           1│
│22│85c20350161b100abe14df9ef7b77337e1f76ec0│[table 1 row]  │158ms 963µs 300ns│2023-08-09│2023-08-09│           1│
│23│0674d4960b8b09073b31cca9e62eeeb1a1ddd4a8│[table 5 rows] │119ms 189µs 733ns│2023-08-09│2023-08-09│           5│
│24│6f5bd62a974c4b8660783b792b18c7597a805c1e│[table 3 rows] │122ms 395µs 366ns│2023-08-09│2023-08-09│           3│
│25│d5fa7b8a557ca0b0ca272cf82c66c2cc2e940f39│[table 2 rows] │160ms 789µs 100ns│2023-08-10│2023-08-10│           2│
│26│f9ffd9ae2917871092485f8c5bc393d6e96f7784│[table 7 rows] │125ms 912µs 440ns│2023-08-10│2023-08-11│           7│
│27│aa37572ddc79f5d2c040b8480ec9af6e5d1550d9│[table 11 rows]│139ms 547µs 844ns│2023-08-11│2023-08-14│          11│
│28│ad49c17ebacd04585fb4355e079ec87d7fc63d8d│[table 2 rows] │      134ms 946µs│2023-08-14│2023-08-14│           2│
╰#─┴─────────────────commit─────────────────┴─────data──────┴──────perf───────┴start_date┴─end_date─┴num_startups╯

I'm getting close but not quite there yet, I'm sure mostly because of my ignorance of dply.

❯ dply -c 'json("buildtimesnd.json") |
❯❯❯ mutate(date=dt(date)) |
❯❯❯ group_by(commit) |
❯❯❯ summarize(
❯❯❯   min_date = min(date),
❯❯❯   max_date = max(date),
❯❯❯   startups = n(),
❯❯❯   perf_datetime = mean(time)
❯❯❯ ) | arrange(min_date, max_date) |
❯❯❯ show()'
shape: (29, 5) elapsed: 0.008s
┌──────────────────────────────────────────┬─────────────────────────┬─────────────────────────┬────────────────┬──────────────────┐
│ commit                                   ┆ min_date                ┆ max_date                ┆ startups       ┆ perf_datetime    │
│ ---                                      ┆ ---                     ┆ ---                     ┆ ---            ┆ ---              │
│ str                                      ┆ datetime[ms]            ┆ datetime[ms]            ┆ i64            ┆ f64              │
╞══════════════════════════════════════════╪═════════════════════════╪═════════════════════════╪════════════════╪══════════════════╡
│ 2bb0c1c618f961843b49432fb7a21304b41493af ┆ 2023-07-03T17:03:08.605 ┆ 2023-07-03T17:45:46.859 ┆ 2              ┆ 120679900.0      │
│ 406b606398bf18c98063fbe998a4d27f75067eef ┆ 2023-07-05T12:41:06.186 ┆ 2023-07-07T13:31:08.454 ┆ 40             ┆ 129605145.0      │
│ 8e38596bc9494357f01f166076e8d563f28016f3 ┆ 2023-07-07T13:35:20.692 ┆ 2023-07-10T15:42:42.746 ┆ 13             ┆ 132144692.307692 │
│ cf36f052c46b6efe57500e3acb7f52d2d0cb8d2e ┆ 2023-07-12T15:10:55.382 ┆ 2023-07-12T19:23:46.223 ┆ 2              ┆ 184785800.0      │
│ b2043135ed956ead0d3b5d5df49ea9d929dc7120 ┆ 2023-07-12T20:56:26.198 ┆ 2023-07-13T20:14:08.484 ┆ 2              ┆ 136068100.0      │
│ 4804e6a151ca0f212c3f4b097b4d805a69535149 ┆ 2023-07-14T16:39:30.111 ┆ 2023-07-14T20:25:04.629 ┆ 7              ┆ 144062385.714286 │
│ 48271d8c3e1f83723f005ae1809ebd5026783f8a ┆ 2023-07-17T13:09:13.074 ┆ 2023-07-17T19:20:06.502 ┆ 3              ┆ 149020600.0      │
│ a5a79a7d95822bc143090612e1813f3b06befbf4 ┆ 2023-07-18T16:44:35.140 ┆ 2023-07-20T16:01:20.985 ┆ 10             ┆ 159522440.0      │
│ 9db0d6bd34a99805c6da296688aa186778be5a86 ┆ 2023-07-24T12:57:55.020 ┆ 2023-07-24T13:15:05.868 ┆ 3              ┆ 138853133.333333 │
│ 208071916209af5a4159b131e438aa6cab524532 ┆ 2023-07-25T12:22:59.358 ┆ 2023-07-25T17:46:46.435 ┆ 5              ┆ 231826620.0      │
│ a33b5fe6ce97b5e9fe8a774c13e783ed65c1b591 ┆ 2023-07-25T20:39:13.457 ┆ 2023-07-27T14:53:35.224 ┆ 7              ┆ 200283471.428571 │
│ f8d325dbfef5fec7ee109e37c624236998de8843 ┆ 2023-07-27T15:05:15.536 ┆ 2023-07-27T15:26:56.575 ┆ 4              ┆ 114143575.0      │
│ 6aa30132aae188639a78ba8fd7feddc952d5792e ┆ 2023-07-27T15:56:29.689 ┆ 2023-07-27T18:06:09.548 ┆ 8              ┆ 119858087.5      │
│ 8403fff34500d30439545519c88c7d942c717e3e ┆ 2023-07-27T20:02:18.214 ┆ 2023-07-28T14:01:34.190 ┆ 3              ┆ 131968033.333333 │
│ 94bec720791f716b44cb23db363f53e2fa7acce3 ┆ 2023-07-31T13:00:49.600 ┆ 2023-08-01T13:41:42.960 ┆ 21             ┆ 123044576.190476 │
│ f6033ac5af75073dddce2400304448dbbadd0318 ┆ 2023-08-01T14:35:39.008 ┆ 2023-08-01T18:00:31.016 ┆ 2              ┆ 155370400.0      │
│ 778a00efa10735e7eb368aea1ddfeb6af3d3720a ┆ 2023-08-01T20:38:56.294 ┆ 2023-08-02T15:11:37.531 ┆ 4              ┆ 152309825.0      │
│ ec4941c8ac45f94ab408753b173ce991ce0fafd3 ┆ 2023-08-02T16:05:31.403 ┆ 2023-08-02T16:05:31.403 ┆ 1              ┆ 121661700.0      │
│ 840b4b854b31e102cafd7667590626c4cfd0e19e ┆ 2023-08-03T12:27:50.091 ┆ 2023-08-04T18:56:57.316 ┆ 7              ┆ 149403514.285714 │
│ a9582e1c62f7e7eaec18c0302dbd699b2f164630 ┆ 2023-08-07T12:24:22.177 ┆ 2023-08-07T17:23:04.703 ┆ 3              ┆ 176377433.333333 │
│ b1974fae399885ba93dda2b2a510ab585bda9f1a ┆ 2023-08-07T17:27:28.423 ┆ 2023-08-07T17:27:28.423 ┆ 1              ┆ 118110100.0      │
│ 56ed1eb8071658282ffe55b91d6dfc8ef958844c ┆ 2023-08-08T12:16:13.356 ┆ 2023-08-08T12:16:13.356 ┆ 1              ┆ 152756000.0      │
│ 85c20350161b100abe14df9ef7b77337e1f76ec0 ┆ 2023-08-09T12:04:46.705 ┆ 2023-08-09T12:04:46.705 ┆ 1              ┆ 158963300.0      │
│ 0674d4960b8b09073b31cca9e62eeeb1a1ddd4a8 ┆ 2023-08-09T13:21:52.946 ┆ 2023-08-09T17:22:13.516 ┆ 5              ┆ 131798320.0      │
│ 6f5bd62a974c4b8660783b792b18c7597a805c1e ┆ 2023-08-09T17:43:54.095 ┆ 2023-08-09T19:50:25.802 ┆ 3              ┆ 122395366.666667 │
│ d5fa7b8a557ca0b0ca272cf82c66c2cc2e940f39 ┆ 2023-08-10T12:17:21.040 ┆ 2023-08-10T13:24:30.883 ┆ 2              ┆ 160789100.0      │
│ f9ffd9ae2917871092485f8c5bc393d6e96f7784 ┆ 2023-08-10T17:08:01.088 ┆ 2023-08-11T13:20:54.459 ┆ 7              ┆ 144515628.571429 │
│ aa37572ddc79f5d2c040b8480ec9af6e5d1550d9 ┆ 2023-08-11T13:52:12.318 ┆ 2023-08-14T13:11:12.804 ┆ 11             ┆ 145850981.818182 │
│ ad49c17ebacd04585fb4355e079ec87d7fc63d8d ┆ 2023-08-14T14:12:53.863 ┆ 2023-08-14T14:12:53.863 ┆ 1              ┆ 125579000.0      │
└──────────────────────────────────────────┴─────────────────────────┴─────────────────────────┴────────────────┴──────────────────┘
vincev commented 1 year ago

Wow that's amazing, nushell is very powerful, I am definitely going to play with it and learn more about it, I think dply is probably too basic for what you need to do.

fdncred commented 1 year ago

I love dply, and I'm trying to learn about it more so I can integrate it into nushell. I think it would be awesome to have a datafusion set of commands behind a feature. I love what you've done with it and I'm especially envious of group_by() | summarize().

vincev commented 1 year ago

Thank you @fdncred for suggesting this improvement, I have added a few time conversion functions so now it is possible to convert numbers to durations and durations to numbers (following lubridate R package):

〉json("buildtimes.json") |
:::     mutate(date=ymd_hms(date)) |
:::     group_by(commit) |
:::     summarize(
:::         min_date = min(date),
:::         max_date = max(date),
:::         startups = n(),
:::         perf_datetime = mean(time)
:::     ) | 
:::     mutate(perf_datetime=dnanos(perf_datetime)) |
:::     arrange(min_date, max_date) |
:::     relocate(perf_datetime) |
:::     head(5)
::: 
shape: (5, 5) elapsed: 0.007s
┌────────────────┬──────────────────────────────────────────┬─────────────────────────┬─────────────────────────┬────────────────┐
│ perf_datetime  ┆ commit                                   ┆ min_date                ┆ max_date                ┆ startups       │
│ ---            ┆ ---                                      ┆ ---                     ┆ ---                     ┆ ---            │
│ duration[ns]   ┆ str                                      ┆ datetime[ms]            ┆ datetime[ms]            ┆ i64            │
╞════════════════╪══════════════════════════════════════════╪═════════════════════════╪═════════════════════════╪════════════════╡
│ 0.120679900s   ┆ 2bb0c1c618f961843b49432fb7a21304b41493af ┆ 2023-07-03T17:03:08.605 ┆ 2023-07-03T17:45:46.859 ┆ 2              │
│ 0.129605145s   ┆ 406b606398bf18c98063fbe998a4d27f75067eef ┆ 2023-07-05T12:41:06.186 ┆ 2023-07-07T13:31:08.454 ┆ 40             │
│ 0.132144692s   ┆ 8e38596bc9494357f01f166076e8d563f28016f3 ┆ 2023-07-07T13:35:20.692 ┆ 2023-07-10T15:42:42.746 ┆ 13             │
│ 0.184785800s   ┆ cf36f052c46b6efe57500e3acb7f52d2d0cb8d2e ┆ 2023-07-12T15:10:55.382 ┆ 2023-07-12T19:23:46.223 ┆ 2              │
│ 0.136068100s   ┆ b2043135ed956ead0d3b5d5df49ea9d929dc7120 ┆ 2023-07-12T20:56:26.198 ┆ 2023-07-13T20:14:08.484 ┆ 2              │
└────────────────┴──────────────────────────────────────────┴─────────────────────────┴─────────────────────────┴────────────────┘

It is not as nice as Nushell but it is an improvement :smile: