observablehq / framework

A static site generator for data apps, dashboards, reports, and more. Observable Framework combines JavaScript on the front-end for interactive graphics with any language on the back-end for data analysis.
https://observablehq.com/framework/
ISC License

Loading parquet file gives "RuntimeError: unreachable executed" #873

Closed llimllib closed 6 months ago

llimllib commented 6 months ago

As far as I can tell, the parquet file is valid; unfortunately I can't post the whole thing because it's derived from logs at work.

This is the code to try and load the file:

```js echo
const stats = FileAttachment("log.parquet").parquet();
display(Inputs.table(stats));
```

The full traceback from the console:

<details>

```
RuntimeError: unreachable executed
    RuntimeError http://127.0.0.1:3000/_observablehq/runtime.js:2
    value http://127.0.0.1:3000/_observablehq/runtime.js:2
    promise callback*mr http://127.0.0.1:3000/_observablehq/runtime.js:2
    wr http://127.0.0.1:3000/_observablehq/runtime.js:2
    value http://127.0.0.1:3000/_observablehq/runtime.js:2
    value http://127.0.0.1:3000/_observablehq/runtime.js:2
    promise callback*value http://127.0.0.1:3000/_observablehq/runtime.js:2
    value http://127.0.0.1:3000/_observablehq/runtime.js:2
    ar http://127.0.0.1:3000/_observablehq/runtime.js:2
    value http://127.0.0.1:3000/_observablehq/runtime.js:2
    onmessage http://127.0.0.1:3000/_observablehq/client.js:355
    onmessage http://127.0.0.1:3000/_observablehq/client.js:355
    open http://127.0.0.1:3000/_observablehq/client.js:285
    http://127.0.0.1:3000/logs:17
client.js:146:48
```

</details>

- The `parquet` CLI program reports that the file is normal:

```
$ parquet check-stats log.parquet
log.parquet has no corrupt stats
```

- `duckdb` can load the file without an issue:

```
D select * from 'log.parquet' order by timestamp desc limit 10;
100% ▕████████████████████████████████████████████████████████████▏
┌────────────────────────────────┬──────────────────────────┬──────────┐
│              path              │        timestamp         │ duration │
│            varchar             │         varchar          │  int64   │
├────────────────────────────────┼──────────────────────────┼──────────┤
│ /                              │ 2024-02-18T23:59:59.000Z │      138 │
│ /v1/request                    │ 2024-02-18T23:59:59.000Z │       94 │
│ /pageview                      │ 2024-02-18T23:59:59.000Z │      358 │
│ /v1/request                    │ 2024-02-18T23:59:59.000Z │       84 │
│ /wiz-docs/docs/getting-started │ 2024-02-18T23:59:59.000Z │     1622 │
│ /                              │ 2024-02-18T23:59:59.000Z │      120 │
│ /                              │ 2024-02-18T23:59:59.000Z │      161 │
│ /                              │ 2024-02-18T23:59:59.000Z │      101 │
│ /                              │ 2024-02-18T23:59:59.000Z │      242 │
│ /                              │ 2024-02-18T23:59:59.000Z │      180 │
├────────────────────────────────┴──────────────────────────┴──────────┤
│ 10 rows                                                    3 columns │
└──────────────────────────────────────────────────────────────────────┘
D create table logs as select * from 'log.parquet';
100% ▕████████████████████████████████████████████████████████████▏
D select max(duration) from logs;
┌───────────────┐
│ max(duration) │
│     int64     │
├───────────────┤
│        461646 │
└───────────────┘
```

The file is large: 270 MB. Is that just too big for JavaScript in my browser? Is there any further way I can debug what's going on here?
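
One cheap check that needs no data to be shared is to look only at the file's size and framing bytes. The sketch below is an illustration, not part of Framework or any Parquet library: the helper name `parquet_quick_check` is hypothetical, and it relies only on the standard Parquet layout (a `PAR1` marker at both ends, with a 4-byte little-endian footer length just before the trailing marker). It says nothing about whether the data pages themselves decode.

```python
import os
import struct

PARQUET_MAGIC = b"PAR1"  # 4-byte marker at the start and end of a Parquet file


def parquet_quick_check(path):
    """Report size and framing of a Parquet file without decoding any data.

    Hypothetical helper for illustration; it does not validate the contents.
    """
    size = os.path.getsize(path)
    if size < 12:  # leading magic + footer length + trailing magic
        return {"size_bytes": size, "magic_ok": False, "footer_bytes": None}
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-8, os.SEEK_END)
        tail = f.read(8)
    footer_len = struct.unpack("<I", tail[:4])[0]  # little-endian footer size
    magic_ok = head == PARQUET_MAGIC and tail[4:] == PARQUET_MAGIC
    return {
        "size_bytes": size,
        "magic_ok": magic_ok,
        "footer_bytes": footer_len if magic_ok else None,
    }
```

Calling `parquet_quick_check("log.parquet")` would return the byte size alongside the framing result, which at least separates "truncated or mis-framed file" from "well-formed file that the browser-side reader cannot handle".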

Fil commented 6 months ago

Can you try to create a fake dataset that reproduces the bug? Maybe by setting all fields to random gibberish? Otherwise this issue is not actionable.

espinielli commented 6 months ago

I'm also curious to see a reproducible example. I tried to write a simple dataset to Parquet from R but failed miserably: I haven't been able to find any docs or examples that show how to write Parquet to stdout. My naive reproducible example for a `log.parquet.R` data loader would have been:

library(dplyr)
library(arrow)

tribble(
  ~colA, ~colB,
  "a",   1,
  "b",   2,
  "c",   3
) |>
  write_parquet(stdout())

but it failed with

Error in seek.connection(1L) : 'seek' not enabled for this connection
Error in close.connection(1L) : cannot close standard connections

Fil commented 6 months ago

@espinielli it looks like write_parquet expects to write to disk, not to stdout which doesn't allow arbitrary seeks. So you might want to write to a temp file on disk, then cat the temp file to stdout.
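
The same temp-file-then-stdout pattern applies to loaders in other languages. Here is a minimal Python sketch of it, under stated assumptions: `emit_via_tempfile` and its `write_to_path` callback are hypothetical names for illustration, and the callback stands in for whatever Parquet writer needs a seekable file.

```python
import os
import shutil
import sys
import tempfile


def emit_via_tempfile(write_to_path, out=None, suffix=".parquet"):
    """Call write_to_path(path) on a seekable temp file, then copy its bytes to out.

    Hypothetical helper: write_to_path is any callable that writes a file at
    the given path; out defaults to binary stdout.
    """
    if out is None:
        out = sys.stdout.buffer  # binary stream, since Parquet is not text
    fd, path = tempfile.mkstemp(suffix=suffix)
    os.close(fd)  # the writer reopens the file by path
    try:
        write_to_path(path)  # the writer may seek freely here
        with open(path, "rb") as f:
            shutil.copyfileobj(f, out)  # stream the finished file out
        out.flush()
    finally:
        os.remove(path)  # clean up even if the writer raised
```

With pandas, for example, the callback could be `lambda p: df.to_parquet(p)`. Writing to a real file first sidesteps the seek problem, because only the final sequential copy touches stdout.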

espinielli commented 6 months ago

@Fil yes that is what I will try:

  1. Write to a temporary file
  2. Dump to stdout

Fil commented 6 months ago

@espinielli for inspiration you can take a look at @allisonhorst's (work in progress) R data loader at https://github.com/observablehq/framework/pull/749/files#diff-2ef0831543dc2337a63f37ab4e629cc287b3e09b9ba63cabf788d09b29536bda

llimllib commented 6 months ago

Unfortunately the arrow wasm library says that there's a function to print a useful error message, but that it's disabled to reduce bundle size.

I'm still trying to narrow this down... I have generated a log file of similar size & composition that works, but I get this error reproducibly on my own file.
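
If the failure depends on how much of the file is loaded, one way to narrow it down is to bisect on row count. The sketch below is only an outline of that idea: `first_failing_size` and the `loads_ok` predicate are hypothetical, and `loads_ok(n)` stands for "write the first `n` rows to a Parquet file and check whether the page loads it", which may well be a manual step. It also assumes failure is monotonic in `n`; if the cause is a specific row or column rather than size, that assumption does not hold.

```python
def first_failing_size(loads_ok, lo, hi):
    """Return the smallest n in (lo, hi] for which loads_ok(n) is False.

    Assumes loads_ok(lo) is True, loads_ok(hi) is False, and that once a
    size fails, every larger size fails too. loads_ok is a hypothetical,
    caller-supplied check.
    """
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if loads_ok(mid):
            lo = mid  # still loads: the threshold is above mid
        else:
            hi = mid  # fails: the threshold is at or below mid
    return hi
```

For a file of 10 million rows this needs at most 24 checks, so it stays practical even when each check means regenerating a file and reloading the page by hand.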

The script I used to generate a file that does not fail, and is 256mb on disk:

```python
import random

# pip install mimesis pandas
from mimesis import Datetime, Path
import pandas as pd

path = Path("linux")
dt = Datetime()

# 10k records is 320k on disk
# 10m records is 256mb on disk
n = 10_000_000
df = pd.DataFrame(
    {
        "path": (path.project_dir() for _ in range(n)),
        "time": (dt.timestamp() for _ in range(n)),
        "duration": (random.randint(1, 10000) for _ in range(n)),
    }
)
df.to_parquet("fake.parquet")
```

llimllib commented 6 months ago

I'm pretty certain that the size of the file is the issue here, based on:

So this bug really comes down to: it seems like the parquet library that Framework relies on has really terrible error messaging when the file it tries to load is too large.

I'll leave it to you all if you think it's worth doing something about that, and I don't mean that passive aggressively: it may just not be worth fixing. It is kind of a poor user experience though.

Fil commented 6 months ago

I have an example where the errors I get (with duckdb-wasm) are:

RuntimeError: Invalid Error: don't know what type: 

and

RuntimeError: Invalid Error: Variable-length int over 10 bytes.

So yeah, it's hard to tell what's happening. From this repo, we can only hope that the duckdb/parquet/arrow tooling becomes better. But it might help to open a relevant issue with them once we close in on what causes it.

mrppdex commented 5 months ago

I came across the `RuntimeError: unreachable executed` message too. I'm not sure why transforming my data with `df <- df[1:nrow(df),]` helped, but it did.

library(arrow)
library(clinicalfd)

df <- clinicalfd::adsl
df <- df[1:nrow(df),]

# Write the data frame to a temporary Parquet file
temp_file <- tempfile(fileext = ".parquet")
arrow::write_parquet(df, sink = temp_file)

system2('/bin/cat', args = temp_file)

espinielli commented 5 months ago

In an interactive session, `df <- clinicalfd::adsl` resolves directly to a data frame. My guess is that in a non-interactive session you instead get a Promise that is not yet resolved when you save the parquet file. Probably substituting

df <- df[1:nrow(df),]

with

rlang::eval_tidy(df)

would do...

Anyway, I would access the data from a package with utils::data() so your code would be

# you need to have installed `rlang` (and `sas2r/clinical_fd`)
library(arrow)

utils::data(adsl, package = "clinicalfd")   # you get `adsl` as a Promise object
rlang::eval_tidy(adsl)                      # resolve the Promise, now `adsl` is a data frame

# Write the data frame to a temporary Parquet file
temp_file <- tempfile(fileext = ".parquet")
arrow::write_parquet(adsl, sink = temp_file)

system2('/bin/cat', args = temp_file)

mrppdex commented 5 months ago

I tried both

rlang::eval_tidy(df)
# AND
force(df)

and neither helped. I ran `all_equal(df, df[1:nrow(df),])` and there are differences in attributes between the two data frames. The original has a 'label' attribute associated with its columns, and the transformed one has them stripped off.

ORIGINAL

 $ mmsetot : 'labelled' int 23 23 23 23 21 23 10 23 20 20 ...
  ..- attr(*, "label")= chr "MMSE Total"

TRANSFORMED

 $ mmsetot : int 23 23 23 23 21 23 10 23 20 20 ...

It looks like there's an issue with loading parquet data with labelled column names.

llimllib commented 5 months ago

@mrppdex you may want to file an issue on the parquet-wasm repo if you have a parquet file that reproducibly causes the issue