observablehq / framework

A static site generator for data apps, dashboards, reports, and more. Observable Framework combines JavaScript on the front-end for interactive graphics with any language on the back-end for data analysis.
https://observablehq.com/framework/
ISC License

"No results" after refreshing site when using some parquet files #1470

Closed brichards920 closed 1 week ago

brichards920 commented 1 week ago

When I write queries against parquet files tagged in the 'sql' preamble, I'm finding that they'll work as I'm starting to develop a report, but as I continue to iterate, they'll stop returning results entirely.

It only seems to happen with medium-sized or larger files. I don't really see it with parquet files of only 100 or so rows. After seeing it in some of my actual datasets, I created a dummy dataset with 10,000 rows and just over 100 columns of dates, strings, and numbers, and I was able to reproduce it with that too.

I'm not certain, but it seems like it may be related to the browser's cache. If I open a private browsing window and a normal window, the private window seems OK while the normal window starts to fail.

Here's the markdown file that corresponds to the situation that causes a failure for me. It's not typical for me to display 100+ columns in an Observable table, but I was seeing this error with smaller tables and I was trying to force it here.


---
theme: dashboard
title: Example dashboard
toc: false
sql:
  dummy: ./data/dummy_data.parquet
---

```sql id=d1
select df_id::TEXT, num1, num5, str3 from dummy
```

```sql id=d2
select df_id, int1, int2, int3, num1, num2, num3, num4, num5, str1, str2, str3, str4, str5, dt1, dt2, dt3, dt4,
    dt5, num6, num7, num8, num9, num10, num11, num12, num13, num14, num15, num16, num17, num18, num19, num20,
    num21, num22, num23, num24, num25, num26, num27, num28, num29, num30, num31, num32, num33, num34, num35,
    num36, num37, num38, num39, num40, num41, num42, num43, num44, num45, num46, num47, num48, num49, num50,
    num51, num52, num53, num54, num55, num56, num57, num58, num59, num60, num61, num62, num63, num64, num65,
    num66, num67, num68, num69, num70, num71, num72, num73, num74, num75, num76, num77, num78, num79, num80,
    num81, num82, num83, num84, num85, num86, num87, num88, num89, num90, num91, num92, num93, num94, num95,
    num96, num97, num98, num99, num100 from dummy
```

${Inputs.table(d1)}
${Inputs.table(d2)}


I haven't attached the parquet file since the file type doesn't seem to be supported for upload here. If there are any best practices you'd suggest for the parquet file that I should be using, like what version, compression, and so on, I can try that. I created this with a basic "write_parquet(dummyFrame, 'dummy_data.parquet')" call in R.

I don't think I experience the issue when I use .arrow files, so I have a workaround. But I wanted to check in here in case there's anything I can do to guarantee that the parquet files will continue to load after multiple refreshes.

mbostock commented 1 week ago

Are you seeing any errors or failed requests in either the preview server console, or in your browser’s console?

brichards920 commented 1 week ago

I'm not getting any errors or anything in the console. I did notice that for the dataset that isn't loading, after clearing the cache and restarting the server, I'm seeing it send more GET requests to the problematic file the first time and fewer the second time, when the issue starts to present. I switched the hash to ellipses to save space below.

Load 1:
HEAD /_file/data/problematic.parquet?sha=...
GET /_file/data/problematic.parquet?sha=...
GET /_file/data/problematic.parquet?sha=...
GET /_file/data/problematic.parquet?sha=...
GET /_file/data/problematic.parquet?sha=...
GET /_file/data/problematic.parquet?sha=...

Load 2 (fewer calls to problematic.parquet):
HEAD /_file/data/problematic.parquet?sha=...
GET /_file/data/problematic.parquet?sha=...
GET /_file/data/problematic.parquet?sha=...
GET /_file/data/problematic.parquet?sha=...
GET /_file/data/problematic.parquet?sha=...

I'll continue to look into this to see if I can create a fully reproducible example. I've tried subsetting columns and rows of the affected files to see if I can identify a specific column type or value in the data that is causing the problem, but nothing has jumped out so far.
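
To compare the two loads without counting by eye, the request lines can be tallied per method. This is a minimal sketch, assuming the preview server log has been copied into a string; `countRequests` and the sample strings are illustrative names, not part of Framework.

```javascript
// Tally preview-server request lines by HTTP method for one file.
// `countRequests` and its parameters are hypothetical names used for illustration.
function countRequests(logText, pathFragment) {
  const counts = {};
  // Match "<METHOD> <path>" pairs, whether they sit on one line or on many.
  const pattern = /\b(HEAD|GET)\s+(\S+)/g;
  for (const [, method, path] of logText.matchAll(pattern)) {
    if (!path.includes(pathFragment)) continue; // ignore requests for other files
    counts[method] = (counts[method] ?? 0) + 1;
  }
  return counts;
}

// Sample input rebuilt from the two loads quoted above (hash elided as in the log).
const requestLine = "/_file/data/problematic.parquet?sha=... ";
const load1 = "HEAD " + requestLine + ("GET " + requestLine).repeat(5);
const load2 = "HEAD " + requestLine + ("GET " + requestLine).repeat(4);

console.log(countRequests(load1, "problematic.parquet")); // { HEAD: 1, GET: 5 }
console.log(countRequests(load2, "problematic.parquet")); // { HEAD: 1, GET: 4 }
```

The second load issues one fewer GET for the same file, which matches the point at which the query starts returning no results.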

brichards920 commented 1 week ago

I kept looking into this to find a minimal reproducible example. I've tested the process below on two computers just to be sure. I grabbed the 'weather.parquet' file from this page:

https://observablehq.com/@cmudig/duckdb-client

And I'm using this markdown file:

---
theme: dashboard
title: Example dashboard
toc: false
sql:
  w: ./data/weather.parquet
---

```sql id=tbl2 display
select * from w limit 1
```

If I start the server with npm run dev, the page loads and everything looks great. 

![image](https://github.com/observablehq/framework/assets/10135679/932f0a7e-cc57-4430-bbe9-ab4e5cc7ca23)

Then if I adjust the query to:

select * from w limit 2



and save, the page live reloads and now I see two rows as expected. 

![image](https://github.com/observablehq/framework/assets/10135679/251b29e2-e822-4c5f-a8b5-5e485636dbfa)

However, if I then manually refresh the browser using Ctrl+R, the table returns to "no results" and fails until I clear the cache.

![image](https://github.com/observablehq/framework/assets/10135679/f62cc16e-5078-43cb-ac60-ad2f318bf722)

In addition, any query I run against the 'weather' dataset now fails with "no results" until I clear the cache - even restarting the server doesn't help. So it seems like my manual refresh might be the issue. 
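
One way to check whether the browser's HTTP cache is returning different bytes than the server is to fetch the file twice, once preferring the cache and once bypassing it. This is a rough sketch meant for the browser console; `compareCachedAndFresh` is a hypothetical helper, and the example path in the comment is an assumption based on the `/_file/data/...` URLs in the log earlier in this thread.

```javascript
// Compare a cache-preferring fetch and a cache-bypassing fetch of the same URL.
// `compareCachedAndFresh` is a hypothetical helper; `fetchImpl` defaults to the
// global fetch so the function can also be exercised with a stub.
async function compareCachedAndFresh(url, fetchImpl = fetch) {
  // "force-cache" uses a matching HTTP cache entry if one exists;
  // "no-store" skips the cache and asks the server again.
  const [cached, fresh] = await Promise.all(
    ["force-cache", "no-store"].map(async (cache) => {
      const response = await fetchImpl(url, {cache});
      const bytes = new Uint8Array(await response.arrayBuffer());
      return {status: response.status, bytes};
    })
  );
  // The responses agree only if they have the same length and identical bytes.
  const same =
    cached.bytes.length === fresh.bytes.length &&
    cached.bytes.every((byte, i) => byte === fresh.bytes[i]);
  return {
    cachedStatus: cached.status,
    freshStatus: fresh.status,
    cachedLength: cached.bytes.length,
    freshLength: fresh.bytes.length,
    same
  };
}

// Example (assumed path; run in the browser console on the preview page):
// compareCachedAndFresh("/_file/data/weather.parquet").then(console.log);
```

This only compares whole-file responses. The repeated GET requests in the earlier log suggest the file is read in several requests, so a match here would not rule out a caching problem.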

I'm coming to Framework having used Observable in the Quarto implementation, so I'm probably just refreshing the page out of habit. Is this a bug, or should I just 100% avoid using anything but the live preview?

mbostock commented 1 week ago

I’m afraid I’m not able to reproduce.

https://github.com/observablehq/framework/assets/230541/324490bd-1689-4347-9040-e1ffd1e674d1

There shouldn’t be any issue with manually reloading the page. The fact that clearing the cache fixes the issue suggests that this may be a browser issue. If you’re willing, you could try updating or re-installing your browser, resetting your browser settings, uninstalling any extensions, or trying a different browser.

Since I can’t reproduce and can’t investigate further, I’m going to close this issue. But if you have any other hints, I’d be happy to take another look.

brichards920 commented 1 week ago

Thanks for your help and for taking a look. Here's a similar setup on my screen with Chrome, Chrome in Incognito Mode, and Edge so you can see it live.

https://github.com/observablehq/framework/assets/10135679/86ab9985-6668-443b-b16c-1b171b5751a6

mbostock commented 1 week ago

Browser extensions are disabled in incognito tabs, so the fact that it works in the incognito tab suggests that a browser extension may be at fault. Did you try disabling browser extensions as I mentioned?

brichards920 commented 1 week ago

I just double-checked: I reset both Chrome and Edge to their default settings and removed all extensions, but unfortunately the problem persists.

I tried Firefox and oddly it does not seem to have the same problem.

Thanks again for checking into it. If I use arrow files instead, the issue goes away, so I can continue to use Framework.

mbostock commented 1 week ago

Sorry for the trouble, and thanks for trying! I’ll report back here if we learn anything on this problem.

brichards920 commented 4 days ago

I'm just making a quick update here to note that the issue is likely related to duckdb-wasm:

https://github.com/duckdb/duckdb-wasm/issues/1658

Users noted there as well that the issue could not be reproduced on a Mac.