observablehq / framework

A static site generator for data apps, dashboards, reports, and more. Observable Framework combines JavaScript on the front-end for interactive graphics with any language on the back-end for data analysis.
https://observablehq.com/framework/
ISC License
2.16k stars 87 forks

Chaining data loaders #332

Open · Fil opened this issue 7 months ago

Fil commented 7 months ago

Suppose we want to 1. download a dataset from an API and 2. analyze it. Currently a data loader must do both at the same time, and will run the download part again if we update the analysis code.

Ideally we'd want to separate this into two chained data loaders that would still be live: if a page relies on loader 2, which relies on loader 1, editing loader 1 would tell the page to request the analysis again, which would trigger a new download. But editing loader 2 would only run the analysis again, not the download.

This would also make it easier to generate several files from a common (and slow) download.
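A minimal sketch of the proposed split, assuming nothing about Framework's eventual API: the file names are hypothetical, the slow download is stubbed, and both stages are compressed into one script so it runs standalone.

```javascript
// Sketch of the two-stage split described above. In reality these would be
// two separate data loaders (e.g. a hypothetical raw.json.js feeding a
// hypothetical analysis.json.js); here both stages live in one script, with
// the slow download stubbed, so the sketch is self-contained.
import {mkdtempSync, readFileSync, writeFileSync} from "node:fs";
import {join} from "node:path";
import {tmpdir} from "node:os";

const dir = mkdtempSync(join(tmpdir(), "loaders-"));

// Stage 1 (slow): download the dataset from an API. Editing stage 2 should
// NOT re-run this step; only editing stage 1 should.
const raw = [{player: "a", score: 3}, {player: "b", score: 7}]; // stubbed API response
writeFileSync(join(dir, "raw.json"), JSON.stringify(raw));

// Stage 2 (fast): analyze stage 1's cached output. Editing this step would
// only re-run the analysis, not the download.
const input = JSON.parse(readFileSync(join(dir, "raw.json"), "utf8"));
const analyzed = input.filter((d) => d.score > 5);
process.stdout.write(JSON.stringify(analyzed));
```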

mbostock commented 7 months ago

From https://github.com/observablehq/cli/pull/325#issuecomment-1842916117:

The API for chaining loaders isn’t really about FileAttachment, since data loaders can be written in any language. Instead, a data loader needs to be able to fetch a file from the preview server (and we need an equivalent server during build). Maybe we set an environment variable which data loaders can read to know the address of the preview server. Ideally we’d also detect and error on circular dependencies.
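A sketch of that environment-variable idea. The variable name `OBSERVABLE_ORIGIN` is hypothetical, not a real Framework setting, and a tiny stand-in HTTP server plays the role of the preview server so the sketch runs standalone.

```javascript
// Sketch of the environment-variable idea above. OBSERVABLE_ORIGIN is a
// hypothetical name; the stand-in HTTP server below exists only so the
// sketch is self-contained. In Framework it would be the preview server.
import {createServer} from "node:http";

const server = createServer((req, res) => {
  // Stand-in for the preview server serving another loader's output.
  res.setHeader("content-type", "application/json");
  res.end(JSON.stringify({rows: 42}));
});
await new Promise((resolve) => server.listen(0, "127.0.0.1", resolve));
process.env.OBSERVABLE_ORIGIN = `http://127.0.0.1:${server.address().port}`;

// Inside a downstream data loader (written in any language): read the
// server address from the environment and fetch the upstream file over HTTP.
const origin = process.env.OBSERVABLE_ORIGIN;
const upstream = await (await fetch(`${origin}/data/raw.json`)).json();
server.close();
process.stdout.write(JSON.stringify({count: upstream.rows}));
```

Because the hand-off happens over HTTP, a Python or R loader could read the same environment variable and fetch the same URL.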

mythmon commented 5 months ago

I've been working on a small project to get some first-hand experience with the CLI. In it I'm downloading a zip file from some hobbyist website, extracting a couple dozen text files, and then running them through a custom parser. Parsing them takes about 30 seconds, which is a bit longer than I want to wait for in the markdown file. I'm iterating on the parser itself, so I'm re-running it every few minutes.

The options I see for my case are:

From this, I have two wish-list items. One is chained data loaders; the other is incremental data loaders that can somehow stream their results into the client. I don't really know how that would work, and it's probably better suited for Notebooks anyways.

I can sympathize with wanting data loaders to be in any language, but it is very jarring to go from writing my code in a JS fenced code block and having it work easily, to working in a .json.js data loader where suddenly all of my imports break and I lose all of the nice tools I was using a moment ago. It makes me feel like to properly use Observable CLI I need to be fluent in three varieties of JS: Markdown code blocks, browser imports, and file attachments.

Perhaps once we have the server-based dataloader workflow that Mike mentioned, we could then wrap that in a FileAttachment facade that makes it feel just like it does in Markdown files.
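A hypothetical sketch of such a facade. This `FileAttachment` is not the real Observable API; it merely imitates the `.json()`/`.text()` shape so loader code reads like page code, and it resolves files from a local directory rather than from the preview server.

```javascript
// Hypothetical FileAttachment facade for data loaders (NOT the real
// Observable API; it only imitates the .json()/.text() shape so loader
// scripts read like page code). It resolves names against a local root.
import {mkdtempSync, writeFileSync} from "node:fs";
import {readFile} from "node:fs/promises";
import {join} from "node:path";
import {tmpdir} from "node:os";

const root = mkdtempSync(join(tmpdir(), "facade-"));

function FileAttachment(name) {
  const path = join(root, name);
  return {
    text: () => readFile(path, "utf8"),
    json: async () => JSON.parse(await readFile(path, "utf8"))
  };
}

// Demo: simulate another loader's output, then read it the page-like way.
writeFileSync(join(root, "raw.json"), JSON.stringify({games: 12}));
const data = await FileAttachment("raw.json").json();
process.stdout.write(JSON.stringify(data));
```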

Fil commented 5 months ago

@mythmon FileAttachment supports streaming, see https://observablehq.com/@observablehq/streaming-shapefiles

trebor commented 5 months ago

An example of the use case is the chess bump chart example: any changes to the data transformation in the data loader require a full download of all of the data.

espinielli commented 4 months ago

This seems to point to something similar to having a dependency graph, like the targets 📦 in R. And the dependency tracking is needed not only for the data loaders but also for assets (computational cells are already covered, aren't they?)
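If Framework ever kept such a graph, the circular-dependency check Mike mentioned could be a plain depth-first search. A sketch, with a made-up graph shape (loader path → paths of upstream loaders it fetches from); Framework does not expose such a graph today.

```javascript
// Sketch: cycle detection over a hypothetical loader dependency graph
// (loader path -> paths of upstream loaders it fetches from). The graph
// shape is made up for illustration; Framework does not expose one today.
function findCycle(deps) {
  const state = new Map(); // "visiting" | "done"
  const visit = (node, path) => {
    if (state.get(node) === "visiting") return [...path, node]; // back edge found
    if (state.get(node) === "done") return null;
    state.set(node, "visiting");
    for (const upstream of deps.get(node) ?? []) {
      const cycle = visit(upstream, [...path, node]);
      if (cycle) return cycle;
    }
    state.set(node, "done");
    return null;
  };
  for (const node of deps.keys()) {
    const cycle = visit(node, []);
    if (cycle) return cycle;
  }
  return null;
}

const ok = new Map([["analysis.json.js", ["raw.json.sh"]], ["raw.json.sh", []]]);
const bad = new Map([["a.json.js", ["b.json.js"]], ["b.json.js", ["a.json.js"]]]);
console.log(findCycle(ok)); // null
console.log(findCycle(bad)); // ["a.json.js", "b.json.js", "a.json.js"]
```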

Fil commented 4 months ago

Tangentially related to #918.

palewire commented 2 weeks ago

> an example of the use case in the chess bump chart example. any changes to the data transformation in the data loader require a full download of all of the data.

I'm not seeing this implemented in the chess bump example. Am I missing something?

mythmon commented 2 weeks ago

It's not implemented in the chess bump example. The example is a case where the data loader(s) would be improved if Framework gained this feature.

palewire commented 2 weeks ago

Gotcha. Here's my use case, for anyone interested.

I'd like Data Loader 1 to be a Python script that downloads a dataframe from S3, transforms the data with filter-y tricks, and then writes out a JSON file that's ready to serve.

Then Data Loader 2 would be a Node.js script that would open that very large JSON file, build a D3 graphic in a canvas object, and then write out a PNG file that could ultimately be served by the static site.
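A sketch of the shape of that pipeline, compressed into one runnable script: the S3 download (stage 1, which would really be Python) and the canvas/PNG rendering (stage 2, which would need something like node-canvas) are both stubbed, so only the JSON hand-off between the two hypothetical loaders is shown.

```javascript
// Sketch of the two-loader pipeline described above, compressed into one
// script so it runs standalone. The S3 download and the canvas/PNG
// rendering are stubbed; only the JSON hand-off is shown.
import {mkdtempSync, readFileSync, writeFileSync} from "node:fs";
import {join} from "node:path";
import {tmpdir} from "node:os";

const dir = mkdtempSync(join(tmpdir(), "pipeline-"));

// Stage 1 (in reality a Python loader): download from S3, filter, write JSON.
const rows = [{x: 1, y: 4}, {x: 2, y: 9}, {x: 3, y: 1}]; // stubbed S3 data
writeFileSync(join(dir, "points.json"), JSON.stringify(rows.filter((d) => d.y > 2)));

// Stage 2 (the Node.js loader): read stage 1's JSON and render a graphic.
const points = JSON.parse(readFileSync(join(dir, "points.json"), "utf8"));
// Rendering stub: scale y into pixel space instead of drawing to a canvas.
const height = 100;
const pixels = points.map((d) => ({x: d.x, y: height - d.y * 10}));
process.stdout.write(JSON.stringify(pixels));
```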