queryverse / VegaLite.jl

Julia bindings to Vega-Lite

Julia-side data processing? #283

Open · tkf opened this issue 4 years ago

tkf commented 4 years ago

I think it'd be nice if we could do some data processing on the Julia side. It would be implemented as a spec-to-spec conversion. For example, we could compute a histogram on the Julia side and send the binned data to Vega-Lite. Getting the binning logic exactly equivalent to the one implemented in Vega-Lite is likely very hard, but I think that's OK as long as this is exposed as an explicit opt-in feature. This would be beneficial, for example, when the input data is large (e.g., doesn't fit in memory) or is represented as a "lazy table" (e.g., the output of Query.jl, etc.).
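To make this concrete, here is a rough sketch of what such a pass could look like, using plain `Dict`s for the spec, StatsBase for the binning, and Vega-Lite's pre-binned data convention (`bin: {binned: true}` together with `x`/`x2`). The name `prebin_histogram` and the `*_start`/`*_end` field names are made up for illustration, and the bin edges will not match Vega-Lite's own binning exactly:

```julia
# Rough sketch of a spec-to-spec pass: bin on the Julia side and emit a spec
# that only contains the already-binned data. `prebin_histogram` and the
# `*_start` / `*_end` field names are made up for this example.
using StatsBase

function prebin_histogram(values::AbstractVector, field::AbstractString; nbins = 10)
    h = fit(Histogram, values; nbins = nbins)
    edges = collect(h.edges[1])
    rows = [
        Dict(field * "_start" => edges[i],
             field * "_end"   => edges[i + 1],
             "count"          => h.weights[i])
        for i in eachindex(h.weights)
    ]
    return Dict(
        "data" => Dict("values" => rows),
        "mark" => "bar",
        "encoding" => Dict(
            "x"  => Dict("field" => field * "_start", "type" => "quantitative",
                         "bin" => Dict("binned" => true)),
            "x2" => Dict("field" => field * "_end"),
            "y"  => Dict("field" => "count", "type" => "quantitative"),
        ),
    )
end

spec = prebin_histogram(randn(10_000), "x")
```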

Unfortunately, the current internals of VegaLite.jl do not allow this because the table is eagerly materialized. It'd be nice if the data property were left untouched until rendering time.

davidanthoff commented 4 years ago

> Unfortunately, the current internals of VegaLite.jl do not allow this because the table is eagerly materialized. It'd be nice if the data property were left untouched until rendering time.

That is actually no longer correct: the current implementation represents the data as a DataValuesNode that only gets iterated once the spec is converted to JSON, i.e. what you propose here :)

We should look more carefully into this, I agree. The Vega team has worked a fair bit on ideas where the processing is pushed out of the JavaScript as well; I think some repos around that are https://github.com/vega/scalable-vega, https://github.com/omnisci/vega-transform-omnisci-core, https://github.com/vega/vg-transforms2sql, https://github.com/vega/vega-lite-transforms2sql, https://github.com/uwdata/falcon.

tkf commented 4 years ago

IIUC you still convert the data to columns, right?

https://github.com/queryverse/VegaLite.jl/blob/c66278ce2010462c3864f8dfd2264ed6a44c536a/src/spec_utils.jl#L48-L49

This does not seem to cover representations such as an iterator-of-rows or a Transducers.eduction. OK, the latter is probably only used by me :smile:, and I can imagine that an iterator-of-rows could still expose non-copying column views. But this seems to be an extra layer of indirection that is hard for external spec-to-spec conversion tools to undo.

Why not just "process" the data property only at render time and leave it as-is when the spec is constructed? (Though I think it's still a good idea to validate it whenever it is safe to do so.)
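A very rough sketch of what I mean; the names here (`LazyDataNode`, `materialize`, `render_json`) are made-up placeholders, not existing VegaLite.jl API. The source is stored untouched and only iterated when the spec is rendered to JSON:

```julia
# Placeholder sketch: keep the user's source as-is in the spec and only
# iterate it when the spec is rendered to JSON. Only the top-level "data"
# entry is handled; a real implementation would recurse into sub-specs.
using JSON, Tables

struct LazyDataNode
    source::Any  # any Tables.jl-compatible source, row iterator, query, ...
end

materialize(node::LazyDataNode) = Dict(
    "values" => [
        Dict(String(n) => Tables.getcolumn(row, n) for n in Tables.columnnames(row))
        for row in Tables.rows(node.source)
    ],
)

render_json(spec::AbstractDict) =
    JSON.json(Dict(k => (v isa LazyDataNode ? materialize(v) : v) for (k, v) in spec))

# The table is not touched until render_json is called.
spec = Dict("data" => LazyDataNode((a = 1:3, b = [2.0, 4.0, 8.0])),
            "mark" => "point",
            "encoding" => Dict("x" => Dict("field" => "a"), "y" => Dict("field" => "b")))
json_str = render_json(spec)
```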

> The Vega team has worked a fair bit on ideas where the processing is pushed out of the JavaScript as well

Thanks a lot for sharing the links! Do you think it is possible to fuse processing defined in Julia into these pipelines? I suppose you'd still need to materialize it to some in-memory shareable representation (e.g., Arrow)? Though maybe that's totally fine for many use cases.

davidanthoff commented 4 years ago

Ah, you're right, I still materialize the data at the beginning. I think that's mainly because there are many situations where a spec is turned into JSON multiple times in a row (e.g. in JupyterLab), and I didn't want to end up iterating the source two or more times, in particular not with sources from Query.jl where iteration might be expensive. I guess we could move to something where the first conversion to JSON triggers the iteration and the results then get cached, but that seems quite a bit more involved...

> Thanks a lot for sharing the links! Do you think it is possible to fuse processing defined in Julia into these pipelines? I suppose you'd still need to materialize it to some in-memory shareable representation (e.g., Arrow)? Though maybe that's totally fine for many use cases.

Good question... I'm not sure. One tricky thing is that we have so many different clients, and the situation is quite different for each of them (Node, Electron, VS Code, JupyterLab); sometimes everything is on one machine, but in other scenarios things even get split between machines.

Broadly, my next hope/idea for large datasets had been that we would materialize the data early into an Arrow buffer and then maybe even try to put that into a shared memory location so that we don't have to copy it between the Julia and Node processes. That won't work with all clients, though... But of course the Arrow package is currently in the middle of a major rewrite. It has been for a long time, so I'm not sure whether that will finish at some point.
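For what it's worth, here is a minimal sketch of the "materialize early into an Arrow buffer" part, assuming the post-rewrite Arrow.jl API (`Arrow.write` / `Arrow.Table`); the shared-memory handoff to the Node process is left out:

```julia
# Minimal sketch, assuming the post-rewrite Arrow.jl API. The resulting byte
# buffer could in principle be placed in shared memory and handed to the
# Node process without another copy.
using Arrow

function to_arrow_buffer(table)
    io = IOBuffer()
    Arrow.write(io, table)  # serialize the table in the Arrow IPC format
    return take!(io)
end

bytes = to_arrow_buffer((a = 1:3, b = ["x", "y", "z"]))
tbl = Arrow.Table(bytes)    # view over the buffer on the Julia side
```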

tkf commented 4 years ago

> I guess we could move to something where the first conversion to JSON triggers the iteration and the results then get cached, but that seems quite a bit more involved...

How about a simple approach like this?

import JSON

# Cache the serialized bytes on the first conversion to JSON so that the
# source in `values` is iterated at most once, even when the spec is
# rendered repeatedly.
struct DataValuesNode
    values::Any
    json::Vector{UInt8}
end

function JSON.Writer.show_json(io::JSON.Writer.SC, ::JSON.Writer.CS, d::DataValuesNode)
    if isempty(d.json)
        # First render: serialize `values` into the (initially empty) byte buffer.
        # `our_show_json` stands for the existing value-serialization logic.
        our_show_json(IOBuffer(d.json; write = true), d.values)
    end
    write(io, d.json)
end

> Broadly, my next hope/idea for large datasets had been that we would materialize the data early into an Arrow buffer and then maybe even try to put that into a shared memory location so that we don't have to copy it between the Julia and Node processes.

I agree this is a great direction! At the same time, I think it'd also be nice to expose a more fundamental aspect of Vega-Lite: it's an "intermediate representation" for data visualization. Since a Vega-Lite spec is just data, it can be manipulated within Julia like any other object.
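For example (just a sketch with plain `Dict`s rather than the actual `VLSpec` type), a generic bottom-up rewriter plus an ordinary Julia function already gives a spec-to-spec pass:

```julia
# Sketch: a Vega-Lite spec is just nested Dicts/Vectors, so ordinary Julia
# functions can act as spec-to-spec passes. `rewrite` applies `f` bottom-up.
rewrite(f, x) = f(x)
rewrite(f, x::AbstractDict) = f(Dict(k => rewrite(f, v) for (k, v) in x))
rewrite(f, x::AbstractVector) = f([rewrite(f, v) for v in x])

# Example pass: make every quantitative scale include zero.
zero_axes(x) = x
function zero_axes(x::AbstractDict)
    if get(x, "type", nothing) == "quantitative"
        return merge(x, Dict("scale" => Dict("zero" => true)))
    else
        return x
    end
end

spec = Dict(
    "mark" => "point",
    "encoding" => Dict(
        "x" => Dict("field" => "a", "type" => "quantitative"),
        "y" => Dict("field" => "b", "type" => "quantitative"),
    ),
)
new_spec = rewrite(zero_axes, spec)
```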