[Design Proposal] Vega/Vega-Lite as visualization dsl for Kubeflow metadata-ui

eterna2 commented 4 years ago

Background

Currently, kfp manages visualization through a collections of viewer components.

Ignoring viewers like markdown, html, tensorboard, etc, visualization in kfp can be separated into 2 groups:

pre-created vis by kfp developer with react-vis (usually via metadata-ui artifacts)
pre-defined and user-defined vis based on python (generated via jupyter notebook display function)

Proposal

Vega/Vega-Lite to be used as a visualization dsl for

specifying metadata-ui artifact, and
generating custom vis for data (that do not have a readily-available python lib for visualizing the data) - i.e. alternative to python custom vis.

Pros

language agnostic: Uses a JSON-based dsl to describe visualization

simple: Simple and concise grammar to generate most common visualization (esp Vega-Lite) Example: barchart

{
"$schema": "https://vega.github.io/schema/vega-lite/v4.json",
"description": "A simple bar chart with embedded data.",
"data": {
"values": [
  {"a": "A", "b": 28}, {"a": "B", "b": 55}, {"a": "C", "b": 43},
  {"a": "D", "b": 91}, {"a": "E", "b": 81}, {"a": "F", "b": 53},
  {"a": "G", "b": 19}, {"a": "H", "b": 87}, {"a": "I", "b": 52}
]
},
"mark": "bar",
"encoding": {
"x": {"field": "a", "type": "ordinal"},
"y": {"field": "b", "type": "quantitative"}
}
}

supports multiple data format: e.g. csv, tsv, geojson/topojson (for maps), json, etc
supports multiple/custom loader types: e.g. http request, inlined, data stream, etc
composable: Vega dsl is designed to be composable, which makes it easy to create visualization from existing vis components - i.e. easy to wrap the dsl as composable vis components in UI.
can be implemented entirely on client-side: Do not need to have any backend service to generate the html for the visualization

Cons

Non I can think of.

Concept Details

Propose for mlpipeline-ui-metadata.json artifact to support Vega and Vega-Lite:

i.e.

{
  "version": 1,
  "outputs": [
    {
      "type": "vega-lite",
      "data": { "my-matrix": "my-dir/my-matrix.csv" },  // data to be passed to vega spec to be rendered
      "spec": { ... },  // vega or vega-lite spec or reference pre-defined vega/vega-lite specs
    }
  ]
}

Replace existing vis component created with react-vis with react-vega:

It is easier to map existing ui-metadata schema to a vega-lite spec then generate the corresponding vis component, rather than implementing individual vis component, each time we need to support a new vis.

import { Vega } from 'react-vega';

export const RocCurve = props => {
  const spec = {...}  // pre-defined spec
  return <Vega spec={spec} data={props.data} />
}

Enhance custom vis creator to support Vega/Vega-Lite
- provides a simple UI to edit Vega/Vega-Lite spec and renders the vis (does not need a backend)

See https://vega.github.io/editor/#/examples/vega-lite/airport_connections

Bobgy commented 4 years ago

Thanks for the suggestion. This is a great idea! I have some concerns about integrating vega as a first party visualization:

How popular is it among data scientists?
How big is bundle size, do we need to include both vega and vega-lite?

Would it be enough if we provide some documentation of using them with embeded HTML? e.g. https://vega.github.io/vega-lite/usage/embed.html#start-using-vega-lite-with-vega-embed After my recent change of supporting inline HTML visualization in https://github.com/kubeflow/pipelines/pull/3177, I think it's fairly straight forward to generate a HTML using vega without any change from KFP. And we can also make a python wrapper for that, if there isn't already one.

eterna2 commented 4 years ago

How popular is it among data scientists?

Hard to quantify it's popularity as there are too many vis tools out there.

It is probably more an engineer tool rather than a data scientist tool - as in it is generally used more as a specification for vis (i.e. instead of saving png of the charts, you save the spec together with ur experiment metadata, params, dataset, etc).

For data scientist, they probably use an abstraction layer on top of it, e.g. altair or py-vega.

How big is bundle size, do we need to include both vega and vega-lite?

Fairly big as it is quite comprehensive. 165 kb for Vega. Not sure about Vega lite. U only need Vega lite to transpile Vega lite spec into Vega spec.

U probably can do tree shaking to remove features u don't intend to support.

If size is a concern, we can do server side rendering for the vis. Vega can output svg, png/jpeg, or data url.

Would it be enough if we provide some documentation of using them with embeded HTML?

That would work, but that adds overheads to data scientist. Or do u think it is a better solution for me to add a ui-metadata sdk?

Because one of the issue I have is that I always have to search for the format of ui-metadata and where to store it, and how to generate it for my kfp operator. In the ideal world, I would prefer a simple sdk to generate whatever vis I wants, without needing to know the actual io.

import kfp.dsl

from kfp.dsl.vis import ConfusionMatrix, WebVis

@kfp.dsl.pipeline()
def some_pipeline():
    op = some_op()
    conf_mat = ConfusionMatrix(..., source= op.outputs.data1)
    // Or
    op.add_vis(WebVis(html=some_html_creator_func))

It is probably not important enough a justification to switch to Vega unless kubeflow is going to provider a richer sets of vis.

But I like Vega particularly because the grammar is elegant and easy to remember. And switching between diff vis for same data is quite trivial - i.e. because it is composition rather than templates (unlike many other solutions, diff chart types have diff params - Vega has very good separation).

Tldr Essentially, I wanted to package vis artifacts as Vega spec together with data artifacts - i.e. vis should have it's own consistent specification, and shld be stored just like data artifacts.

And the front end can render vis artifacts as is, w/o much additional work.

And these vis artifacts can be used at different parts of kubeflow or other apps because Vega spec can be used as a common standard for vis artifacts - i.e. it is easy to render Vega charts with provided spec.

Currently, there is no consistent std for vis in kubeflow as there is a mix of solutions - from dynamically generated py vis, to custom format for specific vis (e.g. roc, confusion matrix, etc), to html web app.

Alternatively, we can consider a separate vis service for kubeflow with its own crd - generates the required vis from a rest or grpc service.

eterna2 commented 4 years ago

Something like this

I have a simple cloud function to render my chart (that takes a data from a http source) as a png.

Or as a web app link Link

Bobgy commented 4 years ago

I think my main argument is that, vega support is a feature that can be made convenient completely by 3rd party library/components, so I'm not seeing strong enough reason to integrate it in KFP system.

Because there are also other visualization libraries and new libraries coming out.

The only exception is -- If we re-implement or introduce new first-party visualizations using vega directly, then it's probably worth it to support vega json spec directly though.

eterna2 commented 4 years ago

yeah, i agree with you on that. I probably can build it as an extension/plugin outside of kfp.

But what do u think about my suggestion of adding a vis sdk in the kfp dsl? Not about vega but more on my pain with /mlpipeline-ui-metadata.json.

Cuz the biggest pain point for me when creating kfp ops is remember the path and the format (i.e. how to populate this json).

I am proposing to add a kfp.dsl.vis module which can either:

append a vis to the op inside a pipeline, or
or serve as a lib like tf.io to generate the /mlpipeline-ui-metadata.json inside the op itself.

Maybe I will do an actual mvp as a kfp.contrib to demonstrate my idea.

Bobgy commented 4 years ago

@eterna2 I'm no expert on sdk, but personally it also took me some hard time to figure out how to write metadata with sdk, so I'd prefer sdk have builtin support.

/cc @Ark-kun /cc @hongye-sun /cc @numerology for sdk related proposal

eterna2 commented 4 years ago

Ok, I created a kfx package at https://github.com/e2fyi/kfx/ to demonstrate my idea.

It works now, although I feel it is abit convoluted.

In this example, I am using ArtfactLocationHelper to modify the kfp task with env variables that contains metadata about the Argo configs (needs to be inputted by user)

then I use KfpArtifact to retrieve these metadata to generate the url (to be used as source for mlpipeline-ui-metadata) to the artifact as well as the API call to UI to get the artifact (for data loading in Vega).

This is bad mostly because it relies on the user to know the Argo configmap and to set it.

I would prefer the UI artifact API to be able to support workflow.name or some identifier - i.e. instead of just source, bucket and key, we can support workflow.name + artifact name, where these can be used to retrieve the necessary info to get the artifact (similar to what I did to get the pod logs from Argo artifactory).

This removes the need for user to know anything about Argo. And we can meta-declare a source or url to be an artifact generated by kfp tasks.

import kfp.components
import kfp.dsl
import kfx.dsl

# creates the helper that has the argo configs (tells you how artifacts will be stored)
# see https://github.com/argoproj/argo/blob/master/docs/workflow-controller-configmap.yaml
helper = kfx.dsl.ArtifactLocationHelper(
    scheme="minio", bucket="mlpipeline", key_prefix="artifacts/"
)

@kfp.components.func_to_container_op
def test_op(
    mlpipeline_ui_metadata: OutputTextFile(str), markdown_data_file: OutputTextFile(str)
):
    "A test kubeflow pipeline task."

    import json

    import kfx.dsl
    import kfx.vis
    import kfx.vis.vega

    data = [
        {"a": "A", "b": 28},
        {"a": "B", "b": 55},
        {"a": "C", "b": 43},
        {"a": "D", "b": 91},
        {"a": "E", "b": 81},
        {"a": "F", "b": 53},
        {"a": "G", "b": 19},
        {"a": "H", "b": 87},
        {"a": "I", "b": 52},
    ]
    vega_data_file.write(json.dumps(data))

    # `KfpArtifact` provides the reference to data artifact created
    # inside this task
    spec = {
        "$schema": "https://vega.github.io/schema/vega-lite/v4.json",
        "description": "A simple bar chart",
        "data": {
            "url": kfx.dsl.KfpArtifact("vega_data_file"),
            "format": {"type": "json"},
        },
        "mark": "bar",
        "encoding": {
            "x": {"field": "a", "type": "ordinal"},
            "y": {"field": "b", "type": "quantitative"},
        },
    }

    # write the markdown to the `markdown-data` artifact
    markdown_data_file.write("### hello world")

    # creates an ui metadata object
    ui_metadata = kfx.vis.kfp_ui_metadata(
        # Describes the vis to generate in the kubeflow pipeline UI.
        [
            # markdown vis from a markdown artifact.
            # `KfpArtifact` provides the reference to data artifact created
            # inside this task
            kfx.vis.markdown(kfx.dsl.KfpArtifact("markdown_data_file")),
            # a vega web app from the vega data artifact.
            kfx.vis.vega.vega_web_app(spec),
        ]
    )

    # writes the ui metadata object as the `mlpipeline-ui-metadata` artifact
    mlpipeline_ui_metadata.write(kfx.vis.asjson(ui_metadata))

    # prints the uri to the markdown artifact
    print(ui_metadata.outputs[0].source)

@kfp.dsl.pipeline()
def test_pipeline():
    "A test kubeflow pipeline"

    op: kfp.dsl.ContainerOp = test_op()

    # modify kfp operator with artifact location metadata through env vars
    op.apply(helper.set_envs())

eterna2 commented 4 years ago

I also written the pydantic data models for the mlpipeline-ui-metadata and generated the corresponding json scheme for the file.

Bobgy commented 4 years ago

@eterna2 Looks great! A quick question: is it a requirement to store visualization data in external source? If you just store it inline inside mlpipeline-ui-metadata, then user doesn't need to know so much other context.

I guess you have good reasons to do so, just wanting to know.

eterna2 commented 4 years ago

Depends on the data size. For small dataset, we probably can inline.

Becuz my prev use cases are mostly geospatial simulation which generates quite a bit of logs.

And usually we want to store these logs separately.

eterna2 commented 4 years ago

But I agree that inline shld solve 90% of the use cases. And probably a better solution. Did not think of that.

Probably, I can generate multiple "baked" vis, separately from the logs.

eterna2 commented 4 years ago

In this case, I probably can provide helper classes to convert sklearn confusion matrix etc into inline UI metadata.

Bobgy commented 4 years ago

I see, external source is definitely needed if data size is huge or ACL is needed on the data. That seems like complexity unrelated to KFP. A helper sdk will be helpful in this case.

Also for inline cases, helper is as good as 1st party integrations.

eterna2 commented 4 years ago

I encountered some issues getting my artifacts to work inside the iframe as the iframe does not provide allow-same-origin permission - request origin = null.

Aka I can't make any request to the node server.

The only 3 solutions I see are:

only support inlined data
add CORs flag to the node server (as a param)
add allow-same-origin permission to the iframe.

Both seems equally risky from security pov.

Bobgy commented 4 years ago

Yes, the visualization is expected to only access inline data and open data for security reasons.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] commented 4 years ago

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

kubeflow / pipelines