Thanks for the suggestion. This is a great idea! I have some concerns about integrating Vega as a first-party visualization:
Would it be enough if we provide some documentation on using them with embedded HTML? e.g. https://vega.github.io/vega-lite/usage/embed.html#start-using-vega-lite-with-vega-embed After my recent change supporting inline HTML visualization in https://github.com/kubeflow/pipelines/pull/3177, I think it's fairly straightforward to generate HTML using Vega without any change to KFP. And we can also make a Python wrapper for that, if there isn't already one.
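For illustration, a minimal sketch of what such a Python wrapper might do, following the vega-embed usage docs linked above (make_vega_html is a hypothetical name, not an existing KFP or Vega API):

import json

VEGA_HTML_TEMPLATE = """<!DOCTYPE html>
<html>
  <head>
    <script src="https://cdn.jsdelivr.net/npm/vega@5"></script>
    <script src="https://cdn.jsdelivr.net/npm/vega-lite@4"></script>
    <script src="https://cdn.jsdelivr.net/npm/vega-embed@6"></script>
  </head>
  <body>
    <div id="vis"></div>
    <script>vegaEmbed("#vis", SPEC);</script>
  </body>
</html>"""

def make_vega_html(spec: dict) -> str:
    # Returns a self-contained HTML page that renders the given Vega/Vega-Lite
    # spec with vega-embed loaded from a CDN; the page can be written out as a
    # web-app/HTML artifact without any change to KFP itself.
    return VEGA_HTML_TEMPLATE.replace("SPEC", json.dumps(spec))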
> How popular is it among data scientists?
Hard to quantify its popularity, as there are too many vis tools out there.
It is probably more of an engineer's tool than a data scientist's tool, in that it is generally used as a specification for vis (i.e. instead of saving a PNG of the charts, you save the spec together with your experiment metadata, params, dataset, etc.).
Data scientists probably use an abstraction layer on top of it, e.g. altair or py-vega.
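For comparison, this is roughly what that abstraction layer looks like with altair (a real library; the snippet is illustrative only and not part of the original discussion):

import altair as alt
import pandas as pd

# a tiny dataset and a bar chart; altair compiles this to a Vega-Lite spec
df = pd.DataFrame({"a": ["A", "B", "C"], "b": [28, 55, 43]})
chart = alt.Chart(df).mark_bar().encode(x="a", y="b")

# the underlying Vega-Lite spec, which can be saved alongside the experiment
spec = chart.to_dict()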
> How big is the bundle size? Do we need to include both Vega and Vega-Lite?
Fairly big, as it is quite comprehensive: 165 kB for Vega. Not sure about Vega-Lite. You only need Vega-Lite to transpile a Vega-Lite spec into a Vega spec.
You can probably do tree shaking to remove features you don't intend to support.
If size is a concern, we can do server-side rendering for the vis. Vega can output SVG, PNG/JPEG, or a data URL.
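A sketch of that server-side option using the third-party vl-convert-python package (my choice for illustration; the thread does not name a specific renderer):

import vl_convert as vlc

spec = {
    "$schema": "https://vega.github.io/schema/vega-lite/v4.json",
    "data": {"values": [{"a": "A", "b": 28}, {"a": "B", "b": 55}]},
    "mark": "bar",
    "encoding": {
        "x": {"field": "a", "type": "ordinal"},
        "y": {"field": "b", "type": "quantitative"},
    },
}

svg = vlc.vegalite_to_svg(vl_spec=spec)           # str, ready to serve as-is
png = vlc.vegalite_to_png(vl_spec=spec, scale=2)  # bytes
with open("chart.png", "wb") as f:
    f.write(png)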
> Would it be enough if we provide some documentation of using them with embedded HTML?
That would work, but it adds overhead for data scientists. Or do you think it is a better solution for me to add a ui-metadata SDK?
Because one of the issues I have is that I always have to search for the format of ui-metadata, where to store it, and how to generate it for my kfp operator. In an ideal world, I would prefer a simple SDK to generate whatever vis I want, without needing to know the actual I/O:
import kfp.dsl
from kfp.dsl.vis import ConfusionMatrix, WebVis

@kfp.dsl.pipeline()
def some_pipeline():
    op = some_op()
    conf_mat = ConfusionMatrix(..., source=op.outputs.data1)
    # Or
    op.add_vis(WebVis(html=some_html_creator_func))
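For context, this is the raw mlpipeline-ui-metadata.json that components have to write today (format per the KFP output-viewer docs); remembering this path and shape is exactly what the proposed SDK would hide:

import json

metadata = {
    "outputs": [
        {
            "type": "markdown",
            "storage": "inline",
            "source": "### hello world",
        }
    ]
}

# KFP's UI discovers the viewers by reading this well-known path in the pod.
with open("/mlpipeline-ui-metadata.json", "w") as f:
    json.dump(metadata, f)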
It is probably not important enough a justification to switch to Vega unless kubeflow is going to provide a richer set of visualizations.
But I like Vega particularly because the grammar is elegant and easy to remember. And switching between different visualizations for the same data is quite trivial, because it is composition rather than templates (in many other solutions, different chart types have different params; Vega has very good separation).
TL;DR: Essentially, I want to package vis artifacts as Vega specs together with data artifacts, i.e. vis should have its own consistent specification, and should be stored just like data artifacts.
Then the frontend can render vis artifacts as-is, without much additional work.
And these vis artifacts can be used in different parts of kubeflow or other apps, because the Vega spec can serve as a common standard for vis artifacts, i.e. it is easy to render Vega charts from a provided spec.
Currently, there is no consistent standard for vis in kubeflow; there is a mix of solutions, from dynamically generated Python vis, to custom formats for specific vis (e.g. ROC, confusion matrix, etc.), to HTML web apps.
Alternatively, we can consider a separate vis service for kubeflow with its own CRD, which generates the required vis from a REST or gRPC service.
Something like this:
I have a simple cloud function to render my chart (which takes data from an HTTP source) as a PNG.
Or as a web app link: Link
I think my main argument is that Vega support is a feature that can be made convenient entirely by a 3rd-party library/component, so I'm not seeing a strong enough reason to integrate it into the KFP system, especially since there are also other visualization libraries, with new ones coming out all the time.
The only exception: if we re-implement or introduce new first-party visualizations using Vega directly, then it's probably worth it to support the Vega JSON spec directly.
Yeah, I agree with you on that. I probably can build it as an extension/plugin outside of kfp.
But what do you think about my suggestion of adding a vis SDK in the kfp dsl? Not about Vega, but more about my pain with /mlpipeline-ui-metadata.json.
Because the biggest pain point for me when creating kfp ops is remembering the path and the format (i.e. how to populate this JSON).
I am proposing to add a kfp.dsl.vis module which can either declare the vis in the pipeline dsl (as sketched above), or provide an io helper (similar to tf.io) to generate the /mlpipeline-ui-metadata.json inside the op itself. Maybe I will do an actual MVP as a kfp.contrib to demonstrate my idea.
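A minimal sketch of the second option (every name here is hypothetical; kfp.dsl.vis does not exist):

import json

class UiMetadataWriter:
    # Hypothetical tf.io-style writer that knows the well-known path and
    # format, so the op author doesn't have to.

    def __init__(self, path="/mlpipeline-ui-metadata.json"):
        self.path = path
        self.outputs = []

    def add_markdown(self, text):
        self.outputs.append({"type": "markdown", "storage": "inline", "source": text})
        return self

    def write(self):
        with open(self.path, "w") as f:
            json.dump({"outputs": self.outputs}, f)

# inside an op:
UiMetadataWriter().add_markdown("### hello world").write()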
@eterna2 I'm no expert on the SDK, but personally it also took me quite some time to figure out how to write metadata with the SDK, so I'd prefer the SDK to have builtin support.
/cc @Ark-kun /cc @hongye-sun /cc @numerology for sdk related proposal
Ok, I created a kfx package at https://github.com/e2fyi/kfx/ to demonstrate my idea.
It works now, although I feel it is a bit convoluted.
In this example, I am using:
- ArtifactLocationHelper to modify the kfp task with env variables that contain metadata about the Argo configs (which need to be provided by the user), and
- KfpArtifact to retrieve these metadata and generate both the url to the artifact (to be used as the source in mlpipeline-ui-metadata) and the API call to the UI to get the artifact (for data loading in Vega).
This is bad mostly because it relies on the user to know the Argo configmap and to set it.
I would prefer the UI artifact API to support workflow.name or some identifier, i.e. instead of just source, bucket and key, we could support workflow.name + artifact name, where these can be used to retrieve the necessary info to get the artifact (similar to what I did to get the pod logs from the Argo artifactory). This removes the need for the user to know anything about Argo, and we could meta-declare a source or url to be an artifact generated by kfp tasks.
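Concretely, the two addressing schemes side by side (the first reflects the UI's existing artifacts/get endpoint; the second, workflow-based form is the proposal and is hypothetical):

# what KfpArtifact has to reconstruct today from the Argo artifact-repository
# config (scheme, bucket, key prefix) plus the workflow/pod names:
current = (
    "artifacts/get?source=minio&bucket=mlpipeline"
    "&key=artifacts%2F{workflow}%2F{pod}%2Fvega-data.tgz"
)

# proposed: the UI resolves the storage details from the workflow itself,
# so the user never needs to know the Argo configmap (hypothetical API):
proposed = "artifacts/get?workflow={workflow}&artifact=vega-data"

The full kfx example: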
import kfp.components
import kfp.dsl
import kfx.dsl
from kfp.components import OutputTextFile

# creates the helper that has the argo configs (tells you how artifacts will be stored)
# see https://github.com/argoproj/argo/blob/master/docs/workflow-controller-configmap.yaml
helper = kfx.dsl.ArtifactLocationHelper(
    scheme="minio", bucket="mlpipeline", key_prefix="artifacts/"
)

@kfp.components.func_to_container_op
def test_op(
    mlpipeline_ui_metadata: OutputTextFile(str),
    markdown_data_file: OutputTextFile(str),
    vega_data_file: OutputTextFile(str),  # declared so the vega data below has an output artifact
):
    "A test kubeflow pipeline task."
    import json

    import kfx.dsl
    import kfx.vis
    import kfx.vis.vega

    data = [
        {"a": "A", "b": 28},
        {"a": "B", "b": 55},
        {"a": "C", "b": 43},
        {"a": "D", "b": 91},
        {"a": "E", "b": 81},
        {"a": "F", "b": 53},
        {"a": "G", "b": 19},
        {"a": "H", "b": 87},
        {"a": "I", "b": 52},
    ]

    # writes the data to the `vega-data` artifact
    vega_data_file.write(json.dumps(data))

    # `KfpArtifact` provides the reference to the data artifact created
    # inside this task
    spec = {
        "$schema": "https://vega.github.io/schema/vega-lite/v4.json",
        "description": "A simple bar chart",
        "data": {
            "url": kfx.dsl.KfpArtifact("vega_data_file"),
            "format": {"type": "json"},
        },
        "mark": "bar",
        "encoding": {
            "x": {"field": "a", "type": "ordinal"},
            "y": {"field": "b", "type": "quantitative"},
        },
    }

    # write the markdown to the `markdown-data` artifact
    markdown_data_file.write("### hello world")

    # creates a ui metadata object
    ui_metadata = kfx.vis.kfp_ui_metadata(
        # Describes the vis to generate in the kubeflow pipeline UI.
        [
            # markdown vis from a markdown artifact.
            # `KfpArtifact` provides the reference to the data artifact created
            # inside this task
            kfx.vis.markdown(kfx.dsl.KfpArtifact("markdown_data_file")),
            # a vega web app from the vega data artifact.
            kfx.vis.vega.vega_web_app(spec),
        ]
    )

    # writes the ui metadata object as the `mlpipeline-ui-metadata` artifact
    mlpipeline_ui_metadata.write(kfx.vis.asjson(ui_metadata))

    # prints the uri to the markdown artifact
    print(ui_metadata.outputs[0].source)

@kfp.dsl.pipeline()
def test_pipeline():
    "A test kubeflow pipeline"
    op: kfp.dsl.ContainerOp = test_op()

    # modify the kfp operator with artifact location metadata through env vars
    op.apply(helper.set_envs())
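For completeness, the pipeline compiles with the standard kfp v1 compiler (nothing kfx-specific here):

import kfp.compiler

kfp.compiler.Compiler().compile(test_pipeline, "test_pipeline.yaml")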
I have also written the pydantic data models for mlpipeline-ui-metadata and generated the corresponding JSON schema for the file.
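Roughly, such models might look like this (a sketch; the actual kfx models may differ):

from enum import Enum
from typing import List, Optional

from pydantic import BaseModel

class ViewerType(str, Enum):
    markdown = "markdown"
    web_app = "web-app"
    confusion_matrix = "confusion_matrix"
    roc = "roc"
    table = "table"
    tensorboard = "tensorboard"

class UiMetadataOutput(BaseModel):
    type: ViewerType
    source: str
    storage: Optional[str] = None  # e.g. "inline"
    format: Optional[str] = None   # e.g. "csv"

class UiMetadata(BaseModel):
    outputs: List[UiMetadataOutput]

# pydantic can then emit the JSON schema for the file:
print(UiMetadata.schema_json(indent=2))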
@eterna2 Looks great! A quick question: is it a requirement to store visualization data in an external source? If you just store it inline inside mlpipeline-ui-metadata, then the user doesn't need to know so much other context.
I guess you have good reasons to do so, just wanting to know.
It depends on the data size. For small datasets, we probably can inline.
Because my previous use cases were mostly geospatial simulations, which generate quite a bit of logs.
And usually we want to store these logs separately.
But I agree that inline should solve 90% of the use cases, and is probably a better solution. I did not think of that.
Probably, I can generate multiple "baked" vis, separately from the logs.
In this case, I probably can provide helper classes to convert sklearn confusion matrices etc. into inline UI metadata.
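Such a helper might look like this (a sketch; the function name and the inline-storage assumption are mine, while the confusion_matrix viewer schema follows the KFP output-viewer docs):

import json

from sklearn.metrics import confusion_matrix

def confusion_matrix_metadata(y_true, y_pred, labels):
    # converts a sklearn confusion matrix into an inline confusion_matrix
    # ui-metadata entry (assumes the viewer accepts storage="inline")
    matrix = confusion_matrix(y_true, y_pred, labels=labels)
    rows = [
        "{},{},{}".format(labels[i], labels[j], matrix[i][j])
        for i in range(len(labels))
        for j in range(len(labels))
    ]
    return {
        "type": "confusion_matrix",
        "format": "csv",
        "schema": [
            {"name": "target", "type": "CATEGORY"},
            {"name": "predicted", "type": "CATEGORY"},
            {"name": "count", "type": "NUMBER"},
        ],
        "storage": "inline",
        "source": "\n".join(rows),
    }

entry = confusion_matrix_metadata(["cat", "dog", "dog"], ["cat", "dog", "cat"], ["cat", "dog"])
with open("/mlpipeline-ui-metadata.json", "w") as f:
    json.dump({"outputs": [entry]}, f)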
I see, an external source is definitely needed if the data size is huge or an ACL is needed on the data. That seems like complexity unrelated to KFP; a helper SDK will be helpful in this case.
Also, for inline cases, a helper is as good as a 1st-party integration.
I encountered some issues getting my artifacts to work inside the iframe, as the iframe does not grant the allow-same-origin permission (request origin = null), i.e. I can't make any request to the node server.
The only solutions I see are:
Both seem equally risky from a security point of view.
Yes, the visualization is expected to only access inline data and open data for security reasons.
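In practice that means inlining the dataset in the spec itself instead of pointing at an artifact URL, so the sandboxed iframe never needs to make a request, e.g.:

spec = {
    "$schema": "https://vega.github.io/schema/vega-lite/v4.json",
    "mark": "bar",
    # data is inlined as values rather than fetched via "url"
    "data": {"values": [{"a": "A", "b": 28}, {"a": "B", "b": 55}]},
    "encoding": {
        "x": {"field": "a", "type": "ordinal"},
        "y": {"field": "b", "type": "quantitative"},
    },
}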
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.
Background
Currently, kfp manages visualization through a collection of viewer components. Ignoring viewers like markdown, html, tensorboard, etc., visualization in kfp can be separated into 2 groups:
- visualizations rendered by the frontend with react-vis (usually via ui-metadata artifacts)
- visualizations generated dynamically in Python (via the display function)
function)Proposal
Vega/Vega-Lite to be used as a visualization dsl for
Pros
- language agnostic: uses a JSON-based dsl to describe visualizations
- simple: simple and concise grammar to generate the most common visualizations (esp. Vega-Lite). Example: barchart
- supports multiple data formats: e.g. csv, tsv, geojson/topojson (for maps), json, etc.
- supports multiple/custom loader types: e.g. http request, inlined, data stream, etc.
- composable: the Vega dsl is designed to be composable, which makes it easy to create visualizations from existing vis components, i.e. easy to wrap the dsl as composable vis components in the UI.
- can be implemented entirely on the client side: does not need any backend service to generate the html for the visualization
Cons
None I can think of.
Concept Details
Extend the mlpipeline-ui-metadata.json artifact to support Vega and Vega-Lite, i.e. complement react-vis with react-vega. It is easier to map the existing ui-metadata schema to a vega-lite spec and then generate the corresponding vis component, rather than implementing an individual vis component each time we need to support a new vis.
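For example, the extended ui-metadata entry might look like this (hypothetical; the exact type name and fields would be up to KFP):

# a new viewer type carrying a Vega-Lite spec directly, to be rendered by
# react-vega in the frontend (field names are an assumption, not an API)
ui_metadata = {
    "outputs": [
        {
            "type": "vega-lite",
            "storage": "inline",
            "source": '{"$schema": "https://vega.github.io/schema/vega-lite/v4.json", ...}',
        }
    ]
}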
See https://vega.github.io/editor/#/examples/vega-lite/airport_connections