kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0

Exploring Metrics on Experiment Tracking - User Testing Synthesis #1627

Closed NeroOkwa closed 1 year ago

NeroOkwa commented 2 years ago

Description

Ability to plot experiment metrics derived from pipeline runs.

This is based on the second high priority issue resulting from the experiment tracking user research, which is:

Visualisation: ability to show plots / comparison graphs / hyperparameters to evaluate metrics trade-offs

What is the problem? 

Who are the users of this functionality?

Why do our users currently have this problem?

What is the impact of solving this problem?

What could we possibly do?

yetudada commented 2 years ago

For this issue, it's worth noting that we do have this functionality already.

It's just that:

antonymilne commented 2 years ago

Note that plotting metrics against parameters and/or kedro runs is a big topic which has been considered by many different tools and also discussed by us before: https://github.com/quantumblacklabs/private-kedro/issues/1192 https://github.com/kedro-org/kedro/issues/1070 (copy of above issue to public repo but missing some posts)

Just don't want previous discussions or existing solutions from other products to be forgotten about here 🙂

tynandebold commented 2 years ago

We should be careful with our assumptions here. Some notes about that:

Bottom line: the data aren't always going to be nice. Values won't always fall between 0 and 1, and multiple metrics tracked on one plot won't always play well together.

Do reference @AntonyMilneQB's comment here for more context.

Let's pick @noklam's brain about this, too. He may have some great real-world experience with some other tools in this space that do similar things.

noklam commented 2 years ago

As I understand it, we are discussing comparison plots across runs here.

As it stands, the X-axis is the timestamp, which is impractical. There should be a way to make the spacing uniform so you don't get clusters of runs followed by huge gaps in time.

This feature is available in most experiment tracking tools, though usually with the X-axis within a single run; I think it's mostly valid for cross-run comparison as well.

Weights & Biases does something similar, and it's really flexible and configurable.

Y-axis doesn't need to only be between 0 and 1. It can be arbitrarily high or low and it's very possible you'd want to plot multiple metrics on the same scale, one with a huge scale range and then another one that's very small. You could normalize the scales or use a parallel coordinates plot.
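Normalising the scales, as suggested above, can be sketched in a few lines. This is only an illustrative sketch, not Kedro-Viz internals; the metric names and values are made up:

```python
# Hypothetical sketch: min-max normalising metrics with very different
# scales so they can share one plot axis.

def min_max_normalise(values):
    """Scale a list of numbers into [0, 1]; a constant series maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Two metrics across five runs: one with a huge range, one tiny.
rmse = [12000.0, 9500.0, 8700.0, 15000.0, 8100.0]
r_squared = [0.81, 0.86, 0.88, 0.74, 0.90]

rmse_scaled = min_max_normalise(rmse)       # now comparable on a [0, 1] axis
r2_scaled = min_max_normalise(r_squared)    # alongside a very different scale
```

The trade-off is that normalised axes hide the absolute values, which is one reason a parallel coordinates plot (one independently scaled axis per metric) is often preferred.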

I think it all makes sense, but some of the features would be difficult to implement, and the live plot is mentioned in this issue. The more raw data you keep, the more flexibly you can customise these plots later. Another limiting factor for the live plot is that we only save output at the end of a node execution; we would need to keep data at a more granular level to support live plots and these chart customisations. That would be a huge change on the backend, though, and it doesn't fit well with the node execution paradigm.

Side note: AFAIK W&B also runs a GraphQL API and renders with Vega (or Vega-Lite), which is based on d3.js. In Python there is Altair, which supports Vega-Lite. This crazy example shows how customisable it can be, though it's not a common use case.

antonymilne commented 2 years ago

Just to clarify, I don't think live plotting of metric vs. epoch is in scope here at all (as @noklam says, we can't do anything like that without a lot more work on kedro core and it would be quite a paradigm shift). For now we're just concerned with comparing metrics saved as a dataset (so from a node output) in one kedro run vs. the same dataset(s) in another kedro run. What does work "live" here is that when you do kedro run, the newest datasets are available in Kedro-Viz straight away without refreshing or needing to restart the server thanks to the GraphQL subscription.

yetudada commented 2 years ago

Hey everyone! I won't be in the Experiment Tracking review session tomorrow and I just have some thoughts on the current prototype design.

So from what I understand the original problem we're supposed to be solving is: "I'm choosing not to use Kedro-Viz Experiment Tracking because it doesn't allow me to visualise metrics over time."

I may be wrong but I assumed it would be as simple as saying, _"I've done 20 pipeline runs, I was tracking mean_absolute_percentage_error and I want to see how my mean_absolute_percentage_error changed over time by looking at a plot of the values against time on a chart."_ Is this view correct or incorrect?

The reason I ask this is because:

So at the end of the day, the question becomes which problem are we solving for our users to increase adoption of Kedro-Viz Experiment Tracking? Are our users choosing not to use Kedro-Viz Experiment Tracking because:

I'm inclined to think it's the first problem but I'm also happy to be proven wrong on this. So keeping in mind that I'm also making assumptions throughout this piece, I would propose the following structure for user testing, which would provide more insights into the impact of not delivering on either of those problem statements:

  1. Show the users how to find the metrics plots on the Pipeline Visualisation using demo.kedro.org
  2. Ask for feedback; does this feature support a way for them to visualise metrics over time? And how can it be improved?
  3. Show them the new prototype and ask similar questions. The assumption that users can only compare three experiments must be stated to the users.
  4. Ask the users if they would solely use the new prototype in their work and would not need the metrics plots on the Pipeline Visualisation tab to do their work because the success of this design should be that they don't need the first view that we shipped.

Visual References

A: [Screenshot, 2022-08-22 18:54:56]

B: [Screenshot, 2022-08-22 18:57:09]

C: [Screenshot, 2022-08-22 18:53:13]

antonymilne commented 2 years ago

tl;dr: speaking as an ex-PAI user I think parallel coordinates plot is better than time series plot. They both solve very similar problems, but parallel coordinates plot seems more powerful. I don't see why we shouldn't support both but would definitely prioritise the parallel coordinates plot.


I may be wrong but I assumed it would be as simple as saying, _"I've done 20 pipeline runs, I was tracking mean_absolute_percentage_error and I want to see how my mean_absolute_percentage_error changed over time by looking at a plot of the values against time on a chart."_ Is this view correct or incorrect?

I think this is both correct and incorrect 😀 Basically I don't think the two different problems you're posing are all that different. At the end of the day, they both boil down to: for each (kedro run, metric name) point there is a metric value. How do I compare metric values across many (kedro run, metric name) points? Saying "I want to track a metric over time" doesn't necessarily mean "I want a plot of metric vs. time". The parallel coordinates plot still lets you compare between runs even if there's no time axis.
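The "one value per (kedro run, metric name) point" shape can be made concrete with a tiny illustrative sketch. The run IDs, metric names, and helper below are made up for illustration, not anything from Kedro-Viz:

```python
# Hypothetical data shape: one metric value per (kedro run, metric name).
runs = {
    "2022-08-22T18.54.56": {"mape": 0.12, "r2": 0.81},
    "2022-08-23T09.10.11": {"mape": 0.09, "r2": 0.86},
    "2022-08-24T14.02.33": {"mape": 0.15, "r2": 0.74},
}

def metric_series(runs, metric):
    """Collect one metric across runs, ordered by timestamp-like run ID."""
    return [(run_id, metrics[metric]) for run_id, metrics in sorted(runs.items())]

# The same data supports both views: this series feeds a time series plot,
# while each run's dict of metrics is one line of a parallel coordinates plot.
mape_over_time = metric_series(runs, "mape")
```

The point is that the two plots are just two projections of the same (kedro run, metric name) → value mapping.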

See "How to visualise metrics dataset" in https://github.com/kedro-org/kedro/issues/1070#issuecomment-979132359 for my full comments. A time series plot of one metric is one thing you might want to look at, but in reality such a plot is very limited:

So, in theory it's possible to do the full metric value vs. (kedro run, metric name) comparison on a time series plot, but it's not ideal and certainly the way PAI did it was not good enough for what we're trying to do here.

The parallel coordinates plot is not so different from the above; it's just a different way of showing metric value vs. (kedro run, metric name). However, it seems generally more suitable than the time series plot since it doesn't suffer so much from the above problems. There are still some things we need to be careful of to make sure it works:

  1. users can flip each metric axis (i.e. increasing value could go vertically up or down)
  2. user can show/hide metrics
  3. users can show/hide kedro runs

Crucially the first 2 of these were possible but the 3rd was missing in PAI (but should come naturally in kedro-viz because we already have the ability to choose which runs you're comparing). The other main problem with the PAI plot is that it's radial rather than parallel, which looks cool but is harder to use in reality.

Overall, I think there's a good reason that tools like Neptune and wandb do parallel coordinates plots. It seems like the best way to compare metric value across many (kedro run, metric name) points. That's not to say that we shouldn't have a time series plot as well, but I think most people would end up using the parallel coordinates plot way more.

antonymilne commented 2 years ago

One final thought while it occurs to me: you can actually sort of retain the time ordering in the parallel coordinates plot if you colour the lines somehow, e.g. to show the oldest ones fainter than the most recent ones. Not super important because I don't think the time ordering is that important, but at least highlighting the most recent run might be nice.
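The fading idea could be sketched as a simple mapping from time ordering to line opacity. The function name and alpha range here are my own invention, just to show one possible scheme:

```python
# Hypothetical sketch: fade older runs by ramping line opacity linearly
# from min_alpha (oldest run) up to 1.0 (newest run).

def run_opacities(run_ids_oldest_first, min_alpha=0.2):
    """Map each run ID to an opacity; newest run is fully opaque."""
    n = len(run_ids_oldest_first)
    if n == 1:
        return {run_ids_oldest_first[0]: 1.0}
    step = (1.0 - min_alpha) / (n - 1)
    return {
        run_id: min_alpha + i * step
        for i, run_id in enumerate(run_ids_oldest_first)
    }

alphas = run_opacities(["run-1", "run-2", "run-3"])
```

Any plotting layer that accepts a per-line alpha could consume this mapping directly.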

tynandebold commented 2 years ago

As I said in the meeting yesterday, my intuition and instinct around what a user may want for new features here isn't sharp. I defer to @AntonyMilneQB, @noklam, and others who have used things like this in the past while doing real DS/DE/ML work.

What I do think we need is consistency with our hierarchy of information and a viable amount of added value with whatever we develop next. A few things stood out to me during the meeting yesterday:

I'm excited to hear what our interviewees say when this is shown to them.

Lastly, calling out @noklam here. Please add some thoughts and comments if you have some. I think they're invaluable here!

antonymilne commented 2 years ago

While browsing the original issue I came across this from @mkretsch327 (ex-QB data scientist). Basically I think DS (me, Nok, Matt) like the parallel coordinates plot 👍

For a metrics-over-runs view, I've found a parallel coordinate-like plot (essentially a flattened version of the circular metrics plot from PerformanceAI) to be super-useful. A majority of the time I'm looking to see what runs resulted in metrics that are at the extremes of a range (high or low), and that chart ends up providing that information concisely, even for relatively large numbers of metrics.
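The "which runs sit at the extremes of a range" question the quote describes is easy to express over the same kind of per-run metrics data. All names and values below are illustrative:

```python
# Hypothetical sketch: find the runs at the extremes of one metric's range,
# which is what a parallel coordinates plot shows at a glance.

runs = {
    "run-a": {"mape": 0.12, "r2": 0.81},
    "run-b": {"mape": 0.09, "r2": 0.90},
    "run-c": {"mape": 0.15, "r2": 0.74},
}

def extreme_runs(runs, metric):
    """Return (run with lowest value, run with highest value) for a metric."""
    ordered = sorted(runs, key=lambda run_id: runs[run_id][metric])
    return ordered[0], ordered[-1]

best_mape, worst_mape = extreme_runs(runs, "mape")
```

The plot answers this visually for every metric at once; the sketch just shows what a single axis of it encodes.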

yetudada commented 2 years ago

I'm happy with this. I will say that we will prioritise one view to solve the original user problem that was raised. At this point it's either parallel coordinates or time series; it won't be both, because we have other problems to solve once this is completed. And I want to feel certain that if we acted on https://github.com/kedro-org/kedro-viz/issues/1000 we would be doing the right thing.

Admittedly, I am a bit nervous about the parallel plot because we had feedback about the spider diagram when we were evaluating PAI. I highlighted the relevant insight in dark pink.

[Screenshot, 2022-09-07 14:51:51; click the image to head through to the research]

comym commented 2 years ago

Let's see what users say.

I see how the spider diagram might be confusing for some (even though it is the same thing as parallel coordinates). It might look cool to some, but the fact that it was circular added too much [visual] complexity and made it harder to read. This is not an issue with this specific graphic but a universal visual design fact: when you flatten "the same" data into a horizontal alignment, it becomes much more digestible.

I understand picking one or the other for now for the sake of practicality and moving forward iteratively, but I would not ignore either, since they are different ways of exploring the data from different angles.

Again, let's ask the right questions and listen to what users say over the sessions. Loads of great insights are coming.

NeroOkwa commented 1 year ago

User Testing Synthesis - Results

Goal and Methodology

The goal of this session was to evaluate the usability and value risk of the proposed feature in #1627 (tracking metrics over time) through a low-fidelity mockup and a high-fidelity prototype.

The research used a qualitative (interview 🎤 - 6 participants) and quantitative (polls 🗳️) approach across the QuantumBlack and open-source user bases.

1 - Experiment Tracking Use Case

Summary: 2/6 users currently use Kedro's experiment tracking feature. Users used experiment tracking to understand their experiments and to find the best one by iterating with different parameters to produce different metrics. This was done using MLflow, Weights & Biases, and Tableau.

2 - On Plotly Visualisation in Flowchart Mode

Summary: 3/6 users know of this feature and have used it to plot their metrics. One user mentioned that its location is non-intuitive and difficult for non-users to find.

3 - Knowing which Metrics to track

Summary: 3/6 users start with a clear metric to track defined by the project, while others don’t and are more exploratory.

4 - On New Tab Design

Summary: All 6 users prefer this new tab design

5 - On Plots: Parallel Coordinates & Time Series

Summary: 2 users each preferred the time series and parallel coordinates plots, and 2 users liked and would use both plots for different use cases.

6 - On Comparison Mode

Summary: 4/6 users preferred comparison mode in the parallel coordinates view to the time series view. 1 user found comparison mode and the 'metrics' tab confusing.

7 - Pain Points

Summary: The most common pain point, identified by 4/6 users, was the axes: the ability to change the scales or convert the values to percentages for easy comparison.

Features still missing for User’s Pain Point

Summary: There were general feature requests and requests specific to the plots. The most common general feature, identified by 3/6 users, was filtering, followed by the ability to change the axes or customise the metric values.

Problems we still need to consider for the future

yetudada commented 1 year ago

I'll close this 🥳 This theme is complete.