[Feat] get vizro ai customized text output

mckinsey / vizro

Vizro is a toolkit for creating modular data visualization applications.

https://vizro.readthedocs.io/en/stable/

Apache License 2.0

2.46k stars 109 forks source link

[Feat] get vizro ai customized text output #488

Closed Anna-Xiong closed 3 weeks ago

Anna-Xiong commented 1 month ago

Description

we will have a dataclass for Plot components.

@dataclass
class Plot:
    """Data class about a vizro ai plot."""

    code_string: str
    fig: go.Figure = field(default=None)
    business_insights: str = field(default=None)
    code_explanation: str = field(default=None)

here is our updated plot() call.

    def plot(
        self,
        df: pd.DataFrame,
        user_input: str,
        explain: bool = False,
        max_debug_retry: int = 3,
        return_plot_components: bool = False,
    ) -> Union[go.Figure, Plot]:

explain controls whether to run extra LLM to get text insights of business and code explanation
return_plot_components controls whether to Plot dataclass or return only plotly fig object

Pending:

Integration test needs to be updated, another PR will address integration tests and

# TODO Tentative for integration test, will be updated/removed for new tests
    if self._return_all_text:
        return output_dict

will be deleted afterwards.

[x] I acknowledge and agree that, by checking this box and clicking "Submit Pull Request":
- I submit this contribution under the Apache 2.0 license and represent that I am entitled to do so on behalf of myself, my employer, or relevant third parties, as applicable.
- I certify that (a) this contribution is my original creation and / or (b) to the extent it is not my original creation, I am authorized to submit this contribution on behalf of the original creator(s) or their licensees.
- I certify that the use of this contribution as authorized by the Apache 2.0 license does not violate the intellectual property rights of anyone else.
- I have not referenced individuals, products or companies in any commits, directly or indirectly.
- I have not added data or restricted code in any commits, directly or indirectly.

maxschulz-COL commented 1 month ago

I like it! I think this should have the exact same API as the plot function, and it does!

A couple of ideas for suggestions:

maybe tune the name so we know the return type: get_plot_dict?
Warning is nice, but I think not even necessary, could even skip it

Other than that, think this is great 💪 let's do it

Will do a final review once you finalised it, thanks!

Anna-Xiong commented 1 month ago

Thank you!

maybe tune the name so we know the return type: get_plot_dict?

I am leaning towards including output in the name, probably get_plot_output_dict if it is not too long?

Warning is nice, but I think not even necessary, could even skip it

Here we are using INFO instead of WARNING, it would be nice to notify user they haven't turned on explain and won't be able to get business insights and code explanation.

maxschulz-COL commented 1 month ago

Great review from Antony, and on a more thorough review I agree with some but not all points. Let me try to summarize:

Regarding what would feel better

I disagree that we should return a dataclass for plot - a core idea for simplicity is that plot returns a figure, and in jupyter that figure is automatically displayed.

Regarding the things that feel wrong

Regarding 1

Yes, I 100% agree and I am annoyed I didn't think of it before given that this is exactly what is done in DS tools. So for the component collection, yes 100% dataclass.

Regarding 2

Does not feel great, but I think here it makes sense, because .plot is a very special function... It returns a fig object for simplicity (and that should remain as is, as in jupyter it shows the fig automatically and it is the core functionality), but in a jupyter environment, it also shows the insights and code snippet above the fig. That is potentially what is most confusing, but at the same time probably the best way to handle a complicated situation.

Thus, the second function with the same function signature was intended to return the same (not a subset) things, but in a way that is simple to further use (e.g. if you would like to just have the code string, or if you would like to return the insights separately in a chatbot).

Here is the 4 combinations, and what I think it should do:

In Jupter:

plot(...,explain=False) -> Return fig object, that is automatically displayed
plot(...,explain=True) -> Return fig object, that is automatically displayed, with explanation displayed above, not altering the return type

not in jupyter:

plot(...,explain=False) -> Return fig object, that can be displayed if the user chooses fig.show() themselves
plot(...,explain=True) -> Return fig object, that can be displayed if the user chooses fig.show() themselves. Print the explanation to the terminal (this is new, and I think this would make .plot more consistent

If the above is ensured, we just have to find a way to deliver all output to the user in an alternative way (which as said in 1 should definitely be a dataclass. I think there is broadly three options:

Variable return type based on a flag, e.g. .plot(...,return_components = True)
Alternative function (current proposed solution)
Alternative function but private method, so no pubic API (somewhat current solution)
(Scrap this idea, and just return components - A's preferred solution, which is disagree with)

Regarding 3

Yes, I think that would be wise, but I felt it was not the major focus for now

What is my actual preferred option

I have tried to show above that plot is not a subset of get_plot_outputs, but a sister function. That would suggest (I think) a variable return type. But it makes sense as you say to look into the future:

will there be additional things to return that do not make sense to be part of .plot
will there be situations, where plot interacts with other parts of VizroAI, will it be a burden to sync with a potential different function (thinking of iteration, usage in .dashboard, etc)
do we need to think about iteration, will that be part of the same function?

Long story short - we simply cannot tell. Thus I prefer to have a single function with a variable return type that is clear to maintain: it returns either the fig with all other components shown alongside OR it returns a dataclass with all the components as attributes.

That is also easy to explain in docs, and easy to use. If plot becomes involved in other parts of the program, at least we only have to think about plot and not something else.

antonymilne commented 1 month ago

Great points @maxschulz-COL! You are absolutely right that it's good to automatically show the figure in a Jupyter notebook, which I hadn't thought of before. That would actually still be possible with my "ideal" solution above since you could delegate the relevant methods to the underlying figure:

@dataclass
class Plot:
    # I'm just taking the same property names as you use now, even if they could maybe be improved a bit
    business_insights: str
    code_explanation: str
    code_string: str
    fig: go.Figure

    # Whatever methods are needed
    def _repr_html_(...):
        return fig._repr_html_(...)

This is very similar to one of the prototypes I had for our _DashboardReadyFigure which had similar requirements of needing to behave like a go.Figure in Jupyter while requiring extra behaviour for use in a dashboard. There I went for inheritance over composition in the end. One of the reasons for this is it turned out that while it's possible to do the above it's actually a bit painful to implement. And given we have the constraint of needing to return a go.Figure already given the change in vizro-ai 0.2.0, let's not do this.

We could actually also follow the subclassing approach like we do with _DashboardReadyFigure and do class Plot(go.Figure) and put a business_insights property into that class to get the best of both worlds where you get something that behaves like go.Figure but also exposes additional properties. This is pretty ugly and also a little awkward to implement so I don't think I'm a fan of it but might work ok here 🤔

Assuming we don't like that then I agree with your solution 1 here that we have a single plot function but with an argument that alters the return type. But I find the existence of both explain and return_components a bit confusing and wonder if we could just use the explain argument for both?

plot(..., explain=False): return fig object and don't show any explanation, just like it does now.
plot(..., explain=True): return dataclass (different from now). If in Jupyter then display explanation on screen (as done now) and also fig on screen (needed since we no longer directly return the figure object). If outside Jupyter then personally I don't care whether we show the explanation or not, since it's available in the return dataclass anyway.

My assumption here is that in 99% of cases explain is basically equivalent to return_components. The only cases that this solution renders impossible would be:

explain=True and return_components=False: you want to see the explanation but don't want to use it further. With my above solution you set explain=True and get extra components in your return object and you can just ignore those if you like so this seems fine since the graph is still displayed
explain=False and return_components=True: again just set explain=True. The only difference here is the explanation gets shown (at least in Jupyter notebooks) when maybe you don't want that, but that seems ok to me also.

maxschulz-COL commented 4 weeks ago

Assuming we don't like that then I agree with your solution 1 here that we have a single plot function but with an argument that alters the return type. But I find the existence of both explain and return_components a bit confusing and wonder if we could just use the explain argument for both?

plot(..., explain=False): return fig object and don't show any explanation, just like it does now.

plot(..., explain=True): return dataclass (different from now). If in Jupyter then display explanation on screen (as done now) and also fig on screen (needed since we no longer directly return the figure object). If outside Jupyter then personally I don't care whether we show the explanation or not, since it's available in the return dataclass anyway.

Happy with this! Would this not be another breaking change, because now the return type of the plot function with explain=True is actually go.Figure and it would change to dataclass?

antonymilne commented 4 weeks ago

Haha, I knew you would ask that 😀 And it's definitely the right question to ask.

Technically yes, it's a breaking change. But breaking changes are always a bit of a grey area, relative to how things are actually used and not clear cut. Within the context of how people use VizroAI, would anyone be doing explain=True and using the go.Figure directly? My feeling is probably no: if you want to use the figure then you would do explain=False (the default, so undoubtedly the most common anyway). The result for Jupyter users would look the same because we'd still show the plot in the notebook. The problem would only come if you're actually consuming that plot in some way with explain=True.

Given that we only just did the breaking release 0.2.0 and changed the return type there I would be inclined to do this more as a "new feature" rather than breaking change, but it's definitely debatable. I don't mind really - up to you guys to think about how many people this actually would break the workflow for.

maxschulz-COL commented 4 weeks ago

Given that we only just did the breaking release 0.2.0 and changed the return type there I would be inclined to do this more as a "new feature" rather than breaking change, but it's definitely debatable. I don't mind really - up to you guys to think about how many people this actually would break the workflow for.

I am ok with it! Let's see what @Anna-Xiong thinks

Anna-Xiong commented 3 weeks ago

Thanks for the thorough reviews and suggestions on enhancing the plot function of VizroAI. After considering all the inputs, I have updated the code accordingly and have a few point would like to contribute here:

Agreement on Single plot Call: I agree with using a single plot method that adjusts return type based on the provided arguments. The reason is the same as our first POC: given that even if we have two calls, there would be a duplicate of almost all arguments.

Support for Dataclass: Using a dataclass to return structured data is a great idea and easier to use than dictionary for sure. updated that already.

Concerns with only using explain: However, I have reservations about using only explain parameter to control all aspects of the output.

Our users might have requirements where they need the figure object and code string to integrate into chat application without the explanation or vice versa. In fact, that is one of the most common requirement we have seen and this is also what our chat application uses as well.
This also involves calling extra layer of LLM or not(cost and speed relevant). In short, I would summarize the responsibility of explain and return_plot_components as below:
- explain only controls whether to run extra LLM to get text insights of business and code explanation
- return_plot_components controls whether to Plot dataclass or return only plotly fig object

I would be strongly leaning towards additional flags here that control the type of the output without merging them into a single parameter explain.

If we are not ok for this adding extra argument change, I would think it is better to have a separate get_plot_components() method that

VizroAI.get_plot_components() could return the dataclass so that we don't mix what explain controls,
and make VizroAI.plot() a short cut , it can return the fig from dataclass returned from VizroAI.get_plot_components()

Decision on Pylint Rule Adjustment: Lastly, as we expand our this VizroAI.plot()function, the additional argument brings our total to six, surpassing the current pylint rule of five arguments per function. We need to decide if we should adjust our linting rules to accommodate this change.

Code Updates: I've made some updates to the existing prototype to reflect our discussions so far.

We care more about auto rendering in jupyter environment, so if users are using jupyter, the utils has been refactoring to take care of auto rendering of fig. No matter what users specify in plot() call, the fig is always rendered in jupyter, and not rendered automatically in other environment, which they can add one line of code to show the fig(exact same as our current plot() call, no breaking change).

plot(...,explain=False, return_plot_components= True) -> Return dataclass Plot that consists of code_string and fig, the fig is automatically displayed if in jupyter
plot(...,explain=False, return_plot_components= False) -> Return fig directly, the fig is automatically displayed if in jupyter.
plot(...,explain=True, return_plot_components= True) -> Return dataclass Plot that consists of code_string and fig, business_insights , and code_explanation. The fig is automatically displayed and the Markdown is rendered for text information if in jupyter
plot(...,explain=True, return_plot_components= False) -> Return fig directly, The fig is automatically displayed and the Markdown is rendered for text information if in jupyter

Also return_plot_components will be default to False, so we don't cause any breaking changes from the users' side.

Let's take a look at the code and share your thoughts on these points, especially regarding the flexibility of the output controls and the pylint rule adjustment.

maxschulz-COL commented 3 weeks ago

Concerns with only using explain: However, I have reservations about using only explain parameter to control all aspects of the output.

Our users might have requirements where they need the figure object and code string to integrate into chat application without the explanation or vice versa. In fact, that is one of the most common requirement we have seen and this is also what our chat application uses as well.

This also involves calling extra layer of LLM or not(cost and speed relevant). In short, I would summarize the responsibility of explain and return_plot_components as below:

explain only controls whether to run extra LLM to get text insights of business and code explanation

return_plot_components controls whether to Plot dataclass or return only plotly fig object

I would be strongly leaning towards additional flags here that control the type of the output without merging them into a single parameter explain.

This is a very valid point, and I had not thought of that. I think this makes a strong case for actually adding an extra parameter. It also avoids the "breaking change".

If we are not ok for this adding extra argument change, I would think it is better to have a separate get_plot_components() method that

Decision on Pylint Rule Adjustment: Lastly, as we expand our this VizroAI.plot()function, the additional argument brings our total to six, surpassing the current pylint rule of five arguments per function. We need to decide if we should adjust our linting rules to accommodate this change.

I think it is ok to add an ignore here - I think that pylint rule is generally good to have, but can feel too strict at times.

Code Updates: I've made some updates to the existing prototype to reflect our discussions so far.

We care more about auto rendering in jupyter environment, so if users are using jupyter, the utils has been refactoring to take care of auto rendering of fig. No matter what users specify in plot() call, the fig is always rendered in jupyter, and not rendered automatically in other environment, which they can add one line of code to show the fig(exact same as our current plot() call, no breaking change).

plot(...,explain=False, return_plot_components= True) -> Return dataclass Plot that consists of code_string and fig, the fig is automatically displayed if in jupyter

plot(...,explain=False, return_plot_components= False) -> Return fig directly, the fig is automatically displayed if in jupyter.

plot(...,explain=True, return_plot_components= True) -> Return dataclass Plot that consists of code_string and fig, business_insights , and code_explanation. The fig is automatically displayed and the Markdown is rendered for text information if in jupyter

plot(...,explain=True, return_plot_components= False) -> Return fig directly, The fig is automatically displayed and the Markdown is rendered for text information if in jupyter

Didn't have time to check the details - is the mechanism that render the fig in the cases where return_plot_components= False the original Jupyter one? Just curious on the details.

Also return_plot_components will be default to False, so we don't cause any breaking changes from the users' side.

Great!

Let's take a look at the code and share your thoughts on these points, especially regarding the flexibility of the output controls and the pylint rule adjustment.

Overall I think we should do it as you say, and you are right, it is one of the most requested features, and it separates responsibility of arguments.

Anna-Xiong commented 3 weeks ago

@antonymilne @maxschulz-COL

Thank you for reviewing it again!

Didn't have time to check the details - is the mechanism that render the fig in the cases where return_plot_components= False the original Jupyter one? Just curious on the details.

Yes, it will render the fig where return_plot_components= False, and return a direct fig as well.

Then most questions remaining here would be relevant to naming of arguments and dataclass.

I followed @antonymilne suggestions on dataclass attributes and this new one looks simpler and easier to use,
but probably use return_elements instead of return_details would be better for the argument in plot() call? @maxschulz-COL do you have any other suggestion on names?

antonymilne commented 3 weeks ago

Just a final suggestion on the names: return_elements is ok but I think I'd prefer return_full overall if that makes sense for you two also? @maxschulz-COL @Anna-Xiong

Anna-Xiong commented 3 weeks ago

Just a final suggestion on the names: return_elements is ok but I think I'd prefer return_full overall if that makes sense for you two also? @maxschulz-COL @Anna-Xiong

I am ok either way, but slightly towards return_elements or return_full_elements? Let's see what @maxschulz-COL thinks?

maxschulz-COL commented 3 weeks ago

Just a final suggestion on the names: return_elements is ok but I think I'd prefer return_full overall if that makes sense for you two also? @maxschulz-COL @Anna-Xiong

I am ok either way, but slightly towards return_elements or return_full_elements? Let's see what @maxschulz-COL thinks?

I prefer return_elements - I find that is really what it is.

antonymilne commented 3 weeks ago

Fine by me, let's go for return_elements then 🙂