microsoft / lida

Automatic Generation of Visualizations and Infographics using Large Language Models
https://microsoft.github.io/lida/
MIT License

Seeking Expert Advice with Accurately Interpreting Data from Diverse Sources, such as Financial Data or Generic Health Data #35

Open on1onmangoes opened 1 year ago

on1onmangoes commented 1 year ago

Lida is a big deal and very impressive. It fits many use cases and features well, and it lets us replace the current tedious process with streamlined, end-to-end features that have an incredible impact. I want to ask whether it's possible to provide additional context to Summary and Goals so that they can better evaluate the data and provide goals that are aligned with the user, system, and task. For example, financial data used by an FP&A analyst to evaluate accruals vs. patient records consumed by the patient or by the provider (different context).

Description: Lida encounters challenges when dealing with files from diverse, previously unknown sources, like financial details from a presentation or general health data from a public report. These files can house multiple datasets spanning different topics or industries. Our existing processing system may not be adept at distinguishing and correctly interpreting these varied datasets, which can lead to inaccuracies.

Use Case: Imagine a scenario where a user imports a multi-faceted report from a financial institution. This report could blend revenue stats, expense breakdowns, and trends in health-related expenditures. Each dataset might have its unique structure and nuance.

Steps to reproduce:

1. Import a multifaceted file, like a financial report that contains both financial metrics and some generic health data points.
2. Attempt to analyze or process the data with Lida.

Expected Behavior: Lida should:

- Automatically discern and classify the different datasets.
- Offer user-guided options to set context or provide metadata tags for specific sections, ensuring accurate interpretation.

Actual Behavior: Lida handles all datasets in a homogeneous manner, possibly leading to misinterpretation. In particular, it leans towards distribution metrics by year, rowid, etc., which are not the core goals for providers or analysts.

Seeking Expert Advice:

- Integrate advanced algorithms, possibly leveraging machine learning, to better detect varied data structures and types. The persona (or system behavior) issue is aligned with this. There is also currently a need for predefined core goals that I want to check dynamically against the unknown datasets for feasibility.
- Allow users to set context manually, guiding Lida's data interpretation process. For example, views for patients vs. providers would use the same data, but the goals would require a different lens. Again, this ties to the persona issue.
- Add the ability to produce mock "insufficient data" charts when goal-to-viz generation fails.
- Add a review step where Lida proposes potential classifications for datasets, letting users confirm or adjust them based on their understanding.
- Built-in data cleanup for "millions", dates, and currency signs could also help with LLM-based or traditional EDA and cleanup.
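To make the cleanup request for "millions", dates, and currency signs concrete, here is a minimal Python sketch of the kind of normalization meant. It uses only the standard library and is not part of Lida. The names `parse_amount` and `parse_date`, the suffix table, and the list of accepted date formats are all hypothetical choices for illustration, and the month-first reading of `03/31/2023` is an assumption.

```python
import re
from datetime import date, datetime
from typing import Optional

# Assumed magnitude suffixes; extend to match the reports you actually see.
_SCALES = {
    "k": 1_000, "thousand": 1_000,
    "m": 1_000_000, "mm": 1_000_000, "million": 1_000_000, "millions": 1_000_000,
    "b": 1_000_000_000, "bn": 1_000_000_000, "billion": 1_000_000_000,
}

# Assumed date layouts, tried in order (month-first for slash dates).
_DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d %b %Y", "%B %d, %Y")


def parse_amount(text: str) -> Optional[float]:
    """Turn strings like '$2.5 million' or '(1,250.50)' into a float, else None."""
    s = text.strip()
    # Accounting style: a value wrapped in parentheses is negative.
    negative = s.startswith("(") and s.endswith(")")
    s = s.strip("()").strip()
    if s.startswith("-"):
        negative = True
        s = s[1:].strip()
    s = s.lstrip("$€£").strip()
    match = re.fullmatch(r"([\d,]*\.?\d+)\s*([A-Za-z]*)", s)
    if match is None:
        return None
    number, suffix = match.groups()
    scale = _SCALES.get(suffix.lower()) if suffix else 1
    if scale is None:  # unknown unit such as 'kg': refuse rather than guess
        return None
    value = float(number.replace(",", "")) * scale
    return -value if negative else value


def parse_date(text: str) -> Optional[date]:
    """Return the first successful parse among the assumed formats, else None."""
    for fmt in _DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None


if __name__ == "__main__":
    for raw in ["$2.5 million", "(1,250.50)", "-$40k", "n/a"]:
        print(raw, "->", parse_amount(raw))
    print(parse_date("March 31, 2023"))
```

Running something like this over a column before summarization would give the summary step numeric and date types to work with instead of raw strings. Values that fail to parse come back as `None`, so they can be reviewed rather than silently coerced.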

victordibia commented 1 year ago

Thanks for the detailed description and ideas! All very good! There is currently a PR that is slightly related, focused on enabling persona-based exploration (#11). The idea is as follows:

In the use case above, perhaps specifying a "finance" persona might help guide the model towards more domain specific goals.
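As a rough illustration of how a persona might be supplied from the user's side, the sketch below builds two free-text personas for the same data and passes one to goal generation. `build_persona` and `persona_goals` are hypothetical helpers, not part of LIDA. The call inside `persona_goals` assumes the `lida` package is installed, an OpenAI key is configured, and that `goals` accepts a `persona` keyword argument; at the time of this comment that option was still in the PR, so check it against your installed version.

```python
from typing import List


def build_persona(role: str, domain: str, focus: List[str]) -> str:
    """Compose a free-text persona description (hypothetical helper)."""
    persona = f"{role} working in {domain}"
    if focus:
        persona += " who cares about " + ", ".join(focus)
    return persona


def persona_goals(csv_path: str, persona: str, n: int = 3):
    """Ask LIDA for goals tailored to a persona.

    Assumption: Manager.goals takes a `persona` keyword. The import is lazy
    so that build_persona can be used without lida installed.
    """
    from lida import Manager, llm

    lida = Manager(text_gen=llm("openai"))
    summary = lida.summarize(csv_path)
    return lida.goals(summary, n=n, persona=persona)


# Same dataset, two different lenses.
finance = build_persona("FP&A analyst", "corporate finance",
                        ["accruals", "expense variance"])
patient = build_persona("patient", "healthcare",
                        ["my own lab results over time"])

if __name__ == "__main__":
    print(finance)
    print(patient)
    # goals = persona_goals("report.csv", finance)  # needs lida and an API key
```

Keeping the persona as plain text means the patient and provider views can share one dataset and differ only in the string passed to goal generation.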

Finally, if you can provide some sample input/output examples (even with synthetic data) for the issues above, that would be helpful!