microsoft / lida

Automatic Generation of Visualizations and Infographics using Large Language Models
https://microsoft.github.io/lida/
MIT License

Can the visualize model support Chinese in the generated image? #68

Open nuaabuaa07 opened 7 months ago

nuaabuaa07 commented 7 months ago

I tried "altair", "matplotlib", "seaborn", "ggplot", and "plotly" as the 'library' parameter of 'lida.visualize( ... , library="seaborn")', but the output image is rendered incorrectly whenever my data contains Chinese. Can it display Chinese text?

Shu-Ji commented 7 months ago

You can monkey patch lida to make this work.

Add these two lines at the top of your own Python script:

from monkey_patch_lida import monkey_patch_lida

monkey_patch_lida()

from lida import Manager
lida = Manager(text_gen='openai')
...

Tested on a Mac. Pick a font that is available on your own system, such as "Microsoft YaHei".
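Since the right font name depends on the operating system, a small helper can pick the first installed font from a preference list. This is a minimal sketch, not part of lida: `pick_cjk_font`, `installed_font_names`, and the font names in `PREFERRED_CJK_FONTS` are illustrative assumptions, and only `matplotlib.font_manager.fontManager.ttflist` is an existing matplotlib attribute.

```python
from typing import Iterable, List, Optional, Sequence

# Assumed preference list of fonts that commonly ship with CJK glyphs.
# Adjust it to whatever is actually installed on your machine.
PREFERRED_CJK_FONTS = (
    "Arial Unicode MS",   # macOS (the font used in the patch)
    "PingFang SC",        # macOS
    "Microsoft YaHei",    # Windows
    "SimHei",             # Windows
    "Noto Sans CJK SC",   # many Linux distributions
)


def pick_cjk_font(
    available: Iterable[str],
    preferred: Sequence[str] = PREFERRED_CJK_FONTS,
) -> Optional[str]:
    """Return the first preferred font that is installed, or None."""
    names = set(available)
    for name in preferred:
        if name in names:
            return name
    return None


def installed_font_names() -> List[str]:
    """List the font family names that matplotlib has discovered."""
    # Imported lazily so pick_cjk_font can be used without matplotlib.
    from matplotlib import font_manager

    return sorted({entry.name for entry in font_manager.fontManager.ttflist})


# Example usage (requires matplotlib):
#   font = pick_cjk_font(installed_font_names())
#   if font is not None:
#       plt.rcParams['font.family'] = [font]
```

The chosen name would then replace 'Arial Unicode MS' in the templates of monkey_patch_lida.py.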

monkey_patch_lida.py

````python
from lida.components import goal
from lida.components.scaffold import ChartScaffold
from lida.components.viz import vizgenerator
from lida.components.viz.vizgenerator import VizGenerator
from lida.datamodel import Goal


def monkey_patch_lida():
    # Replace the prompts with short Chinese versions; otherwise they exceed the character limit.
    goal.SYSTEM_INSTRUCTIONS = '''
    你是一个经验丰富的数据分析师,可产生给定个数的关于数据的有见地的目标。
    你建议的可视化必须遵循可视化最佳实践(如,必须使用条形图而不是饼图来比较数量),并且要有意义(如,在适当的情况下在地图上绘制经纬度)。
    每个目标必须包括一个question、一个visualization(可视化必须参考摘要中的准确列字段)和一个rationale(使用数据集字段的理由以及我们将从可视化中学到什么)。
    每个目标都必须提到数据集摘要中的确切字段。'''

    goal.FORMAT_INSTRUCTIONS = '''
    输出必须是合法的JSON代码,必须使用以下格式,除了代码不要有任何解释,列表中需要有3到10个问题:
    [
        { "index": 0,  "question": "X的分布是什么?", "visualization": "X的直方图", "rationale": "这告诉我们 "} ..
    ]
```
'''

    vizgenerator.system_prompt = '''
你是一个乐于助人的助手,擅长为可视化编写完美的代码。
给定一些代码模板,在给定数据集和所描述的目标的情况下,完成模板以生成可视化。
您编写的代码必须遵循可视化最佳实践,即满足指定目标,应用正确的转换,使用正确的可视化类型,使用正确数据编码,并使用正确的美学(例如,确保轴清晰可见)。
应用的转换必须正确,使用的字段必须正确。可视化代码必须正确,不得包含任何语法或逻辑错误(例如,必须考虑字段类型并正确使用它们)。
你必须首先为你将如何解决任务生成一个简短的计划,例如你将应用什么转换,例如如果你需要构建一个新的列,你将使用什么字段,你将采用什么可视化类型,你将运用什么美学,等等。
'''

    # Swap in the template builder defined below, which sets a Chinese-capable font.
    ChartScaffold.get_template = get_template_py


# References to the original implementations, captured at import time (before patching).
old_get_template = ChartScaffold.get_template
old_generate = VizGenerator.generate


def get_template_py(self, goal: Goal, library: str):
    # English instructions from upstream lida; overridden by the Chinese version assigned right after.
    general_instructions = f"If the solution requires a single value (e.g. max, min, median, first, last etc), ALWAYS add a line (axvline or axhline) to the chart, ALWAYS with a legend containing the single value (formatted with 0.2F). If using a <field> where semantic_type=date, YOU MUST APPLY the following transform before using that column i) convert date fields to date types using data['<field>'] = pd.to_datetime(data[<field>], errors='coerce'), ALWAYS use errors='coerce' ii) drop the rows with NaT values data = data[pd.notna(data[<field>])] iii) convert field to right time format for plotting. ALWAYS make sure the x-axis labels are legible (e.g., rotate when needed). Solve the task carefully by completing ONLY the <imports> AND <stub> section. Given the dataset summary, the plot(data) method should generate a {library} chart ({goal.visualization}) that addresses this goal: {goal.question}. DO NOT WRITE ANY CODE TO LOAD THE DATA. The data is already loaded and available in the variable data."

    general_instructions = f"不要编写任何代码来加载数据。数据已加载到变量data中。代码需遵循python的4个空格的缩进风格,保证代码可正常运行,不会报错。"
    matplotlib_instructions = f" {general_instructions} DO NOT include plt.show(). The plot method must return a matplotlib object (plt). Think step by step.\n"

    if library == "matplotlib":
        instructions = {
            "role": "assistant",
            "content": f"  {matplotlib_instructions}. "}
        template = \
            f"""
import matplotlib.pyplot as plt
import pandas as pd

__doc = '''
方案计划
1. ..
'''
def plot(data: pd.DataFrame):
    # 只修改这里的代码,一定注意你的代码需遵循python的4个空格的缩进风格。
    plt.title('{goal.question}', wrap=True)
    # save the plot as PNG file
    # plt.savefig("plot.png")
    return plt;

plt.rcParams['font.family'] = ['Arial Unicode MS']
plt.rcParams['axes.unicode_minus'] = False
# data = pd.read_csv('./xh_mobile.csv')
chart = plot(data) # data already contains the data to be plotted. Always include this line. No additional code beyond this line."""
    elif library == "seaborn":
        instructions = {
            "role": "assistant",
            "content": matplotlib_instructions}
        template = \
            f"""
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

__doc = '''
方案计划
1. ..
'''
def plot(data: pd.DataFrame):
    # 只修改这里的代码,一定注意你的代码需遵循python的4个空格的缩进风格。
    plt.title('{goal.question}', wrap=True)
    # save the plot as PNG file
    # plt.savefig("./imgs/plot.png")
    return plt;

# data = pd.read_csv('./xh_mobile.csv')
plt.rcParams['font.family'] = ['Arial Unicode MS']  # 用来正常显示中文标签
plt.rcParams['axes.unicode_minus'] = False  # 用来正常显示负号
sns.set_style('whitegrid', {{'font.sans-serif': ['Arial Unicode MS', 'Arial']}})
chart = plot(data) # data already contains the data to be plotted. Always include this line. No additional code beyond this line."""
    else:
        raise ValueError(
            "Unsupported library. Choose from 'matplotlib', 'seaborn', 'plotly', 'bokeh', 'ggplot', 'altair'."
        )

    return template, instructions
````

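The patch above replaces a method on a class at runtime and never restores it. When the change should be temporary, the same technique can be wrapped in a context manager. The sketch below uses a stand-in class rather than lida itself: `Scaffold`, `get_template_zh`, and `patched` are hypothetical names introduced only for illustration.

```python
from contextlib import contextmanager


class Scaffold:
    """Hypothetical stand-in for a class whose method we want to replace."""

    def get_template(self, library: str) -> str:
        return f"template for {library}"


def get_template_zh(self, library: str) -> str:
    """Replacement method with the same signature as the original."""
    return f"{library} 模板"


@contextmanager
def patched(cls, name, replacement):
    """Temporarily replace cls.<name> and restore the original on exit."""
    original = getattr(cls, name)
    setattr(cls, name, replacement)
    try:
        yield original
    finally:
        setattr(cls, name, original)


scaffold = Scaffold()
with patched(Scaffold, "get_template", get_template_zh):
    inside = scaffold.get_template("seaborn")   # uses the replacement
outside = scaffold.get_template("seaborn")      # original behavior restored
```

Because the attribute is swapped on the class, existing instances pick up the replacement too, which is also why the patch has to run before lida builds its templates.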
ccccler commented 4 months ago

When I read data that contains Chinese, I get the error UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc8 in position 0: invalid continuation byte. How should I fix this? Thanks!

Shu-Ji commented 4 months ago

I'm using Python 3 and my CSV contains Chinese, but I don't get an error. Can you check whether the error is raised at the summary step?

ccccler commented 4 months ago

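Yes, it fails at the summary step. The frame just above the error in the traceback is File "parsers.pyx", line 574, in `pandas._libs.parsers.TextReader.__cinit__`. Is your data also a CSV? Do both the column headers and the row names contain Chinese?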

kishoretvk commented 3 months ago

Can we use Llama or Mistral for summarization, and can we use Picasso or Stable Diffusion for image generation?

Shu-Ji commented 3 months ago

My field names don't contain Chinese; only the contents do.
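On the UnicodeDecodeError reported above: a byte such as 0xc8 at position 0 is consistent with a CSV saved in a legacy Chinese encoding such as GBK rather than UTF-8, although the thread does not confirm this. A minimal sketch, under that assumption, is to detect the encoding by trial and re-save the file as UTF-8 before handing it to lida. The helper names and the candidate list are illustrative, not part of lida or pandas.

```python
from pathlib import Path
from typing import Sequence

# Assumed candidates: UTF-8 first, then gb18030 (a superset of GBK and GB2312).
CANDIDATE_ENCODINGS = ("utf-8", "gb18030")


def detect_encoding(raw: bytes, candidates: Sequence[str] = CANDIDATE_ENCODINGS) -> str:
    """Return the first candidate encoding that decodes the bytes without error."""
    for encoding in candidates:
        try:
            raw.decode(encoding)
            return encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


def reencode_to_utf8(raw: bytes) -> bytes:
    """Decode with the detected encoding and encode the text again as UTF-8."""
    return raw.decode(detect_encoding(raw)).encode("utf-8")


def convert_file_to_utf8(src: str, dst: str) -> None:
    """Write a UTF-8 copy of the file at src to dst."""
    Path(dst).write_bytes(reencode_to_utf8(Path(src).read_bytes()))
```

Trial decoding is only a heuristic, since some byte sequences are valid in more than one encoding. If the encoding is already known, passing it directly, for example `pd.read_csv(path, encoding="gb18030")`, is simpler.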