nuaabuaa07 commented 7 months ago

I tried "altair", "matplotlib", "seaborn", "ggplot", and "plotly" as the `library` parameter of `lida.visualize(..., library="seaborn")`, but the output image is rendered incorrectly when my data contains Chinese. Can it support displaying Chinese?

Shu-Ji commented 7 months ago

You can monkey-patch lida to make this work:

Add these two lines at the top of your own Python script.

from monkey_patch_lida import monkey_patch_lida

monkey_patch_lida()

from lida import Manager
lida = Manager(text_gen='openai')
...

Tested on a Mac; pick a font that is available on your own system, such as "Microsoft YaHei" (微软雅黑).

monkey_patch_lida.py

````python
from lida.components import goal
from lida.components.scaffold import ChartScaffold
from lida.components.viz import vizgenerator
from lida.components.viz.vizgenerator import VizGenerator
from lida.datamodel import Goal


def monkey_patch_lida():
    # Replace the prompts with short Chinese versions; otherwise they exceed the character limit.
    goal.SYSTEM_INSTRUCTIONS = '''
    你是一个经验丰富的数据分析师，可产生给定个数的关于数据的有见地的目标。
    你建议的可视化必须遵循可视化最佳实践（如，必须使用条形图而不是饼图来比较数量），并且要有意义（如，在适当的情况下在地图上绘制经纬度）。
    每个目标必须包括一个question、一个visualization（可视化必须参考摘要中的准确列字段）和一个rationale（使用数据集字段的理由以及我们将从可视化中学到什么）。
    每个目标都必须提到数据集摘要中的确切字段。'''

    goal.FORMAT_INSTRUCTIONS = '''
    输出必须是合法的JSON代码，必须使用以下格式，除了代码不要有任何解释，列表中需要有3到10个问题：

    ```
    [
        { "index": 0,  "question": "X的分布是什么？", "visualization": "X的直方图", "rationale": "这告诉我们 "} ..
    ]
    ```
    '''

    vizgenerator.system_prompt = '''
你是一个乐于助人的助手，擅长为可视化编写完美的代码。
给定一些代码模板，在给定数据集和所描述的目标的情况下，完成模板以生成可视化。
您编写的代码必须遵循可视化最佳实践，即满足指定目标，应用正确的转换，使用正确的可视化类型，使用正确数据编码，并使用正确的美学（例如，确保轴清晰可见）。
应用的转换必须正确，使用的字段必须正确。可视化代码必须正确，不得包含任何语法或逻辑错误（例如，必须考虑字段类型并正确使用它们）。
你必须首先为你将如何解决任务生成一个简短的计划，例如你将应用什么转换，例如如果你需要构建一个新的列，你将使用什么字段，你将采用什么可视化类型，你将运用什么美学，等等。
'''

    # get_template_py is defined further down; the name is only looked up
    # when monkey_patch_lida() is called, so the forward reference is fine.
    ChartScaffold.get_template = get_template_py


# References to the original methods (not used below).
old_get_template = ChartScaffold.get_template
old_generate = VizGenerator.generate


def get_template_py(self, goal: Goal, library: str):
    # Original English instructions; overridden by the Chinese version on the next assignment.
    general_instructions = f"If the solution requires a single value (e.g. max, min, median, first, last etc), ALWAYS add a line (axvline or axhline) to the chart, ALWAYS with a legend containing the single value (formatted with 0.2F). If using a <field> where semantic_type=date, YOU MUST APPLY the following transform before using that column i) convert date fields to date types using data[''] = pd.to_datetime(data[<field>], errors='coerce'), ALWAYS use errors='coerce' ii) drop the rows with NaT values data = data[pd.notna(data[<field>])] iii) convert field to right time format for plotting. ALWAYS make sure the x-axis labels are legible (e.g., rotate when needed). Solve the task carefully by completing ONLY the <imports> AND <stub> section. Given the dataset summary, the plot(data) method should generate a {library} chart ({goal.visualization}) that addresses this goal: {goal.question}. DO NOT WRITE ANY CODE TO LOAD THE DATA. The data is already loaded and available in the variable data."

    general_instructions = f"不要编写任何代码来加载数据。数据已加载到变量data中。代码需遵循python的4个空格的缩进风格，保证代码可正常运行，不会报错。"
    matplotlib_instructions = f" {general_instructions} DO NOT include plt.show(). The plot method must return a matplotlib object (plt). Think step by step.\n"

    if library == "matplotlib":
        instructions = {
            "role": "assistant",
            "content": f"  {matplotlib_instructions}. "}
        template = \
            f"""
import matplotlib.pyplot as plt
import pandas as pd

__doc = '''
方案计划
1. ..
'''
def plot(data: pd.DataFrame):
    # 只修改这里的代码，一定注意你的代码需遵循python的4个空格的缩进风格。
    plt.title('{goal.question}', wrap=True)
    # save the plot as PNG file
    # plt.savefig("plot.png")
    return plt;

plt.rcParams['font.family'] = ['Arial Unicode MS']
plt.rcParams['axes.unicode_minus'] = False
# data = pd.read_csv('./xh_mobile.csv')
chart = plot(data) # data already contains the data to be plotted. Always include this line. No additional code beyond this line."""
    elif library == "seaborn":
        instructions = {
            "role": "assistant",
            "content": matplotlib_instructions}
        template = \
            f"""
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

__doc = '''
方案计划
1. ..
'''
def plot(data: pd.DataFrame):
    # 只修改这里的代码，一定注意你的代码需遵循python的4个空格的缩进风格。
    plt.title('{goal.question}', wrap=True)
    # save the plot as PNG file
    # plt.savefig("./imgs/plot.png")
    return plt;

# data = pd.read_csv('./xh_mobile.csv')
plt.rcParams['font.family'] = ['Arial Unicode MS'] # 用来正常显示中文标签
plt.rcParams['axes.unicode_minus'] = False # 用来正常显示负号
sns.set_style('whitegrid', {{'font.sans-serif': ['Arial Unicode MS', 'Arial']}})
chart = plot(data) # data already contains the data to be plotted. Always include this line. No additional code beyond this line."""
    else:
        # Only the two branches above are handled by this patch.
        raise ValueError(
            "Unsupported library. This patch supports only 'matplotlib' and 'seaborn'."
        )

    return template, instructions
````
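The templates hard-code 'Arial Unicode MS', which is usually only present on macOS. As a rough sketch (not part of lida), the font could instead be picked from whatever is installed. The helper names and the candidate list below are my own assumptions; the matplotlib calls used are `font_manager.fontManager.ttflist` and `plt.rcParams`.

```python
# Candidate fonts with Chinese glyphs, in order of preference (an assumed list;
# adjust it to the fonts you actually have).
CJK_FONT_CANDIDATES = (
    "Arial Unicode MS",     # macOS
    "PingFang SC",          # macOS
    "Microsoft YaHei",      # Windows
    "SimHei",               # Windows
    "Noto Sans CJK SC",     # Linux
    "WenQuanYi Micro Hei",  # Linux
)


def pick_cjk_font(available, candidates=CJK_FONT_CANDIDATES):
    """Return the first candidate that appears in `available`, or None."""
    available_set = set(available)
    for name in candidates:
        if name in available_set:
            return name
    return None


def configure_matplotlib_for_chinese(candidates=CJK_FONT_CANDIDATES):
    """Point matplotlib at an installed CJK font and return its name."""
    # Imported lazily so pick_cjk_font() can be used without matplotlib.
    import matplotlib.pyplot as plt
    from matplotlib import font_manager

    installed = [entry.name for entry in font_manager.fontManager.ttflist]
    font = pick_cjk_font(installed, candidates)
    if font is None:
        raise RuntimeError("No candidate CJK font is installed")

    # Put the chosen font first in the sans-serif fallback list.
    plt.rcParams["font.sans-serif"] = [font] + list(plt.rcParams["font.sans-serif"])
    plt.rcParams["font.family"] = "sans-serif"
    # Render the minus sign with an ASCII hyphen so it does not show as a box.
    plt.rcParams["axes.unicode_minus"] = False
    return font
```

The name returned by `configure_matplotlib_for_chinese()` is what you would substitute for 'Arial Unicode MS' in the two templates.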

ccccler commented 4 months ago


When I read data that contains Chinese, I get the error `UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc8 in position 0: invalid continuation byte`. How should I fix this? Thanks!

Shu-Ji commented 4 months ago

I'm using Python 3 and my CSV contains Chinese, and I don't get an error. Can you check whether the error is raised at the summary step?

ccccler commented 4 months ago

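Yes, it fails at the summary step. The traceback line just above the error is `File "parsers.pyx", line 574, in pandas._libs.parsers.TextReader.__cinit__`. Is your data also a CSV? Do both the headers and the row names contain Chinese?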

kishoretvk commented 3 months ago

Can we use Llama or Mistral for summarization, and can we use Picasso or Stable Diffusion for image generation?

Shu-Ji commented 3 months ago

My field names contain no Chinese; only the content does.
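A `0xc8` lead byte that is not followed by a valid continuation byte is what you would typically see when a GBK/GB18030 file is read as UTF-8, so my guess (an assumption, since I have not seen the file) is that the CSV was saved in a Chinese Windows encoding. A minimal sketch that re-saves such a file as UTF-8 using only the standard library; `transcode_to_utf8` and its fallback list are hypothetical names, not part of lida or pandas:

```python
from pathlib import Path


def transcode_to_utf8(src, dst, encodings=("utf-8", "gb18030")):
    """Decode `src` with the first encoding that works and write `dst` as UTF-8.

    Returns the name of the encoding that succeeded.
    """
    raw = Path(src).read_bytes()
    for enc in encodings:
        try:
            text = raw.decode(enc)
        except UnicodeDecodeError:
            continue  # try the next candidate encoding
        # Write bytes directly so line endings are left exactly as they were.
        Path(dst).write_bytes(text.encode("utf-8"))
        return enc
    raise ValueError(f"none of {encodings} could decode {src}")
```

After that, pass the UTF-8 copy to the summary step. If you load the data yourself, `pd.read_csv(path, encoding="gb18030")` should have the same effect. Note that `gb18030` will also decode some non-Chinese byte sequences without raising, so check the result by eye.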

microsoft / lida

can the visualize model support Chinese in generaled image? #68
