Model generates incorrect VegaZero code for CSV input with specific columns

I encountered an issue while using the DeepvizLab/vrecs model (from Hugging Face) to generate VegaZero code from a CSV dataset. The model appears to generate incorrect or irrelevant responses when given a CSV dataset and a specific query.

I loaded a CSV file named score.csv with the following structure:

name,score-1,score-2
fef,88,32
efwf,78,43
efew,56,21

And here is my code:

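import pandas as pd
import torch
from transformers import pipeline

# `device` was not defined in the snippet as posted; assumed here: first GPU if available, else CPU
device = 0 if torch.cuda.is_available() else -1

# Load the V-RECS model onto the specified device
pipe = pipeline("text-generation", model="DeepvizLab/vrecs", device=device)
# Load the data from the CSV file
csv_file = 'score.csv'
data = pd.read_csv(csv_file)
# Convert the dataset to a JSON string, since the model expects text input
data_json = data.to_json(orient='split')
# Define the query prompt
query = "Generate a bar chart that shows the value for each category."

# Concatenate the query and the dataset into the model input
prompt = f"DATASET: {data_json}\nQUERY: {query}"

print(data_json)

try:
    generated_code = pipe(prompt, max_length=500, num_return_sequences=1, truncation=True)[0].get('generated_text', 'Error: Text generation failed')
    # Print the generated code
    print(generated_code)
except Exception as e:
    print(f"An error occurred: {e}")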

It ran, but the output is simply incorrect. Output:

  attn_output = torch.nn.functional.scaled_dot_product_attention(
DATASET: {"columns":["name","score-1","score-2"],"index":[0,1,2],"data":[["fef",88,32],["efwf",78,43],["efew",56,21]]}
QUERY: Generate a bar chart that shows the value for each category.

    Let's think step by step.

    ## Request: 
    Show the total number from each location code. Plot them as bar chart.
    ## Dataset: 
    [('location_id', 'numeric'), ('street_address', 'categorical'), ('postal_code', 'categorical'), ('city', 'categorical'), ('state_province', 'categorical'), ('country_id', 'categorical')]

    ## Reasoning process:
    Step 1. Columns selection: 
    - X-axes [X]: The 'location_code' column is selected as it represents the different locations in the dataset.
- Y-axes [Y]: The 'count' aggregate function is used on the 'location_code' column to get the total number of each location code.
- Transform filter [F] and [G]: The 'group' transform is applied on the 'x' axis to group the data by location code.
    Step 2. Chart selection:- Aggregation function [AggFunction]: Count
- Color [Z]: Not specified
- Filtering function [F]: Not specified
- Grouping function [G]: location_code
- Bin function [B]: Not specified
- Sorting [S]: Not specified
- TopK [K]: Not specified
    Step 3. Columns math operation selection: Count

    ## Response: 
    Step 1. Vegazero visualization: mark bar  encoding x location_code y aggregate count location_code transform group x
    Step 2. Visualization explanation: 
        The 'location_code' column is chosen because it represents the different locations, which is what the user is interested in. The 'count' aggregate function is used to get the total number of each location code, which is what the user wants to see. The 'group' transform is applied to group

Environment:
Python Version: 3.x
CUDA Version: 12.2
GPU: NVIDIA 4060 Ti

Additional Notes:

The model might not be correctly parsing or understanding the provided dataset structure or the query context.
Suggestions on how to properly format the input or any necessary adjustments to improve the model's performance would be greatly appreciated.

The paper is great, and thanks for your help!

@Darius18 Thank you for your interest in V-RECS.

The problem is in how you are passing the dataset: the model expects a list of (column name, column type) pairs, not a JSON dump of the whole table.

To fix this and use the model correctly:

First, parse the dataset with the following function, which loads the file and converts it into the expected shape:

from typing import Tuple, Union

import pandas as pd

# `utils` is the V-RECS helper module (see the src folder of the repository); its import is not shown here.


def load_data(uploaded_file: pd.DataFrame) -> Union[str, Tuple[list, pd.DataFrame]]:
    """
    Load and preprocess data from an uploaded file.

    Parameters
    ----------
    uploaded_file : pd.DataFrame
        The uploaded file containing data to be loaded.

    Returns
    -------
    Union[str, Tuple[List[Tuple[str, str]], pd.DataFrame]]
        A tuple containing:
        - A list of tuples with column names and their corresponding custom data types.
        - The loaded DataFrame with columns renamed to lowercase.
        In case of an error, returns a string with the error message.
    """
    try:
        # Detect the file encoding and the column separator before parsing
        encoding = utils.detect_encoding(uploaded_file)
        delimiter = utils.detect_separator(uploaded_file, encoding)

        df = pd.read_csv(uploaded_file, encoding=encoding, on_bad_lines='skip', delimiter=delimiter)
        df = df.rename(columns=lambda x: x.lower())

        # Pair each column name with its custom type label
        column_types = [(col, utils.map_dtype_to_custom(dtype)) for col, dtype in df.dtypes.items()]

        return column_types, df
    except Exception as e:
        return f"Error loading file: {e}"
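
The `utils` helpers are not shown in this thread, so here is a minimal, standard-library-only sketch of the shape that `column_types` is expected to have for score.csv. It assumes a simple two-way 'numeric' / 'categorical' labelling, inferred from the type labels echoed in the model output above; `infer_column_types` is a hypothetical stand-in and not the repository's `utils.map_dtype_to_custom`.

```python
import csv
import io


def _is_number(value: str) -> bool:
    """Return True if the string parses as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def infer_column_types(csv_text: str) -> list:
    """Hypothetical stand-in: label a column 'numeric' if every value parses as a number."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    rows = [row for row in reader if row]  # skip empty lines
    column_types = []
    for i, name in enumerate(header):
        values = [row[i] for row in rows]
        kind = "numeric" if values and all(_is_number(v) for v in values) else "categorical"
        # Column names are lowercased, mirroring the rename step in load_data
        column_types.append((name.lower(), kind))
    return column_types


# The sample file from the question
csv_text = "name,score-1,score-2\nfef,88,32\nefwf,78,43\nefew,56,21\n"
column_types = infer_column_types(csv_text)
print(column_types)
```

The resulting list of (column name, type label) tuples, not the table contents, is what goes into the dataset slot of the prompt.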

Then pass the column_types returned by this function, together with your query, to the model using the following prompt template:


Your task is to recommend and explain to a user the best visualization for a given dataset using the VegaZero template. 
Here are the available options: mark [T], encoding x [X], y [Y], aggregate [AggFunction], color [Z], transform filter [F], group [G], bin [B], sort [S], topk [K].

Let's think step by step.

Request:

{query}

Dataset:

{dataset}

Reasoning process:
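
Putting the template to work might look like the sketch below. Several things here are assumptions rather than the project's confirmed usage: the "## " markers in front of the section headers are inferred from the model output echoed earlier in this thread, the `column_types` list is the assumed result for score.csv, and `build_prompt` and `generate` are hypothetical helper names. `max_new_tokens` and `return_full_text` are standard arguments of the transformers text-generation pipeline; `max_length` counts the prompt tokens as well, which may be why the output above stops mid-sentence.

```python
PROMPT_TEMPLATE = """Your task is to recommend and explain to a user the best visualization for a given dataset using the VegaZero template.
Here are the available options: mark [T], encoding x [X], y [Y], aggregate [AggFunction], color [Z], transform filter [F], group [G], bin [B], sort [S], topk [K].

Let's think step by step.

## Request:
{query}
## Dataset:
{dataset}

## Reasoning process:
"""


def build_prompt(query: str, column_types: list) -> str:
    """Fill the template with the user query and the (name, type) column list."""
    return PROMPT_TEMPLATE.format(query=query, dataset=column_types)


def generate(prompt: str, device: int = -1) -> str:
    """Run the model on a prompt (downloads the model on first use)."""
    # Imported lazily so that building the prompt does not require transformers
    from transformers import pipeline

    pipe = pipeline("text-generation", model="DeepvizLab/vrecs", device=device)
    # max_new_tokens limits only the generated part; return_full_text=False drops the echoed prompt
    result = pipe(prompt, max_new_tokens=500, return_full_text=False)
    return result[0]["generated_text"]


# Assumed shape of column_types for score.csv
column_types = [("name", "categorical"), ("score-1", "numeric"), ("score-2", "numeric")]
query = "Generate a bar chart that shows the value for each category."
prompt = build_prompt(query, column_types)
print(prompt)
```

Calling `generate(prompt)` (or `generate(prompt, device=0)` on a GPU) would then return only the model's reasoning and VegaZero response.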



However, if you want to avoid writing the code, I have fixed the README so that you can launch the V-RECS web app and test it quickly. (All the code is available in the src folder, along with the documentation in the docs folder.)

Thank you again for your interest. I look forward to hearing whether this fixes it. This is also my first contribution, so any feedback would be great!

lucapodo / V-RECS