google-gemini / generative-ai-python

The official Python library for the Google Gemini API
https://pypi.org/project/google-generativeai/
Apache License 2.0

`create_tuned_model` fails when training_data is CSV #562

Closed nidhinpd-YML closed 2 months ago

nidhinpd-YML commented 2 months ago

Description of the bug:

As described in the docs, I tried passing a CSV file (both as a str file path and as a pathlib.Path object) as training data for my tuned model. JSON works fine, but with CSV the call raises an error.

Code tried as str file path

import random

name = f'generate-num-{random.randint(0,10000)}'

operation = genai.create_tuned_model(
    # You can use a tuned model here too. Set `source_model="tunedModels/..."`
    source_model=base_model.name,
    # Path to the CSV file, passed as a plain string
    training_data='/content/my_file.csv',
    id=name,
    epoch_count=100,
    batch_size=4,
    learning_rate=0.001,
)

Code tried as pathlib.Path object

import random
from pathlib import Path

name = f'generate-num-{random.randint(0,10000)}'

operation = genai.create_tuned_model(
    # You can use a tuned model here too. Set `source_model="tunedModels/..."`
    source_model=base_model.name,
    # Path to the CSV file, passed as a pathlib.Path object
    training_data=Path('/content/my_file.csv'),
    id=name,
    epoch_count=100,
    batch_size=4,
    learning_rate=0.001,
)

CSV file

text_input,output
1,2
3,4
-3,-2
twenty two,twenty three
two hundred,two hundred one
ninety nine,one hundred
8,9
-98,-97
1,000,1,001
10,100,000,10,100,001
thirteen,fourteen
eighty,eighty one
one,two
three,four
seven,eight

Error stack

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-29-c98842a796f6> in <cell line: 5>()
      3 name = f'generate-num-{random.randint(0,10000)}'
      4 
----> 5 operation = genai.create_tuned_model(
      6     # You can use a tuned model here too. Set `source_model="tunedModels/..."`
      7     source_model=base_model.name,

3 frames
/usr/local/lib/python3.10/dist-packages/google/generativeai/models.py in create_tuned_model(source_model, training_data, id, display_name, description, temperature, top_p, top_k, epoch_count, batch_size, learning_rate, input_key, output_key, client, request_options)
    338         )
    339 
--> 340     training_data = model_types.encode_tuning_data(
    341         training_data, input_key=input_key, output_key=output_key
    342     )

/usr/local/lib/python3.10/dist-packages/google/generativeai/types/model_types.py in encode_tuning_data(data, input_key, output_key)
    257             with f:
    258                 data = csv.DictReader(content)
--> 259                 return _convert_iterable(data, input_key, output_key)
    260 
    261     if hasattr(data, "keys"):

/usr/local/lib/python3.10/dist-packages/google/generativeai/types/model_types.py in _convert_iterable(data, input_key, output_key)
    310     new_data = list()
    311     for example in data:
--> 312         example = encode_tuning_example(example, input_key, output_key)
    313         new_data.append(example)
    314     return protos.Dataset(examples=protos.TuningExamples(examples=new_data))

/usr/local/lib/python3.10/dist-packages/google/generativeai/types/model_types.py in encode_tuning_example(example, input_key, output_key)
    322         example = protos.TuningExample(text_input=a, output=b)
    323     else:  # dict
--> 324         example = protos.TuningExample(text_input=example[input_key], output=example[output_key])
    325     return example
    326 

KeyError: 'text_input'

I noticed that in the last step of CSV processing, encode_tuning_example sends each CSV row through the dict branch, where the lookup example[input_key] raises the KeyError.
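For anyone hitting the same error, here is a minimal sketch of how a dict-style lookup on a csv.DictReader row can raise this KeyError. It does not call the library; first_row_lookup is a hypothetical helper that only mimics the indexing seen in the traceback, and the byte-order mark in the second sample is an assumed cause for illustration, not something confirmed for the file in this report.

```python
import csv
import io


def first_row_lookup(csv_text, input_key="text_input", output_key="output"):
    """Hypothetical helper: index the first CSV row by the expected column names."""
    reader = csv.DictReader(io.StringIO(csv_text))
    row = next(reader)
    # Same kind of lookup as the dict branch in the traceback.
    return row[input_key], row[output_key]


good = "text_input,output\n1,2\n"
# Assumption for illustration: a byte-order mark stuck to the first header name.
bad = "\ufefftext_input,output\n1,2\n"

print(first_row_lookup(good))

try:
    first_row_lookup(bad)
except KeyError as err:
    # The header is '\ufefftext_input', so the key 'text_input' is not found.
    print("KeyError:", err)
```

Any header that does not match input_key exactly (extra whitespace, different capitalization, a missing header row) would fail the same way in this sketch.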

Actual vs expected behavior:

Actual

KeyError: 'text_input' (the same traceback as shown under "Error stack" above)

Expected

Any other information you'd like to share?

No response

Gunand3043 commented 2 months ago

@nidhinpd-YML

There might be some formatting issues with the CSV file you provided. I tried both of the code snippets you shared, passing a string file path and using a pathlib.Path object, and both worked.

You can also use a URL for a CSV file. Try the code below and let me know if you’re still facing any issues.

import random
name = f'generate-num-{random.randint(0,10000)}'

operation = genai.create_tuned_model(
    # You can use a tuned model here too. Set `source_model="tunedModels/..."`
    source_model=base_model.name,
    # Put the CSV file path or URL here
    training_data='https://docs.google.com/spreadsheets/d/1Sixq4JkYGCp1tJu0KukI6lLIZZ3kivvIq-pHcSiq6QA/edit?usp=sharing',
    id = name,
    epoch_count = 100,
    batch_size=4,
    learning_rate=0.001,
)

nidhinpd-YML commented 2 months ago

@Gunand3043 Thanks. It was a CSV formatting issue.
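For future readers, a rough sketch of a pre-flight check that could be run on a CSV before passing it as training_data. find_csv_problems is a hypothetical helper, not part of the library; it assumes the default column names text_input and output, and only looks at the header and the number of fields per row. The thread does not say exactly what was wrong with the original file, so treat this as one possible check rather than the fix.

```python
import csv
import io


def find_csv_problems(csv_text, required=("text_input", "output")):
    """Hypothetical helper: list header and field-count problems in CSV text."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ["file is empty"]

    problems = []
    header = rows[0]
    for key in required:
        if key not in header:
            problems.append(f"header is missing column {key!r}: {header!r}")

    # Row numbers count parsed CSV rows, with the header as row 1.
    for row_no, row in enumerate(rows[1:], start=2):
        if row and len(row) != len(header):
            problems.append(
                f"row {row_no}: expected {len(header)} fields, got {len(row)}"
            )
    return problems


sample = (
    "text_input,output\n"
    "1,2\n"
    "1,000,1,001\n"                  # unquoted commas split into four fields
    '"10,100,000","10,100,001"\n'    # quoted values stay as two fields
)

for problem in find_csv_problems(sample):
    print(problem)
```

Values that contain commas need to be wrapped in double quotes so the csv module reads them as a single field.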