instructlab / sdg

Python library for Synthetic Data Generation
Apache License 2.0

Traceback on `Dataset.from_pandas(df)` in e2e CI test of full pipeline #90

Open russellb opened 1 week ago

russellb commented 1 week ago

I've been working on an end-to-end CI job that includes the full SDG pipeline. I saw this exception occur in one of the test runs.

https://github.com/russellb/ilab-runner/actions/runs/9830002619/job/27135724673

INFO 2024-07-07 20:24:16,994 pipeline.py:48: generate Running block: gen_mmlu_knowledge
INFO 2024-07-07 20:24:16,994 pipeline.py:49: generate Dataset({
    features: ['task_description', 'domain', 'document', 'icl_query_1', 'icl_response_1', 'icl_query_2', 'icl_response_2', 'icl_query_3', 'icl_response_3'],
    num_rows: 144
})
INFO 2024-07-07 20:51:11,456 pipeline.py:48: generate Running block: gen_knowledge
INFO 2024-07-07 20:51:11,456 pipeline.py:49: generate Dataset({
    features: ['task_description', 'domain', 'document', 'icl_query_1', 'icl_response_1', 'icl_query_2', 'icl_response_2', 'icl_query_3', 'icl_response_3', 'mmlubench_question', 'mmlubench_answer', '__index_level_0__'],
    num_rows: 118
})
Traceback (most recent call last):
  File "/actions-runner/_work/ilab-runner/ilab-runner/venv/bin/ilab", line 8, in <module>
    sys.exit(ilab())
             ^^^^^^
  File "/actions-runner/_work/ilab-runner/ilab-runner/venv/lib64/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/actions-runner/_work/ilab-runner/ilab-runner/venv/lib64/python3.11/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/actions-runner/_work/ilab-runner/ilab-runner/venv/lib64/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/actions-runner/_work/ilab-runner/ilab-runner/venv/lib64/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/actions-runner/_work/ilab-runner/ilab-runner/venv/lib64/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/actions-runner/_work/ilab-runner/ilab-runner/venv/lib64/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/actions-runner/_work/ilab-runner/ilab-runner/venv/lib64/python3.11/site-packages/click/decorators.py", line 33, in new_func
    return f(get_current_context(), *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/actions-runner/_work/ilab-runner/ilab-runner/venv/lib64/python3.11/site-packages/instructlab/utils.py", line 551, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/actions-runner/_work/ilab-runner/ilab-runner/venv/lib64/python3.11/site-packages/instructlab/data/generate.py", line 194, in generate
    generate_data(
  File "/actions-runner/_work/ilab-runner/ilab-runner/venv/lib64/python3.11/site-packages/instructlab/sdg/generate_data.py", line 281, in generate_data
    new_generated_data = sdg.generate(ds)
                         ^^^^^^^^^^^^^^^^
  File "/actions-runner/_work/ilab-runner/ilab-runner/venv/lib64/python3.11/site-packages/instructlab/sdg/sdg.py", line 19, in generate
    dataset = pipeline.generate(dataset)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/actions-runner/_work/ilab-runner/ilab-runner/venv/lib64/python3.11/site-packages/instructlab/sdg/pipeline.py", line 58, in generate
    dataset = self._drop_duplicates(dataset, cols=drop_duplicates_cols)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/actions-runner/_work/ilab-runner/ilab-runner/venv/lib64/python3.11/site-packages/instructlab/sdg/pipeline.py", line 27, in _drop_duplicates
    return Dataset.from_pandas(df)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/actions-runner/_work/ilab-runner/ilab-runner/venv/lib64/python3.11/site-packages/datasets/arrow_dataset.py", line 883, in from_pandas
    return cls(table, info=info, split=split)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/actions-runner/_work/ilab-runner/ilab-runner/venv/lib64/python3.11/site-packages/datasets/arrow_dataset.py", line 717, in __init__
    self._data = self.data.cast(self.info.features.arrow_schema)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/actions-runner/_work/ilab-runner/ilab-runner/venv/lib64/python3.11/site-packages/datasets/table.py", line 860, in cast
    return InMemoryTable(table_cast(self.table, *args, **kwargs))
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/actions-runner/_work/ilab-runner/ilab-runner/venv/lib64/python3.11/site-packages/datasets/table.py", line 2302, in table_cast
    return cast_table_to_schema(table, schema)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/actions-runner/_work/ilab-runner/ilab-runner/venv/lib64/python3.11/site-packages/datasets/table.py", line 2256, in cast_table_to_schema
    raise CastError(
datasets.table.CastError: Couldn't cast
task_description: string
domain: string
document: string
icl_query_1: string
icl_response_1: string
icl_query_2: string
icl_response_2: string
icl_query_3: string
icl_response_3: string
mmlubench_question: string
mmlubench_answer: string
__index_level_0__: int64
question: string
response: string
__index_level_0__: int64
-- schema metadata --
pandas: '{"index_columns": ["__index_level_0__"], "column_indexes": [{"na' + 2070
to
{'task_description': Value(dtype='string', id=None), 'domain': Value(dtype='string', id=None), 'document': Value(dtype='string', id=None), 'icl_query_1': Value(dtype='string', id=None), 'icl_response_1': Value(dtype='string', id=None), 'icl_query_2': Value(dtype='string', id=None), 'icl_response_2': Value(dtype='string', id=None), 'icl_query_3': Value(dtype='string', id=None), 'icl_response_3': Value(dtype='string', id=None), 'mmlubench_question': Value(dtype='string', id=None), 'mmlubench_answer': Value(dtype='string', id=None), '__index_level_0__': Value(dtype='int64', id=None), 'question': Value(dtype='string', id=None), 'response': Value(dtype='string', id=None)}
because column names don't match
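
The column listing in the error above can be checked mechanically. The following is a minimal sketch, not part of the SDG code: it copies the listing from the `CastError` message and reports any column name that occurs more than once. The helper name `duplicated_columns` and the constant `CAST_ERROR_COLUMNS` are made up for illustration.

```python
from collections import Counter

# Column listing copied from the "Couldn't cast" part of the CastError above.
CAST_ERROR_COLUMNS = """\
task_description: string
domain: string
document: string
icl_query_1: string
icl_response_1: string
icl_query_2: string
icl_response_2: string
icl_query_3: string
icl_response_3: string
mmlubench_question: string
mmlubench_answer: string
__index_level_0__: int64
question: string
response: string
__index_level_0__: int64
"""


def duplicated_columns(schema_text: str) -> list[str]:
    """Return column names that appear more than once, in first-seen order."""
    # Each line looks like "name: type"; keep only the name part.
    names = [
        line.split(":", 1)[0].strip()
        for line in schema_text.splitlines()
        if ":" in line
    ]
    counts = Counter(names)
    # dict.fromkeys preserves order while dropping repeats.
    return [name for name in dict.fromkeys(names) if counts[name] > 1]


print(duplicated_columns(CAST_ERROR_COLUMNS))  # ['__index_level_0__']
```

The table being cast has two `__index_level_0__` columns, while the target schema (a dict) can only hold that name once, which is consistent with the "column names don't match" message. A plausible reading, not confirmed in this thread, is that an earlier pandas round trip left `__index_level_0__` behind as a regular column (it is already listed in the features before `gen_knowledge` runs) and the next `Dataset.from_pandas(df)` call added the index again. If so, resetting the index before conversion or passing `preserve_index=False` to `Dataset.from_pandas` might avoid the duplicate.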

russellb commented 1 week ago

The same error happened again here: https://github.com/russellb/ilab-runner/actions/runs/9831900354/job/27139983196

Note that after seeing how many knowledge samples we were processing, I switched the knowledge sample used in the test: https://github.com/instructlab/instructlab/pull/1620

I'm noting this in case the problem disappears now that a smaller knowledge example is in place to speed up CI.

russellb commented 1 week ago

Another variation of a similar error -- https://github.com/instructlab/sdg/actions/runs/9846627527/job/27184863665

Looks like this is #99.

INFO 2024-07-08 20:59:39,250 pipeline.py:49: generate Dataset({
    features: ['task_description', 'seed_context', 'seed_question', 'seed_response', 'context'],
    num_rows: 62
})
ERROR 2024-07-08 20:59:39,250 block.py:37: _validate Missing key: 'num_samples'
Traceback (most recent call last):
  File "/actions-runner/_work/sdg/sdg/venv/bin/ilab", line 8, in <module>
    sys.exit(ilab())
             ^^^^^^
  File "/actions-runner/_work/sdg/sdg/venv/lib64/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/actions-runner/_work/sdg/sdg/venv/lib64/python3.11/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/actions-runner/_work/sdg/sdg/venv/lib64/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/actions-runner/_work/sdg/sdg/venv/lib64/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/actions-runner/_work/sdg/sdg/venv/lib64/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/actions-runner/_work/sdg/sdg/venv/lib64/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/actions-runner/_work/sdg/sdg/venv/lib64/python3.11/site-packages/click/decorators.py", line 33, in new_func
    return f(get_current_context(), *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/actions-runner/_work/sdg/sdg/venv/lib64/python3.11/site-packages/instructlab/utils.py", line 551, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/actions-runner/_work/sdg/sdg/venv/lib64/python3.11/site-packages/instructlab/data/generate.py", line 194, in generate
    generate_data(
  File "/actions-runner/_work/sdg/sdg/venv/lib64/python3.11/site-packages/instructlab/sdg/generate_data.py", line 283, in generate_data
    new_generated_data = sdg.generate(ds)
                         ^^^^^^^^^^^^^^^^
  File "/actions-runner/_work/sdg/sdg/venv/lib64/python3.11/site-packages/instructlab/sdg/sdg.py", line 19, in generate
    dataset = pipeline.generate(dataset)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/actions-runner/_work/sdg/sdg/venv/lib64/python3.11/site-packages/instructlab/sdg/pipeline.py", line 58, in generate
    dataset = self._drop_duplicates(dataset, cols=drop_duplicates_cols)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/actions-runner/_work/sdg/sdg/venv/lib64/python3.11/site-packages/instructlab/sdg/pipeline.py", line 25, in _drop_duplicates
    df = dataset.to_pandas()
         ^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'to_pandas'
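
In this second traceback, the `Missing key: 'num_samples'` error is logged first, and the `AttributeError` only surfaces later when `_drop_duplicates` receives `None`. The sketch below shows one way a pipeline loop could fail earlier with a clearer message. It is an illustration under assumptions, not the SDG implementation: `run_blocks`, `BlockReturnedNoneError`, and `needs_num_samples` are hypothetical names, and plain lists of dicts stand in for datasets.

```python
from typing import Callable, Iterable, Optional

Rows = list[dict]
Block = Callable[[Rows], Optional[Rows]]


class BlockReturnedNoneError(RuntimeError):
    """Raised when a block hands back None instead of a dataset."""


def run_blocks(blocks: Iterable[tuple[str, Block]], rows: Rows) -> Rows:
    """Run each named block in order, stopping at the first None result."""
    for name, block in blocks:
        result = block(rows)
        if result is None:
            # Fail here, at the block that produced None, rather than in a
            # later step that tries to call a method on the missing dataset.
            raise BlockReturnedNoneError(
                f"block {name!r} returned None instead of a dataset; "
                "check the validation errors logged before this point"
            )
        rows = result
    return rows


def needs_num_samples(rows: Rows) -> Optional[Rows]:
    """Hypothetical block: gives up with None if a required key is missing."""
    if any("num_samples" not in row for row in rows):
        return None
    return rows


try:
    run_blocks([("gen_questions", needs_num_samples)], [{"context": "x"}])
except BlockReturnedNoneError as err:
    print(err)
```

With a guard like this, the failure would name the block whose output was missing instead of reporting that `'NoneType'` has no attribute `to_pandas`.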

oindrillac commented 1 week ago

The above error is in the freeform flow, so it is not related to #99.