instructlab / sdg

Python library for Synthetic Data Generation
Apache License 2.0
12 stars 28 forks source link

Grounded skill samples generated by the simple pipeline are missing context? #258

Open markmc opened 2 weeks ago

markmc commented 2 weeks ago

Just noticed this while documenting dataset formats in #236

In _gen_train_data() we are taking a dataset with question and response columns, and generating a training dataset in two different formats. (Ok, in the case of the simple pipeline, we actually parse question and response from output, but that's not super important here)

If the dataset also contains a context column, we append context to the question.

            user = _get_question_hack(synth_example)
            if len(synth_example.get("context", "")) > 0:
                user += "\n" + synth_example["context"]
            assistant = _unescape(_get_response_hack(synth_example))
            train_entry = {
                "system": _SYS_PROMPT,
                "user": _unescape(user),
                "assistant": assistant,
            }
            train_data.append(train_entry)
            sample = {
                "inputs": _unescape(user),
                "targets": assistant,
                "system": _SYS_PROMPT,
            }
            messages_data.append(_convert_to_messages(sample))

In the full pipeline for grounded skills, we do generate this context column based on the seed_context column.

In the simple pipeline for grounded skills, we are not including a context column at all. I suspect the intent was include the original seed context in each sample? If so, we'd need to add a DuplicateColumnsBlock that would copy seed_context to context?

markmc commented 2 weeks ago

Another example of where I think we're missing context for grounded skills:

In datamixing.py:

def _convert_to_leaf_node_messages(sample: dict, sys_prompt: str):
    ...
    user_query = _unescape(_get_question_hack(sample))
    response = _unescape(_get_response_hack(sample))

    sample["id"] = str(uuid.uuid4())
    sample["messages"] = [
    {"content": sys_prompt, "role": "system"},
        {"content": user_query, "role": "user"},
    {"content": response, "role": "assistant"},
    ]

AIUI, we should be included context in the user message here?