Grounded skill samples generated by the simple pipeline are missing context?

Just noticed this while documenting dataset formats in #236.

In _gen_train_data() we take a dataset with question and response columns and generate a training dataset in two different formats. (OK, in the case of the simple pipeline we actually parse the question and response out of output, but that's not important here.)

If the dataset also contains a non-empty context column, we append the context to the question, separated by a newline:

            user = _get_question_hack(synth_example)
            if len(synth_example.get("context", "")) > 0:
                user += "\n" + synth_example["context"]
            assistant = _unescape(_get_response_hack(synth_example))
            train_entry = {
                "system": _SYS_PROMPT,
                "user": _unescape(user),
                "assistant": assistant,
            }
            train_data.append(train_entry)
            sample = {
                "inputs": _unescape(user),
                "targets": assistant,
                "system": _SYS_PROMPT,
            }
            messages_data.append(_convert_to_messages(sample))
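
To make the effect concrete, here is a minimal standalone sketch of the same append logic. The helper name build_user_turn and the sample values are hypothetical, and it reads the question directly from a question key rather than going through _get_question_hack, so treat it as an illustration rather than the actual code path.

```python
def build_user_turn(example: dict) -> str:
    """Return the question, with the context appended when one is present."""
    user = example["question"]
    # Same check as the excerpt above: a missing or empty context is skipped.
    if len(example.get("context", "")) > 0:
        user += "\n" + example["context"]
    return user


# Hypothetical samples: one with a context column, one without.
grounded = {
    "question": "What is the capital?",
    "context": "Paris is the capital of France.",
}
ungrounded = {"question": "What is the capital?"}

with_context = build_user_turn(grounded)
without_context = build_user_turn(ungrounded)
```

With no context column, the user turn is just the bare question, which is what the simple pipeline's grounded skill samples appear to end up with.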

In the full pipeline for grounded skills, we do generate this context column based on the seed_context column.

In the simple pipeline for grounded skills, we don't include a context column at all. I suspect the intent was to include the original seed context in each sample. If so, would we need to add a DuplicateColumnsBlock that copies seed_context to context?
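
As a rough sketch of what that copy step would need to do, assuming samples are plain dicts: the function duplicate_column below is a hypothetical stand-in and does not reflect the real DuplicateColumnsBlock API or its configuration.

```python
def duplicate_column(samples: list[dict], src: str, dst: str) -> list[dict]:
    """Return new samples where the value of column src is also stored under dst."""
    # Build new dicts so the input samples are left unmodified.
    return [{**sample, dst: sample[src]} for sample in samples]


# Hypothetical simple-pipeline samples that only carry seed_context.
samples = [
    {"seed_context": "Paris is the capital of France.", "output": "..."},
]

with_context = duplicate_column(samples, src="seed_context", dst="context")
```

After this step each sample has a context column, so the len(synth_example.get("context", "")) > 0 check in _gen_train_data() would pass and the seed context would be appended to the question.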

instructlab/sdg #258