markmc opened 2 weeks ago
Another example of where I think we're missing context for grounded skills:
In `datamixing.py`:
```python
def _convert_to_leaf_node_messages(sample: dict, sys_prompt: str):
    ...
    user_query = _unescape(_get_question_hack(sample))
    response = _unescape(_get_response_hack(sample))
    sample["id"] = str(uuid.uuid4())
    sample["messages"] = [
        {"content": sys_prompt, "role": "system"},
        {"content": user_query, "role": "user"},
        {"content": response, "role": "assistant"},
    ]
```
AIUI, we should be including the context in the user message here?
Just noticed this while documenting dataset formats in #236
In `_gen_train_data()` we are taking a dataset with `question` and `response` columns, and generating a training dataset in two different formats. (Ok, in the case of the `simple` pipeline, we actually parse `question` and `response` from `output`, but that's not super important here.)

If the dataset also contains a `context` column, we append `context` to the `question`.

In the full pipeline for grounded skills, we do generate this `context` column based on the `seed_context` column.

In the simple pipeline for grounded skills, we are not including a `context` column at all. I suspect the intent was to include the original seed context in each sample? If so, we'd need to add a `DuplicateColumnsBlock` that would copy `seed_context` to `context`?
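A minimal sketch of the suggested copy, using plain list-of-dict samples rather than the pipeline's actual Block/dataset API (the function name and signature here are hypothetical; a real `DuplicateColumnsBlock` would be wired into the pipeline config):

```python
def duplicate_column(samples: list[dict], old: str, new: str) -> list[dict]:
    # Copy the value of the `old` column into `new` for every sample,
    # leaving the original column in place
    return [{**s, new: s[old]} for s in samples]


# e.g. make the seed context available as `context` so that
# _gen_train_data() appends it to the question:
# samples = duplicate_column(samples, "seed_context", "context")
```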