instructlab / sdg

Python library for Synthetic Data Generation
Apache License 2.0

The generated dataset contains "Question" and "Answer" tags at the beginning of the response #103

Closed: oindrillac closed this issue 3 months ago

oindrillac commented 3 months ago

Upon testing the freeform flow on a few examples, it looks like the user and assistant values in the dataset generated by the simple pipeline flow begin with "Question:" and "Answer:". This could be unexpected behavior of the simple prompt template:

{"system": "You are an AI language model developed by IBM Research. You are a cautious assistant. You carefully follow instructions. You are helpful and harmless and you follow ethical guidelines and promote positive behavior.", "user": "Question: What is another way to say \"happy\"?", "assistant": "Answer: A synonym for \"happy\" is \"joyful\"."}
{"system": "You are an AI language model developed by IBM Research. You are a cautious assistant. You carefully follow instructions. You are helpful and harmless and you follow ethical guidelines and promote positive behavior.", "user": "Question: Sarah and Alex each have one brother. If Sarah and Alex are sisters, how many brothers do they share between them?", "assistant": "Answer: Sarah and Alex share one brother."}
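
For anyone who wants to check their own output for the same symptom, here is a minimal sketch that scans JSONL records like the two above for a leading "Question:" or "Answer:" tag. The helper names `find_prefixed` and `strip_prefix` are hypothetical and not part of the sdg library; the only assumption about the data is the `user` and `assistant` keys shown in the samples.

```python
import json
import re

# Matches a leading "Question:" or "Answer:" tag plus any whitespace after it.
PREFIX = re.compile(r"^\s*(Question|Answer):\s*")


def find_prefixed(lines):
    """Return (line_index, key) pairs whose value starts with a tag.

    `lines` is any iterable of JSONL strings, e.g. an open file.
    """
    flagged = []
    for index, line in enumerate(lines):
        record = json.loads(line)
        for key in ("user", "assistant"):
            if PREFIX.match(record.get(key, "")):
                flagged.append((index, key))
    return flagged


def strip_prefix(text):
    """Remove one leading "Question:" / "Answer:" tag, if present."""
    return PREFIX.sub("", text, count=1)


sample = [
    '{"user": "Question: What is another way to say \\"happy\\"?", '
    '"assistant": "Answer: A synonym for \\"happy\\" is \\"joyful\\"."}',
    '{"user": "What is 2 + 2?", "assistant": "4"}',
]
print(find_prefixed(sample))
```

Stripping the tag after generation only hides the symptom; it does not explain why the simple prompt template produces it with this model.
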
russellb commented 3 months ago

Was this using the default setup (merlinite)?

oindrillac commented 3 months ago

Ah this was on Mixtral, maybe that's why the odd behavior. Closing since default expected combo is merlinite + simple, and mixtral + full

russellb commented 3 months ago

> Ah this was on Mixtral, maybe that's why the odd behavior. Closing since default expected combo is merlinite + simple, and mixtral + full

Yeah, I think if the system can run mixtral, we'd want them using the full pipeline.

We could add some validation on this. If we see the model is mixtral but the pipeline is "simple", it's probably worth emitting a warning message. What do you think?
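
A rough sketch of what such a check might look like, under the assumption that the model is identified by a name string and the pipeline by the strings "simple" or "full". The function `check_model_pipeline` is hypothetical and does not exist in sdg; where a check like this would hook into the real code is left open.

```python
import warnings


def check_model_pipeline(model_name: str, pipeline: str) -> bool:
    """Warn when a Mixtral model is paired with the simple pipeline.

    Returns True when the combination looks expected, False otherwise.
    Matching on the substring "mixtral" is an assumption made for this
    sketch, not how sdg identifies model families.
    """
    if "mixtral" in model_name.lower() and pipeline == "simple":
        warnings.warn(
            f"Model '{model_name}' is being used with the 'simple' pipeline; "
            "the expected combinations are merlinite + simple and "
            "mixtral + full.",
            UserWarning,
            stacklevel=2,
        )
        return False
    return True


# Illustrative model names only.
check_model_pipeline("merlinite-7b", "simple")   # expected combination, no warning
check_model_pipeline("mixtral-8x7b", "simple")   # emits a UserWarning
```

A warning rather than an error keeps the mixtral + simple combination usable for anyone who chooses it deliberately.
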