[MiniLLM] teacher generated responses `gen_answer` not used in seqKD

hieuchi911 commented 1 month ago

I'm running sequence-level KD (SeqKD) of LLaMA, and I have a question about the first step, which generates responses with the teacher:

Generate responses with the teacher:
```
bash scripts/llama/tools/generate_data_seqkd.sh /PATH/TO/MiniLLM
bash scripts/llama/tools/process_pseudo_data_seqkd.sh /PATH/TO/MiniLLM
```
I observed a problem. `scripts/llama/tools/generate_data_seqkd.sh` creates an augmented dataset in a JSONL file, where every JSON object has this format:
```
{
    "instruction": "...",    # the instruction
    "prompt": "...",         # the instruction prompt including the input data
    "input": "...",          # the input data
    "output": "...",         # the ground truth
    "gen_answer": "..."      # the teacher-generated response
}
```
Later, when creating the binary files that store the tokenized version of this new dataset, `scripts/llama/tools/process_pseudo_data_seqkd.sh` only uses `instruction`, `input`, and `output` for tokenization. `gen_answer` is not used at all, whereas I believe `gen_answer` should be used instead of `output`.

Is this a bug?

t1101675 commented 1 month ago

We checked our original code. We set `"output"` to the value of `"gen_answer"` before processing the generated data for SeqKD. Thanks for pointing this out. We will clarify this in the README.
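
For anyone hitting the same thing before the README is updated, here is a minimal sketch of that intermediate step, not the exact code from the original pipeline. It assumes the generated file is plain JSONL with the fields listed above; the function name `use_teacher_output` and the file paths in the usage note are hypothetical.

```python
import json


def use_teacher_output(src_path, dst_path):
    """Copy each record's "gen_answer" into "output" and write a new JSONL file.

    Hypothetical helper: the name and signature are not part of MiniLLM.
    Returns the number of records written.
    """
    written = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            # Replace the ground truth with the teacher-generated response so
            # that the later tokenization step trains on the teacher output.
            # Raises KeyError if a record has no "gen_answer" field.
            record["output"] = record["gen_answer"]
            dst.write(json.dumps(record, ensure_ascii=False) + "\n")
            written += 1
    return written
```

Something like `use_teacher_output("generated.jsonl", "generated_seqkd.jsonl")` (placeholder file names) would be run between the two scripts, with the processing script then pointed at the rewritten file.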

hieuchi911 commented 1 month ago

Gotcha, thanks for the clarification.

microsoft / LMOps

[MiniLLM] teacher generated responses `gen_answer` not used in seqKD #250