Closed hieuchi911 closed 3 days ago
We checked our original code. We set "output" to the value of "gen_answer" before processing the generated data for SeqKD. Thanks for pointing this out. We will clarify this in the README.
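A minimal sketch of that preprocessing step, assuming the generated data is a JSONL file with one JSON object per line and that each object carries the "gen_answer" and "output" keys mentioned in this thread. This is an illustration, not the repository's actual code: the function name `promote_gen_answer` and the file paths in the usage note are hypothetical.

```python
import json


def promote_gen_answer(src_path, dst_path, gen_key="gen_answer", out_key="output"):
    """Copy each record's teacher response (gen_key) into out_key.

    Hypothetical helper: reads a JSONL file, overwrites out_key with the
    value of gen_key where gen_key is present, and writes a new JSONL file.
    Returns the number of records that were updated.
    """
    updated = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            if gen_key in record:
                # Assumption: the teacher-generated text should become the
                # training target that the tokenization step reads.
                record[out_key] = record[gen_key]
                updated += 1
            dst.write(json.dumps(record, ensure_ascii=False) + "\n")
    return updated
```

Usage would look something like `promote_gen_answer("generated.jsonl", "generated_for_seqkd.jsonl")` (both paths are placeholders), run before the tokenization script so that it picks up the teacher's responses through "output".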
Gotcha, thanks for the clarification.
I'm running sequence-level KD of LLaMA, and in the first step (generating responses with the teacher) I observed a problem.

Here, `scripts/llama/tools/generate_data_seqkd.sh` will create an augmented dataset in a jsonl file, where all json objects are of this format:

Later on, when creating binary files to store the tokenized version of this new dataset, `scripts/llama/tools/process_pseudo_data_seqkd.sh` only uses `instruction`, `input`, and `output` for tokenization, and `gen_answer` is not used at all, while I believe `gen_answer` should be used instead of `output`.
Is this a bug?