alon-albalak closed this 2 years ago
Unsure about this one. Will need to check. Please recheck from your side as well.
Me too. It's hard to tell what the intention was here: instruction_option_sampledata and instruction_binary_sampledata are handled differently, even though I believe they are meant to be handled the same way.
Line 610 shows that instruction_option_sampledata will gather more samples for each task. Line 613 shows that instruction_binary_sampledata will only use the samples from the last task.
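To make the difference concrete, here is a toy illustration of the two patterns described above. It is not the repository's code: `gather_all`, `keep_last`, and the `tasks` dict are hypothetical names, and the real lines 610 and 613 may look different.

```python
def gather_all(tasks):
    """Accumulate samples across every task (the instruction_option pattern)."""
    collected = []
    for samples in tasks.values():
        collected.extend(samples)  # grows with each task
    return collected


def keep_last(tasks):
    """Overwrite on every iteration (the instruction_binary pattern)."""
    collected = []
    for samples in tasks.values():
        collected = list(samples)  # earlier tasks are discarded
    return collected


# Hypothetical toy data: two tasks with a few samples each.
tasks = {"task_a": [1, 2, 3], "task_b": [4, 5]}
all_samples = gather_all(tasks)
last_samples = keep_last(tasks)
```

With many tasks, the first pattern grows without bound while the second stays at the size of a single task, which would be consistent with the 100k vs. 5k counts mentioned in this thread.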
I think I understand now that my suggestion is also incorrect. The reason I found this is that the instruction_option task was creating over 100k samples, while instruction_binary had only 5k. 5k was what I set for both --instruction_option_size and --instruction_binary_size args.
I'll make another commit so that it has what I believe is the desired behavior: sample --instruction_binary/option_sampledata from all tasks, but only up to a maximum of --instruction_binary/option_size.
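A minimal sketch of that behavior, assuming each task yields a plain list of samples. The names `sample_capped`, `samples_by_task`, `max_size`, and `seed` are hypothetical and do not come from the repository; the actual commit may do this differently.

```python
import random


def sample_capped(samples_by_task, max_size, seed=0):
    """Pool samples from every task, then keep at most max_size of them."""
    pooled = []
    for task_samples in samples_by_task.values():
        pooled.extend(task_samples)  # gather from all tasks, not just the last
    if len(pooled) <= max_size:
        return pooled  # fewer samples than the cap: keep everything
    rng = random.Random(seed)  # local RNG so the draw is reproducible
    return rng.sample(pooled, max_size)  # uniform draw without replacement


# Hypothetical toy data: 30 samples across two tasks, capped at 5.
data = {"task_a": list(range(10)), "task_b": list(range(10, 30))}
subset = sample_capped(data, max_size=5)
```

Pooling before sampling means larger tasks contribute proportionally more samples. If a per-task balance is wanted instead, the cap would have to be split across tasks, which is a separate design choice.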
The instruction_binary task indeed needed fixing. In my experiments, I set instruction_option to 200, which led to around 2200 data points. With this new update, one can set the instruction_option task size to 22 directly. Although I used 2200, one can try 5000 points too.
Sampling is currently broken for the instruction_option task: it returns the list of all possible samples for this task.