Open LostRuins opened 2 days ago

Hi, I'd like to check: is proper attention masking applied when using `packing=true`?

What I mean is that, within the same batch, say we have 2 independent/unrelated samples packed together:

[INSTRUCTION_1] [RESPONSE_1] [INSTRUCTION_2] [RESPONSE_2] [PAD]...

Do the tokens of Sequence 1 (Instruction 1 + Response 1) only attend to Sequence 1 itself, and likewise for Sequence 2? Or does it "leak", i.e. do SEQ1 tokens attend to SEQ2?

Currently not - you may be interested in https://huggingface.co/blog/packing-with-FA2

Ah, that would explain the severely degraded performance when I packed a lot of unrelated samples together with a long context. It was getting super confused.

So I guess I have to train with `packing=false` to avoid any cross-domain instruction leakage.
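For reference, this is roughly what "proper" masking would mean here: a block-diagonal causal mask per packed row, so each sample only attends to itself. The sketch below is just for illustration, not the trainer's actual implementation; the `block_causal_mask` helper and the example lengths `[4, 3]` are made up. The FA2 approach in the blog linked above achieves the same separation without materializing a mask, by passing per-sequence `position_ids` to the varlen flash-attention kernels.

```python
import torch

def block_causal_mask(seq_lens: list[int]) -> torch.Tensor:
    """Boolean mask for one packed row: entry (i, j) is True if position i may
    attend to position j. Causal within each packed sequence, fully blocked
    across sequences."""
    total = sum(seq_lens)
    mask = torch.zeros(total, total, dtype=torch.bool)
    start = 0
    for n in seq_lens:
        # Lower-triangular block: causal attention inside this segment only.
        mask[start:start + n, start:start + n] = torch.tril(torch.ones(n, n)).bool()
        start += n
    return mask

# Two unrelated samples packed together: [INSTRUCTION_1 RESPONSE_1] = 4 tokens,
# [INSTRUCTION_2 RESPONSE_2] = 3 tokens. With this mask, the off-diagonal 4x3
# and 3x4 blocks are all zero, so SEQ1 never attends to SEQ2 and vice versa.
# A plain causal mask over the whole packed row would instead let SEQ2 tokens
# attend to all of SEQ1, which is the leakage being asked about.
print(block_causal_mask([4, 3]).int())
```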