unslothai / unsloth

Finetune Llama 3.2, Mistral, Phi & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0

Is there proper attention masking done when applying packing=true? #1207

Open LostRuins opened 2 days ago

LostRuins commented 2 days ago

Hi, I'd like to check: is proper attention masking applied when packing=true is set?

What I mean is that, within the same batch, say we have 2 independent/unrelated samples packed together.

Instruction 1: 
What is 5+5?
Response 1: 
Answer is 10
Instruction 2: 
What is the capital of Japan?
Response 2: 
Tokyo

[INSTRUCTION_1] [RESPONSE_1] [INSTRUCTION_2] [RESPONSE_2] [PAD]...

Do tokens for Sequence 1 (Instruction 1 + Response 1) only attend to Sequence 1 itself, and likewise for Sequence 2? Or does it "leak", i.e. do SEQ1 tokens attend to SEQ2 tokens?
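To make the question concrete, here is a small plain-PyTorch sketch of the two behaviours being asked about (the sequence lengths and variable names are made up for illustration): a single causal mask over the whole packed row lets Sequence 2 attend back into Sequence 1, whereas a block-diagonal causal mask keeps each packed sample isolated.

```python
import torch

# Packed row: [INSTRUCTION_1][RESPONSE_1][INSTRUCTION_2][RESPONSE_2]
# Suppose sequence 1 occupies 6 tokens and sequence 2 occupies 4 tokens (illustrative).
seq_lens = [6, 4]
total = sum(seq_lens)

# (a) Plain causal mask over the packed row: token i attends to every j <= i,
#     so sequence 2's tokens can "see" sequence 1 (the leakage in question).
causal = torch.tril(torch.ones(total, total, dtype=torch.bool))

# (b) Block-diagonal causal mask: each sample only attends within itself.
block_causal = torch.zeros(total, total, dtype=torch.bool)
start = 0
for n in seq_lens:
    block_causal[start:start + n, start:start + n] = torch.tril(
        torch.ones(n, n, dtype=torch.bool)
    )
    start += n

# Rows 6..9 (sequence 2) differ between the two masks: under (a) they also
# attend to columns 0..5, i.e. to sequence 1.
print((causal != block_causal).any(dim=1))
```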

danielhanchen commented 1 day ago

Currently no - you may be interested in https://huggingface.co/blog/packing-with-FA2
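For reference, the recipe in that blog post packs examples with transformers' DataCollatorWithFlattening, which flattens a batch into one row and emits position_ids that restart at each example boundary, so FlashAttention-2's variable-length kernels keep packed samples from attending to each other. A rough sketch follows; the model id and dataset are placeholders, and whether this drops straight into an unsloth-patched model is an assumption, not something verified here.

```python
# Sketch of contamination-free packing per the linked blog post.
# Assumes a recent transformers release that ships DataCollatorWithFlattening
# and a model loaded with attn_implementation="flash_attention_2".
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorWithFlattening,
    Trainer,
    TrainingArguments,
)

model_name = "meta-llama/Llama-3.2-1B"  # placeholder model id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    attn_implementation="flash_attention_2",  # required for this packing scheme
)

# Flattens each batch into a single packed row and emits position_ids that
# reset at every example boundary, so FA2 treats the samples as separate
# sequences (no cross-attention between them).
collator = DataCollatorWithFlattening()

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", per_device_train_batch_size=4),
    train_dataset=tokenized_train_dataset,  # assumed: a pre-tokenized dataset
    data_collator=collator,
)
trainer.train()
```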

LostRuins commented 1 day ago

Ah, that would explain the severely degraded performance when I packed a lot of unrelated samples together with a long context. It was getting super confused.

So I guess I have to train with packing=false to avoid any cross-domain instruction leakage.
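In case it helps anyone else hitting this, the relevant switch in the usual unsloth + TRL setup is the packing flag on SFTTrainer. A minimal sketch is below; the model id and dataset are placeholders, and argument placement varies across TRL versions (newer ones move packing onto SFTConfig), so treat this as illustrative rather than canonical.

```python
# Minimal sketch: turning packing off in a typical unsloth + TRL fine-tune.
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-1B-Instruct",  # placeholder model id
    max_seq_length=4096,
    load_in_4bit=True,
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,   # assumed: a dataset with a "text" field
    dataset_text_field="text",
    max_seq_length=4096,
    packing=False,                 # samples are padded individually, not concatenated
    args=TrainingArguments(output_dir="out", per_device_train_batch_size=2),
)
trainer.train()
```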