huggingface / distil-whisper

Distilled variant of Whisper for speech recognition. 6x faster, 50% smaller, within 1% word error rate.
MIT License

correctly mask bos in prompted ids #79

Closed sanchit-gandhi closed 5 months ago

sanchit-gandhi commented 5 months ago

This PR ensures the padding mask is correctly constructed for both the un-prompted and prompted cases.

Un-prompted

Given input ids of:

<bos>       a        b        c        d        e     <eos>

The corresponding labels are the ids shifted by one position (the last N-1 ids), and the decoder input ids are the first N-1 ids:

labels:          a        b        c        d        e     <eos>

                 ↑        ↑        ↑        ↑        ↑        ↑

input ids:    <bos>       a        b        c        d        e        

Prompted

For prompted ids of format:

<prev>    f        g        h        i     <bos>    a        b        c        d        e     <eos>

We should have:

labels:                                                   a        b        c        d        e     <eos>

                                                          ↑        ↑        ↑        ↑        ↑        ↑

input ids:    <prev>    f        g        h        i     <bos>     a        b        c        d        e          

=> The important aspect is that the <bos> token id never appears in the labels as a prediction target, which was the behaviour prior to #77. The bug in #77 was that, for un-prompted ids, the first target label (a) was also masked. This PR corrects this behaviour.
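The two cases above can be sketched in a few lines of plain Python. This is only an illustration of the intended masking, not the code changed in this PR: the helper name `build_decoder_inputs_and_labels`, the integer token ids, and the use of -100 as the ignore value (the default `ignore_index` of PyTorch's cross-entropy loss) are all assumptions made for the example.

```python
IGNORE_INDEX = -100  # assumed ignore value (PyTorch cross-entropy default)


def build_decoder_inputs_and_labels(ids, bos_id, ignore_index=IGNORE_INDEX):
    """Hypothetical helper: split token ids into decoder inputs and masked labels."""
    decoder_input_ids = list(ids[:-1])  # first N-1 ids
    labels = list(ids[1:])              # last N-1 ids, aligned with the inputs

    # Position of <bos> in the full sequence: 0 when un-prompted,
    # len(prompt) when a prompt precedes it.
    bos_index = ids.index(bos_id)

    # Mask every label that belongs to the prompt, including the position
    # whose target would be <bos>. Nothing is masked when bos_index == 0.
    for j in range(bos_index):
        labels[j] = ignore_index

    return decoder_input_ids, labels


# Illustrative token ids (not real Whisper vocabulary ids).
PREV, BOS, EOS = 1, 2, 3
a, b, c, d, e, f, g, h, i = range(10, 19)

unprompted = [BOS, a, b, c, d, e, EOS]
prompted = [PREV, f, g, h, i, BOS, a, b, c, d, e, EOS]

print(build_decoder_inputs_and_labels(unprompted, BOS))
print(build_decoder_inputs_and_labels(prompted, BOS))
```

In the un-prompted case no label is masked, so `a` remains the first target. In the prompted case the five labels for `f`, `g`, `h`, `i` and `<bos>` are set to the ignore value, so the first real target is again `a`, predicted from the `<bos>` input.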