huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

Added support for building an `AddedVocabulary` based on a pre-existing `AddedVocabulary`. #1444

Status: Closed (eaplatanios closed this pull request 6 months ago)

eaplatanios commented 8 months ago

(This PR is stacked on top of #1443 and so currently shows the changes for both PRs together)

The changes introduced in this PR are necessary if one wants to preserve the token IDs of a pre-existing `AddedVocabulary` and, more importantly, if one wants to correctly build a fast tokenizer based on a Python (slow) tokenizer.

Consider, for example, the Yi tokenizer. If you load a fast tokenizer for Yi, you end up with token IDs 64000 and 64001 for the BOS and EOS tokens, respectively. That's because that tokenizer uses custom BOS and EOS tokens but assigns them known/pre-existing IDs via the `added_tokens_decoder` field. These IDs end up getting ignored when building the fast tokenizer, due to the code block starting here. This is also evident from the Yi model authors' recommendation to set `use_fast=False`, as shown here.
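
To make the symptom concrete, here is a minimal sketch of the mismatch (the checkpoint name is illustrative, and the printed IDs are the ones reported above, not something I have pinned to an exact Yi revision):

    from transformers import AutoTokenizer

    # Illustrative checkpoint name; any Yi checkpoint that overrides BOS/EOS via
    # added_tokens_decoder should show the same mismatch.
    name = "01-ai/Yi-34B-Chat"

    # Slow tokenizer: honors the pre-assigned IDs from added_tokens_decoder.
    slow = AutoTokenizer.from_pretrained(name, use_fast=False)
    print(slow.convert_tokens_to_ids(["<|startoftext|>", "<|endoftext|>"]))  # expected: [1, 2]

    # Fast tokenizer: the conversion ignores those IDs and appends the tokens
    # to the end of the vocab instead.
    fast = AutoTokenizer.from_pretrained(name, use_fast=True)
    print(fast.convert_tokens_to_ids(["<|startoftext|>", "<|endoftext|>"]))  # observed: [64000, 64001]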

That's at least partially because this library does not support building tokenizers using a pre-existing `AddedVocabulary`. This PR adds support for that. I plan to follow up with a PR in the transformers library after this PR is merged, updating that code block.

HuggingFaceDocBuilderDev commented 8 months ago

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

ArthurZucker commented 8 months ago

That looks nice, I'll have a look! Thanks for opening it!

eaplatanios commented 7 months ago

@ArthurZucker what happens with the Yi tokenizer is the following. The config of the tokenizer includes this field:

"added_tokens_decoder": {
    ...
    "1": {
        "content": "<|startoftext|>",
        "special": true
    },
    "2": {
        "content": "<|endoftext|>",
        "special": true
    },
    ...
}

This means that the tokenizer uses ID 1 for the special `<|startoftext|>` token and ID 2 for the special `<|endoftext|>` token. Note that the Llama tokenizer it is built on always uses ID 1 for `<s>` and ID 2 for `</s>`; that is, `<|startoftext|>` and `<|endoftext|>` are not in the original/base tokenizer's vocabulary.
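
For reference, the base Llama-style vocabulary already occupies those slots, which is why a plain lookup cannot resolve the new strings (checkpoint name is illustrative):

    from transformers import AutoTokenizer

    # Illustrative base Llama checkpoint: IDs 1 and 2 are already <s> and </s>,
    # and the Yi-specific strings are absent from the base vocab.
    base = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf", use_fast=False)
    print(base.convert_ids_to_tokens([1, 2]))       # ['<s>', '</s>']
    print("<|startoftext|>" in base.get_vocab())    # False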

So, when this tokenizer is loaded, it hits this code block. If you read that code, you'll notice that it takes the information in the `added_tokens_decoder` field, drops the token IDs entirely, and simply looks up the special tokens in the underlying/base tokenizer's vocabulary. If they don't exist, it assigns them new IDs, which in this case are 64000 and 64001. This breaks code that uses this tokenizer, because those IDs are invalid.

Instead, the recommendation when using the Yi model is to set `use_fast=False`, as shown here, to avoid this issue.
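
Roughly speaking, the behavior in that code block amounts to the following (a simplified paraphrase for illustration, not the verbatim transformers source):

    def _add_tokens_ignoring_ids(fast_tokenizer, added_tokens_decoder):
        # Simplified paraphrase of the conversion step described above: the
        # integer keys of added_tokens_decoder are discarded, and any token
        # missing from the base vocab gets a fresh ID appended at the end of
        # the vocabulary, e.g. 64000 and 64001 for Yi.
        vocab = fast_tokenizer.get_vocab()
        for _index, token in sorted(added_tokens_decoder.items()):
            if token.content not in vocab:
                fast_tokenizer.add_tokens([token])  # `_index` (e.g. 1 or 2) is never used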

This PR adds functionality that will enable me to fix the underlying behavior in the linked code block from the transformers library so that we can use fast tokenizers with a wider set of models.

ArthurZucker commented 7 months ago

I understand your requirements, but that is not the way to go with tokenizers! Tokenizers is meant to be very restrictive in how you interact with its objects, so I am not in favor of adding this. However, we can fix this in transformers. The idea is that if you are converting from a slow tokenizer, you should properly add the tokens in `convert_slow`. The way we convert the tokenizer does not take the actual vocab into account:

    def vocab(self, proto):
        vocab = [
            ("<unk>", 0.0),
            ("<s>", 0.0),
            ("</s>", 0.0),
        ]
        vocab += [(piece.piece, piece.score) for piece in proto.pieces[3:]]
        return vocab

That is hard-coded in the LlamaConverter. The Yi chat authors probably manually changed the `added_tokens_decoder`, but that can lead to a lot of issues for the tokenizers (since the backend is pretty much stateless, you would need to re-compile the normalizer regex, etc.).

We should modify `def vocab(self, proto)` to take the `added_tokens_decoder` into account, at least for the beginning tokens.

WDYT?

eaplatanios commented 7 months ago

Hmm, let me take a step back to summarize the changes here and also try to understand your suggested fix. First of all, I assume the changes introduced in #1443 are uncontroversial, right? They just add some functions that appear to be missing in the first place. If so, could you please approve that one so we can decouple the changes?

Regarding the main change introduced in this PR (i.e., `set_added_tokens_decoder`), I understand your point and I agree that this change is not ideal. However, I do not understand your proposed fix as described here:

> We should modify `def vocab(self, proto)` to take the `added_tokens_decoder` into account, at least for the beginning tokens.

Could you please elaborate? Also, how does it relate to the code block I linked earlier, which seems broken for Yi (due to the reuse of pre-existing token IDs, as I described earlier)?

ArthurZucker commented 7 months ago

Of course! The idea is that when you convert a slow tokenizer to a fast one in transformers, using `from_slow` always re-builds the vocab from scratch, but the current conversion always sets the first and second tokens to `<s>` and `</s>`. If you manually overwrite the tokenizer.json, you can replace those with whichever strings you want. If we update the LlamaConverter so that it properly builds the vocab based on `original_tokenizer.added_tokens_decoder`, then you will not have this problem anymore. One of the main goals of tokenizers is to be stateless, so the changes in this PR go against that.
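
For illustration, a rough sketch of what that converter change could look like (hypothetical code, not the actual fix; it assumes the converter can reach the slow tokenizer's `added_tokens_decoder` via `self.original_tokenizer`):

    def vocab(self, proto):
        # Sketch: take the first three slots from the slow tokenizer's
        # added_tokens_decoder when they are overridden, otherwise fall back
        # to the hard-coded defaults.
        decoder = self.original_tokenizer.added_tokens_decoder  # {id: AddedToken}

        def slot(idx, default):
            return (decoder[idx].content if idx in decoder else default, 0.0)

        vocab = [slot(0, "<unk>"), slot(1, "<s>"), slot(2, "</s>")]
        vocab += [(piece.piece, piece.score) for piece in proto.pieces[3:]]
        return vocab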

eaplatanios commented 6 months ago

@ArthurZucker thanks! In that case, I can look into that fix for transformers. However, I'd still need #1443 to be merged to support our use case, separately from what's discussed in this PR. Could you please review that one? In the meantime, I'll go ahead and close this PR.

ArthurZucker commented 6 months ago

https://github.com/huggingface/transformers/pull/29797 is probably what you are looking for!

eaplatanios commented 6 months ago

@ArthurZucker that's great, thanks!