huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

Added ability to inspect a 'Sequence' decoder and the `AddedVocabulary`. #1443

Closed: eaplatanios closed this 3 months ago

eaplatanios commented 5 months ago

This PR is in a similar spirit to #1341 and adds a couple more functions that let one construct a modified version of an existing Tokenizer. I've followed the existing style and conventions for the newly introduced functions.
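To make the motivation concrete, here is a minimal, self-contained sketch of the accessor pattern, covering only the sequence-decoder half of the change. Everything in it is an assumption made for illustration: `Step`, `SequenceDecoder`, `get_steps`, `get_steps_mut`, and `without_strip` are hypothetical stand-ins and are not the types or function names from the `tokenizers` crate or from this PR.

```rust
// Hypothetical stand-ins: these are NOT the types from the `tokenizers` crate.
// They only illustrate why read and write accessors on a wrapped sequence
// make it possible to derive a modified copy of an existing component.

/// A single decoding step (stand-in for a real decoder).
#[derive(Debug, Clone, PartialEq)]
pub enum Step {
    /// Replace every occurrence of `from` with `to` in each token.
    Replace { from: String, to: String },
    /// Trim the given character from both ends of each token.
    Strip(char),
}

impl Step {
    fn apply(&self, tokens: Vec<String>) -> Vec<String> {
        match self {
            Step::Replace { from, to } => tokens
                .into_iter()
                .map(|t| t.replace(from.as_str(), to))
                .collect(),
            Step::Strip(c) => tokens
                .into_iter()
                .map(|t| t.trim_matches(*c).to_string())
                .collect(),
        }
    }
}

/// Steps applied in order (stand-in for a sequence decoder).
#[derive(Debug, Clone, PartialEq, Default)]
pub struct SequenceDecoder {
    steps: Vec<Step>,
}

impl SequenceDecoder {
    pub fn new(steps: Vec<Step>) -> Self {
        Self { steps }
    }

    /// Read-only view of the wrapped steps, for inspection.
    pub fn get_steps(&self) -> &[Step] {
        &self.steps
    }

    /// Mutable access to the wrapped steps, for in-place edits.
    pub fn get_steps_mut(&mut self) -> &mut Vec<Step> {
        &mut self.steps
    }

    /// Run every step in order, then join the tokens into one string.
    pub fn decode(&self, tokens: Vec<String>) -> String {
        self.steps
            .iter()
            .fold(tokens, |acc, step| step.apply(acc))
            .concat()
    }
}

/// Build a modified copy of an existing decoder, leaving the original untouched.
pub fn without_strip(original: &SequenceDecoder) -> SequenceDecoder {
    let mut copy = original.clone();
    copy.get_steps_mut()
        .retain(|step| !matches!(step, Step::Strip(_)));
    copy
}
```

With a decoder built from a `Replace` step (`"_"` to `" "`) followed by `Strip(' ')`, `without_strip` returns a copy that keeps the leading spaces, while the original still strips them. Without the two accessors, the caller would have to rebuild the whole sequence from scratch.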

eaplatanios commented 5 months ago

@ArthurZucker @Narsil gentle ping on this one and on #1444.

eaplatanios commented 4 months ago

@ArthurZucker @Narsil gentle ping about this PR. It should not be controversial; it is in a similar spirit to #1341 and has the same motivation.

ArthurZucker commented 3 months ago

Really sorry about all the delays, there's a lot happening on transformers. I'll free up some time.

ArthurZucker commented 3 months ago

Also, can you add tests for set and get? 🤗

eaplatanios commented 2 months ago

Sorry, this fell through the cracks a bit over the past couple of weeks, and I only just saw the last couple of comments. Thanks for approving and merging this!

ArthurZucker commented 3 weeks ago

Thanks for your contribution 🤗