Closed: haixuanTao closed this pull request 3 months ago.
Adding a default implementation for `__str__` and `__repr__` for `Tokenizer`.

Test it out:
Before:

```python
>>> from tokenizers import Tokenizer, decoders, models, normalizers, pre_tokenizers, processors
>>> from tokenizers.implementations import BaseTokenizer
>>> toki = Tokenizer(models.BPE())
>>> print(toki)
<tokenizers.Tokenizer object at 0x7d687d32bc30>
```
After:

```python
>>> from tokenizers import Tokenizer, decoders, models, normalizers, pre_tokenizers, processors
>>> from tokenizers.implementations import BaseTokenizer
>>> toki = Tokenizer(models.BPE())
>>> print(toki)
TokenizerImpl {
    normalizer: None,
    pre_tokenizer: None,
    model: PyModel {
        model: RwLock {
            data: BPE(
                BPE {
                    dropout: None,
                    unk_token: None,
                    continuing_subword_prefix: None,
                    end_of_word_suffix: None,
                    fuse_unk: false,
                    byte_fallback: false,
                    vocab: 0,
                    merges: 0,
                    ignore_merges: false,
                },
            ),
            poisoned: false,
            ..
        },
    },
    post_processor: None,
    decoder: None,
    added_vocabulary: AddedVocabulary {
        added_tokens_map: {},
        added_tokens_map_r: {},
        added_tokens: [],
        special_tokens: [],
        special_tokens_set: {},
        split_trie: (
            AhoCorasick(
                dfa::DFA(
                    D 000000: \x00 => 0
                    F 000001:
                     >000002: \x00 => 2
                      000003: \x00 => 0
                    match kind: LeftmostLongest
                    prefilter: false
                    state length: 4
                    pattern length: 0
                    shortest pattern length: 18446744073709551615
                    longest pattern length: 0
                    alphabet length: 1
                    stride: 1
                    byte classes: ByteClasses(0 => [0-255])
                    memory usage: 16
                ),
            ),
            [],
        ),
        split_normalized_trie: (
            AhoCorasick(
                dfa::DFA(
                    D 000000: \x00 => 0
                    F 000001:
                     >000002: \x00 => 2
                      000003: \x00 => 0
                    match kind: LeftmostLongest
                    prefilter: false
                    state length: 4
                    pattern length: 0
                    shortest pattern length: 18446744073709551615
                    longest pattern length: 0
                    alphabet length: 1
                    stride: 1
                    byte classes: ByteClasses(0 => [0-255])
                    memory usage: 16
                ),
            ),
            [],
        ),
        encode_special_tokens: false,
    },
    truncation: None,
    padding: None,
}
```
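For readers skimming the thread, here is a minimal, purely illustrative Python sketch of the pattern this PR applies on the Rust side: a compact `__str__` and a pretty-printed `__repr__` instead of the default `<object at 0x...>` form. The `ToyTokenizer` class, its fields, and the JSON-based formatting are assumptions made for illustration only, not the bindings' actual code.

```python
# Hypothetical stand-in for the idea behind this PR: give a tokenizer-like
# object a readable __str__/__repr__ instead of the default "<object at 0x...>".
import json


class ToyTokenizer:
    def __init__(self, model, normalizer=None, pre_tokenizer=None):
        self.model = model
        self.normalizer = normalizer
        self.pre_tokenizer = pre_tokenizer

    def _state(self):
        # Collect the fields worth surfacing in the printed form.
        return {
            "model": self.model,
            "normalizer": self.normalizer,
            "pre_tokenizer": self.pre_tokenizer,
        }

    def __str__(self):
        # Compact one-line view.
        return f"ToyTokenizer({json.dumps(self._state())})"

    def __repr__(self):
        # Multi-line, indented view for interactive inspection.
        return f"ToyTokenizer({json.dumps(self._state(), indent=2)})"


if __name__ == "__main__":
    print(ToyTokenizer(model="BPE"))        # compact form via __str__
    print(repr(ToyTokenizer(model="BPE")))  # pretty-printed form via __repr__
```

Judging by the "After" output above, the real representation is derived from the underlying Rust structs rather than a hand-built dict as in this toy example.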
Hope this helps :)
Open to any criticism of the representation or the implementation.
Inspired by https://github.com/dora-rs/dora/pull/503
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Closing in favor of #1542!