huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

[`pre_tokenizers`] Fix sentencepiece based Metaspace #1357

Closed · ArthurZucker closed this 7 months ago

ArthurZucker commented 9 months ago

Fixes the issue where the Metaspace pre-tokenizer applies the pre-tokenization to all of the sub-strings. For example:

use regex::Regex;
use tokenizers::pre_tokenizers::metaspace::{Metaspace, PrependScheme};
use tokenizers::{PreTokenizedString, PreTokenizer, SplitDelimiterBehavior};

// Metaspace pre-tokenizer that replaces spaces with `▁` and always prepends it.
let pretok = Metaspace::new_with_prepend_scheme('▁', true, PrependScheme::Always);
let mut pretokenized = PreTokenizedString::from("Hey my friend <s>how▁are you");

// Split around the added token `<s>`, keeping it as an isolated piece,
// the same way AddedVocabulary splits the input before pre-tokenization.
let re_ref = Regex::new(r"(<s>)").unwrap();
pretokenized
    .split(|_, sequence| sequence.split(&re_ref, SplitDelimiterBehavior::Isolated))
    .expect("AddedVocabulary bad split");
pretok.pre_tokenize(&mut pretokenized).unwrap();
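To see the resulting (piece, offsets) pairs listed below, something along these lines should work (a sketch; it assumes the `get_splits` accessor together with the `OffsetReferential` / `OffsetType` re-exports, and that the offsets below are byte offsets relative to the normalized string):

use tokenizers::{OffsetReferential, OffsetType};

// Sketch: collect each split as (piece, offsets), dropping the token info.
let splits: Vec<_> = pretokenized
    .get_splits(OffsetReferential::Normalized, OffsetType::Byte)
    .into_iter()
    .map(|(piece, offsets, _tokens)| (piece, offsets))
    .collect();
println!("{:?}", splits);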

with legacy:

vec![("▁Hey", (0, 6)), ("▁my", (6, 11)), ("▁friend", (11, 20)), ("▁", (20, 23)), ("<s>", (23, 26)), ("how", (26, 29)), ("▁are", (29, 35)), ("▁you", (35, 41))]

without legacy:

vec![("▁Hey", (0, 6)), ("▁my", (6, 11)), ("▁friend", (11, 20)), ("▁", (20, 23)), ("▁<s>", (23, 29)), ("▁how", (29, 35)), ("▁are", (35, 41)), ("▁you", (41, 47))]
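For reference, a minimal sketch of how the three `PrependScheme` variants behave on a plain (un-split) input, using the same constructor as in the snippet above (the exact outputs are not asserted here, only printed):

use tokenizers::pre_tokenizers::metaspace::{Metaspace, PrependScheme};
use tokenizers::{OffsetReferential, OffsetType, PreTokenizedString, PreTokenizer};

// Sketch: pre-tokenize the same input under each prepend scheme and print the pieces.
let schemes = vec![
    ("always", PrependScheme::Always),
    ("first", PrependScheme::First),
    ("never", PrependScheme::Never),
];
for (name, scheme) in schemes {
    let pretok = Metaspace::new_with_prepend_scheme('▁', true, scheme);
    let mut s = PreTokenizedString::from("Hey friend");
    pretok.pre_tokenize(&mut s).unwrap();
    let pieces: Vec<&str> = s
        .get_splits(OffsetReferential::Normalized, OffsetType::Byte)
        .into_iter()
        .map(|(piece, _, _)| piece)
        .collect();
    println!("{}: {:?}", name, pieces);
}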
HuggingFaceDocBuilderDev commented 9 months ago

The documentation is not available anymore as the PR was closed or merged.

ArthurZucker commented 9 months ago

Serialization is not working properly yet! Will have to fix this
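For reference, a minimal round-trip check along these lines could exercise the (de)serialization path; this is a sketch assuming `Metaspace` keeps its serde `Serialize`/`Deserialize` impls and that `serde_json` is available as a dev dependency:

use tokenizers::pre_tokenizers::metaspace::{Metaspace, PrependScheme};

// Sketch: serialize the pre-tokenizer, deserialize it back, and compare the two
// JSON strings to check that the prepend scheme survives a round trip.
let original = Metaspace::new_with_prepend_scheme('▁', true, PrependScheme::Always);
let json = serde_json::to_string(&original).unwrap();
let reloaded: Metaspace = serde_json::from_str(&json).unwrap();
assert_eq!(json, serde_json::to_string(&reloaded).unwrap());
println!("round-tripped: {}", json);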

ArthurZucker commented 7 months ago

On it!