huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

Custom fast PreTokenizer, ported via PyO3 to Python #1566

Open vandrw opened 2 months ago

vandrw commented 2 months ago

Hi! Thanks a lot for all the work put in this library!

I am interested in exposing a custom pre-tokenizer I have written in Rust as a Python class via PyO3. Here is an example:

use tokenizers::tokenizer::{normalizer::Range, PreTokenizedString, PreTokenizer, Result};
use tokenizers::utils::macro_rules_attribute;
use tokenizers::impl_serde_type;

// Toy example: always return a single (start, end) range covering bytes 0..2.
fn get_example_ranges(_input: &str) -> Result<Vec<(usize, usize)>> {
    Ok(vec![(0, 2)])
}

#[derive(Clone, Debug, PartialEq, Eq)]
#[macro_rules_attribute(impl_serde_type!)]
pub struct CustomPreTokenizer;

impl Default for CustomPreTokenizer {
    fn default() -> Self {
        Self
    }
}

impl PreTokenizer for CustomPreTokenizer {
    fn pre_tokenize(&self, pretokenized: &mut PreTokenizedString) -> Result<()> {
        // Split every normalized piece into the sub-slices given by get_example_ranges.
        pretokenized.split(|_, normalized| {
            let ranges = get_example_ranges(normalized.get())?;
            Ok(ranges
                .into_iter()
                .map(|item| {
                    normalized
                        .slice(Range::Normalized(item.0..item.1))
                        .expect("Invalid input")
                })
                .collect::<Vec<_>>())
        })
    }
}
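
For reference, the pre-tokenizer can be sanity-checked on its own by running it against a PreTokenizedString. A minimal sketch, continuing the file above and assuming the OffsetReferential and OffsetType re-exports at the crate root (adjust the paths if needed):

use tokenizers::{OffsetReferential, OffsetType};

fn main() -> Result<()> {
    let pretok = CustomPreTokenizer::default();
    let mut input = PreTokenizedString::from("hello world");
    pretok.pre_tokenize(&mut input)?;

    // Print each resulting split together with its byte offsets in the original string.
    for (slice, offsets, _) in input.get_splits(OffsetReferential::Original, OffsetType::Byte) {
        println!("{:?} -> {:?}", slice, offsets);
    }
    Ok(())
}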

I would now like to use this from a Python script, and after looking at the bindings code in the library, I tried to implement the following:

#[pyclass(extends=PyPreTokenizer, name = "CustomPreTokenizer")]
pub struct PyCustomPreTokenizer {}
#[pymethods]
impl PyCustomPreTokenizer {
    #[new]
    #[pyo3(text_signature = "(self)")]
    fn new() -> (Self, PyPreTokenizer) {
        (PyCustomPreTokenizer {}, CustomPreTokenizer {}.into())
    }
}

However, I could not find a way to import or use the PyPreTokenizer struct defined in the following file: https://github.com/huggingface/tokenizers/blob/fdd26ba9a3f0c133427aab0423888cbde91362d7/bindings/python/src/pre_tokenizers.rs#L38

Is there a way to achieve this without having to reimplement the functionality of the PyPreTokenizer struct in my project? For example, by being able to write use tokenizers_pyo3::PyPreTokenizer?

ArthurZucker commented 1 month ago

Hey! Did you add this:

#[pymodule]
pub fn pre_tokenizers(m: &Bound<'_, PyModule>) -> PyResult<()> {
    m.add_class::<PyPreTokenizer>()?;
    m.add_class::<PyByteLevel>()?;
    m.add_class::<PyWhitespace>()?;
    m.add_class::<PyWhitespaceSplit>()?;
    m.add_class::<PySplit>()?;
    m.add_class::<PyBertPreTokenizer>()?;
    m.add_class::<PyMetaspace>()?;
    m.add_class::<PyCharDelimiterSplit>()?;
    m.add_class::<PyPunctuation>()?;
+   m.add_class::<PyCustomPreTokenizer>()?;
    m.add_class::<PySequence>()?;
    m.add_class::<PyDigits>()?;
    m.add_class::<PyUnicodeScripts>()?;
    Ok(())
}
vandrw commented 1 month ago

@ArthurZucker Hi Arthur! Thanks for looking into this.

The goal is to create a separate project. If I'm not mistaken, the approach you suggested would require me to fork the tokenizers library and distribute a version that includes my custom structs, right? This would be very inconvenient for users, since they would have to wait for me to sync my fork with the latest version of the library whenever new features are added.

What I'd like to suggest here is slightly different. Instead of constraining the logic of custom components to Python implementations, as in the example linked below, I would propose also exposing the pyo3 classes (i.e., PyPreTokenizer, PyNormalizer, PyModel, PyTrainer, ...) in a separate crate. https://github.com/huggingface/tokenizers/blob/bfd9cdeefb6fb2fb9b5514f8a5fad6d7263a69d6/bindings/python/examples/custom_components.py#L11-L16

If a crate such as tokenizers_pyo3 were available, users could write modules that are compatible with the library, and with others that build on it (e.g., transformers), directly in Rust. Is this something the community would consider useful? If so, I could take a look at implementing it in a few weeks.
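
To illustrate the idea, here is a hypothetical sketch of what a downstream crate could look like if such a tokenizers_pyo3 crate existed and re-exported PyPreTokenizer (none of this is available today; the crate name, the module name, and the Into<PyPreTokenizer> conversion are all assumptions):

use pyo3::prelude::*;
// Hypothetical crate: would re-export the binding structs from bindings/python.
use tokenizers_pyo3::PyPreTokenizer;

#[pyclass(extends = PyPreTokenizer, name = "CustomPreTokenizer")]
pub struct PyCustomPreTokenizer {}

#[pymethods]
impl PyCustomPreTokenizer {
    #[new]
    fn new() -> (Self, PyPreTokenizer) {
        // CustomPreTokenizer is the Rust pre-tokenizer from the first comment;
        // the .into() conversion into PyPreTokenizer is assumed to be exposed as well.
        (PyCustomPreTokenizer {}, CustomPreTokenizer::default().into())
    }
}

// A separate extension module distributed alongside (not inside) the tokenizers package.
#[pymodule]
fn my_pre_tokenizers(m: &Bound<'_, PyModule>) -> PyResult<()> {
    m.add_class::<PyCustomPreTokenizer>()?;
    Ok(())
}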