Open vandrw opened 2 months ago
Hey! Did you add this:
```diff
 #[pymodule]
 pub fn pre_tokenizers(m: &Bound<'_, PyModule>) -> PyResult<()> {
     m.add_class::<PyPreTokenizer>()?;
     m.add_class::<PyByteLevel>()?;
     m.add_class::<PyWhitespace>()?;
     m.add_class::<PyWhitespaceSplit>()?;
     m.add_class::<PySplit>()?;
     m.add_class::<PyBertPreTokenizer>()?;
     m.add_class::<PyMetaspace>()?;
     m.add_class::<PyCharDelimiterSplit>()?;
     m.add_class::<PyPunctuation>()?;
+    m.add_class::<PyCustomPreTokenizer>()?;
     m.add_class::<PySequence>()?;
     m.add_class::<PyDigits>()?;
     m.add_class::<PyUnicodeScripts>()?;
     Ok(())
 }
```
@ArthurZucker Hi Arthur! Thanks for looking into this.
The goal is to create a separate project. If I'm not mistaken, the approach you suggested would require me to fork the tokenizers library and then distribute the version with my custom structs, right? This would be very inconvenient for users, since they would have to wait for me to sync my fork with the latest version of the library whenever new features were added.
What I'd like to suggest here is slightly different. Instead of constraining the logic of custom components to Python implementations, as in the example below, I propose also exposing the pyo3 classes (i.e., PyPreTokenizer, PyNormalizer, PyModel, PyTrainer, ...) in a separate crate. https://github.com/huggingface/tokenizers/blob/bfd9cdeefb6fb2fb9b5514f8a5fad6d7263a69d6/bindings/python/examples/custom_components.py#L11-L16
If a crate such as `tokenizers_pyo3` were available, users could create modules that are compatible with the library, and others that build on it (e.g., `transformers`), directly in Rust. Is this something the community would consider useful? If so, I could take a look at implementing it in a few weeks.
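The idea can be sketched in plain Rust (no pyo3) to show why exposing the wrapper type matters: if the wrapper lives in a shared crate, a downstream crate only needs to implement the trait and hand its component over, instead of forking the bindings. The names below (`PyPreTokenizer`, `HyphenSplit`) and the trait shape are simplified illustrations, not the actual tokenizers API.

```rust
// Simplified stand-in for the library's PreTokenizer trait.
trait PreTokenizer {
    fn pre_tokenize(&self, input: &str) -> Vec<String>;
}

// Simplified stand-in for the pyo3 wrapper the bindings expose.
// If this type were published in a shared crate, downstream crates
// could construct it around their own components.
struct PyPreTokenizer {
    inner: Box<dyn PreTokenizer + Send>,
}

impl PyPreTokenizer {
    fn new(inner: Box<dyn PreTokenizer + Send>) -> Self {
        Self { inner }
    }

    fn pre_tokenize(&self, input: &str) -> Vec<String> {
        self.inner.pre_tokenize(input)
    }
}

// A downstream crate's custom component, written purely in Rust.
struct HyphenSplit;

impl PreTokenizer for HyphenSplit {
    fn pre_tokenize(&self, input: &str) -> Vec<String> {
        input.split('-').map(str::to_owned).collect()
    }
}

fn main() {
    // The downstream crate wraps its component without touching
    // the bindings' source.
    let wrapped = PyPreTokenizer::new(Box::new(HyphenSplit));
    let pieces = wrapped.pre_tokenize("pre-tokenizer");
    assert_eq!(pieces, vec!["pre", "tokenizer"]);
    println!("{:?}", pieces);
}
```

With the real bindings, the wrapper would additionally carry the `#[pyclass]` machinery, but the ownership pattern (the wrapper boxing a trait object) is the part a separate crate would need to expose.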
Hi! Thanks a lot for all the work put in this library!
I am interested in exposing a custom pre-tokenizer I have created as a Python class via PyO3. Here is an example:
I am now interested in using this in a Python script, and after looking at the bindings code present in the library, I tried to implement the following:
However, I could not import/call the PyPreTokenizer struct described in the following file: https://github.com/huggingface/tokenizers/blob/fdd26ba9a3f0c133427aab0423888cbde91362d7/bindings/python/src/pre_tokenizers.rs#L38
Is there a way to achieve this without having to reimplement the functionality of the PyPreTokenizer struct in my project? For example, something like `use tokenizers_pyo3::PyPreTokenizer;`?