huggingface / blog

Public repo for HF blog posts
https://hf.co/blog

Add an article on neural "tokenization" #2154

Closed (apehex closed 1 month ago)

apehex commented 3 months ago

Hey there 👾

Here's an article on why and how to replace current tokenizers.

The model behind it is called tokun. It specializes in text embeddings and produces much denser, more meaningful vectors than traditional tokenizers.

The Hugging Face link at the end of the article is not valid yet: I have to export my TF model first :)

apehex commented 3 months ago

BTW I've written notebooks too (training and demo)