Open · andersonbcdefg opened this issue 10 months ago
Hi @andersonbcdefg, sorry for the extremely late reply.
So the tokenizer in MLX Data is actually quite fast on smallish documents. It optimizes the tokenization over the whole passed document, so it is quite a bit slower when passed a huge text like the one above (even though it obviously doesn't make sense to optimize over the whole graph in that case).
For example, the wikitext benchmark (https://github.com/ml-explore/mlx-data/blob/c1204bce12ce495add1ed68338543cb4b5c5a595/benchmarks/comparative/wikitext/mlx_data.py) on my Mac tokenizes a few million tokens per second, which should be more than enough for any use case.
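For concreteness, here is a minimal sketch of that kind of single-core measurement on many small documents. The `read_trie_from_spm` helper and the `tokenizer.model` path are assumptions on my part; adapt them to however the CharTrie is actually built in your setup.

```python
# Minimal throughput sketch: tokenize many small documents and count tokens/s.
# The SPM model path and the read_trie_from_spm helper are assumptions.
import time

import mlx.data as dx
from mlx.data.tokenizer_helpers import read_trie_from_spm  # assumed helper

trie, _weights = read_trie_from_spm("tokenizer.model")  # hypothetical path

docs = ["a short document of a few hundred characters ..."] * 500
stream = dx.stream_python_iterable(
    lambda: ({"doc": d.encode()} for d in docs)
).tokenize("doc", trie)

start = time.perf_counter()
n_tokens = sum(len(sample["doc"]) for sample in stream)
elapsed = time.perf_counter() - start
print(f"MLX Data: {n_tokens / elapsed:,.0f} tokens/s")
```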
Hmm, well the document in my example is only a few hundred characters. It's a batch of 500 of the same doc, but the doc is short so I'm not sure that optimizing over a large graph would explain the disparity in speed.
Oh sorry I kinda misunderstood the code snippet.
Having said that, I wouldn't say it is significantly slower than SPM. Running your benchmark with varying document sizes on my M2 Air laptop, I get the following comparison with SPM:
| Doc length (chars) | Batch Tokens | MLX time / SPM time | MLX Tokens/s |
|-------------------:|-------------:|--------------------:|-------------:|
| 57  | 14336  | 5.40 | 1098130.11 |
| 114 | 28672  | 8.25 | 1125880.12 |
| 172 | 43520  | 9.30 | 1168599.07 |
| 229 | 58368  | 9.89 | 1205666.60 |
| 287 | 72704  | 9.63 | 1182750.15 |
| 344 | 85504  | 11.0 | 1103767.26 |
| 401 | 100864 | 10.9 | 1125994.54 |
| 459 | 113664 | 10.7 | 1130119.24 |
| 516 | 125952 | 10.9 | 1074331.83 |
| 574 | 139264 | 10.5 | 1072074.78 |
Keep in mind that this is single core, so >1M tok/s per core is, I think, pretty reasonable for almost all use cases. We would of course appreciate PRs that improve on that to reach the speed of SPM, which is probably somewhere around 2M-3M tok/s per core on my machine.
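For reference, the SPM side of such a per-document comparison might look roughly like the sketch below. This is illustrative, not the actual benchmark script from this thread; the model path and document list are placeholders.

```python
# Minimal sketch of a sentencepiece baseline for a per-document throughput
# comparison. The model path and documents are placeholders.
import time

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")  # hypothetical path
docs = ["a short document of a few hundred characters ..."] * 500

start = time.perf_counter()
ids = sp.encode(docs)  # one list of token ids per document
elapsed = time.perf_counter() - start
print(f"SPM: {sum(len(x) for x in ids) / elapsed:,.0f} tokens/s")
```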
Yeah, I hope it can be sped up! A 10x difference in speed makes a big difference, especially for offline data processing workflows (I understand 1M tok/s is fine if you're feeding an LLM in real time, but tokenization is also important for batch processing!).
Sure, I understand, and we should work on it. However, the numbers above are still single core. When using the following pipeline on my M2 Air, it is 3x slower than SPM:
```python
import mlx.data as dx

# `random_texts` is a list of Python strings and `trie` is the CharTrie
# built from the SPM vocabulary.
dset = (
    dx.stream_python_iterable(lambda: ({"doc": s.encode()} for s in random_texts))
    .tokenize("doc", trie)
    .prefetch(20, 4)               # tokenize in 4 background threads
    .sliding_window("doc", 512, 512)
    .shape("doc", "length", 0)     # store the window length (dim 0 of "doc") under "length"
    .batch(128)
    .prefetch(2, 1)                # keep a couple of batches ready
)
```
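A rough way to see the effect of the prefetch workers is to drain the stream and time it. The sketch below is illustrative and not part of the original snippet.

```python
# Sketch: drain the pipeline above and report end-to-end token throughput.
import time

start = time.perf_counter()
total_tokens = 0
for batch in dset:
    total_tokens += batch["doc"].size  # token ids per batch (may include padding)
elapsed = time.perf_counter() - start
print(f"{total_tokens / elapsed:,.0f} tokens/s with prefetch workers")
```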
Under what circumstances is MLX supposed to provide a speedup over sentencepiece? In a naive test with the same SPM .model file, I'm able to tokenize 1000 batches in 13 seconds with sentencepiece, and it takes over 5 minutes with MLX. Hardware is an M2 MacBook Pro with 64GB unified memory. Is the CharTrie tokenization only useful when paired with key_transform? Are there plans to add a "tokenize_batch" with better parallelization/concurrency?
Code for reference: