Open jeongukjae opened 8 months ago
Is this a tf.text-specific request, or should it be filed against tensorflow?
Do you have a link for "darts lookup table"?
Ah, sorry, I didn't specify the details. It's a tensorflow-text specific request.
Darts is a double-array trie, and we can use it like a lookup table. You can check the basic interface here: https://github.com/s-yata/darts-clone/blob/master/doc/en/Interface.md#dictionary-class. Additionally, tensorflow-text already depends on darts-clone (a clone of darts), which is used in the wordpiece tokenizer:
https://github.com/tensorflow/text/blob/b32645fbf1e4fd7e81d8d03fa2d2b4872e3a270d/WORKSPACE#L37-L45
A double-array trie is a fast, memory-efficient data structure for storing a large number of strings with paired values, so it can be useful for training and serving with very large vocabularies (for example, tens of millions of entries in a single model, where a hash table can be hard to use because of its memory overhead).
So I'm suggesting implementing the basic methods of darts-clone's interface.
In some cases, it can be both faster and more memory-efficient than the hash tables in tensorflow.
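To make the request more concrete, here is a minimal Python sketch of the lookup behavior I have in mind. It is only an illustration: it uses a plain nested-dict trie instead of a real double array, and the names `TrieLookupTable`, `exact_match_search`, and `common_prefix_search` are hypothetical, loosely modeled on darts-clone's `exactMatchSearch` and `commonPrefixSearch`. It is not an existing tensorflow-text API.

```python
from typing import Dict, List, Optional, Tuple


class TrieLookupTable:
    """Hypothetical stand-in: a nested-dict trie, NOT a real double array."""

    _VALUE = object()  # sentinel dict key marking "a stored key ends here"

    def __init__(self, items: Dict[str, int]) -> None:
        self._root: dict = {}
        for key, value in items.items():
            node = self._root
            # Keys are walked byte by byte, as darts-clone does.
            for byte in key.encode("utf-8"):
                node = node.setdefault(byte, {})
            node[self._VALUE] = value

    def exact_match_search(self, key: str) -> Optional[int]:
        """Return the value stored for `key`, or None if it is absent."""
        node = self._root
        for byte in key.encode("utf-8"):
            node = node.get(byte)
            if node is None:
                return None
        return node.get(self._VALUE)

    def common_prefix_search(self, text: str) -> List[Tuple[int, int]]:
        """Return (value, length_in_bytes) for each stored key prefixing `text`."""
        results = []
        node = self._root
        for length, byte in enumerate(text.encode("utf-8"), start=1):
            node = node.get(byte)
            if node is None:
                break
            if self._VALUE in node:
                results.append((node[self._VALUE], length))
        return results


table = TrieLookupTable({"a": 0, "ab": 1, "abc": 2, "b": 3})
```

With this table, an exact lookup of `"ab"` gives its paired value, and a common-prefix search over `"abcd"` reports every stored key that is a prefix of it. A real implementation would keep the trie in darts-clone's compact array form, which is where the memory savings over a hash table would come from.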
It would be great if the darts lookup table had the following methods: