tensorflow / text

Making text a first-class citizen in TensorFlow.
https://www.tensorflow.org/beta/tutorials/tensorflow_text/intro
Apache License 2.0

Feature request: add layer for darts lookup table? #1225

Open jeongukjae opened 8 months ago

jeongukjae commented 8 months ago

In some cases, a Darts lookup table can be faster and more memory-efficient than the hash table in TensorFlow.

It would be great if the Darts lookup table had the following methods

cantonios commented 8 months ago

Is this a tf.text-specific request, or should it be filed against tensorflow?

Do you have a link for "darts lookup table"?

jeongukjae commented 8 months ago

Ah, sorry, I didn't specify the details. It's a tensorflow-text-specific request.

Darts is a double-array trie, and it can be used like a lookup table. You can check the basic interface here: https://github.com/s-yata/darts-clone/blob/master/doc/en/Interface.md#dictionary-class. Additionally, tensorflow-text already depends on darts-clone (a clone of Darts), which is used in the wordpiece tokenizer:

https://github.com/tensorflow/text/blob/b32645fbf1e4fd7e81d8d03fa2d2b4872e3a270d/WORKSPACE#L37-L45

A double-array trie is a fast and memory-efficient data structure for storing large numbers of strings with paired values, so it can be useful for training and serving with very large vocabularies (for example, tens of millions of entries in a single model, where a hash table can be hard to use because of its memory cost).

So I'm suggesting implementing the basic methods of the darts-clone interface.
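
To make the requested lookup semantics a bit more concrete, here is a rough Python sketch. It is only an illustration under assumptions: it uses a plain dict-based trie rather than a real double array, and the class name `TrieLookupSketch` and the methods `exact_match_search` and `common_prefix_search` are hypothetical names loosely modeled on the darts-clone dictionary interface. None of this is an existing tensorflow-text API.

```python
from typing import Dict, Iterable, List, Optional, Tuple


class TrieLookupSketch:
    """Hypothetical trie-backed lookup table.

    Uses plain nested dicts keyed by byte value, NOT a double array,
    so it shows the lookup semantics only, not the memory layout.
    """

    _VALUE = object()  # sentinel dict key marking "a stored key ends here"

    def __init__(self, items: Iterable[Tuple[str, int]]) -> None:
        self._root: Dict = {}
        for key, value in items:
            node = self._root
            for byte in key.encode("utf-8"):
                node = node.setdefault(byte, {})
            node[self._VALUE] = value

    def exact_match_search(self, key: str) -> Optional[int]:
        """Return the value stored for `key`, or None if it is absent."""
        node = self._root
        for byte in key.encode("utf-8"):
            node = node.get(byte)
            if node is None:
                return None
        return node.get(self._VALUE)

    def common_prefix_search(self, text: str) -> List[Tuple[int, int]]:
        """Return (value, matched_byte_length) for each stored key that is a prefix of `text`."""
        results: List[Tuple[int, int]] = []
        node = self._root
        for length, byte in enumerate(text.encode("utf-8"), start=1):
            node = node.get(byte)
            if node is None:
                break
            if self._VALUE in node:
                results.append((node[self._VALUE], length))
        return results


# Example vocabulary (made up for illustration).
table = TrieLookupSketch([("a", 0), ("ab", 1), ("abc", 2), ("b", 3)])
print(table.exact_match_search("ab"))      # 1
print(table.exact_match_search("abcd"))    # None
print(table.common_prefix_search("abcd"))  # [(0, 1), (1, 2), (2, 3)]
```

A layer wrapping the real darts-clone dictionary would presumably expose batched, tensor-in/tensor-out versions of these two lookups; the exact shape of that API is left open here.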