Use the hashing trick for HashSetGazetteer, HashMapStemmer and HashMapWordClusterer, replacing raw string keys with i32 keys
HashMapWordClusterer: try to load word clusters as u16, falling back to String for all values as soon as one value can't be converted to u16
In English, with the current word clusters included in the resources, this results in a constant 25MB reduction in memory usage.
For other languages without word clusters, the expected reduction is between 0.5MB and 1MB.
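To illustrate the hashing trick mentioned above, here is a minimal sketch, not the code from this PR. `hash_key` and `HashedGazetteer` are hypothetical names, and the use of the standard library's `DefaultHasher` truncated to 32 bits is an assumption made for the example. The point is that only the 4-byte key is kept in memory, at the cost of rare false positives when two strings collide.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Hypothetical helper: maps a string to an i32 key by truncating a 64-bit hash.
fn hash_key(value: &str) -> i32 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    // Keep the low 32 bits; distinct strings can collide on the same key.
    hasher.finish() as i32
}

/// Illustrative gazetteer that stores only hashed keys, never the strings.
struct HashedGazetteer {
    keys: HashSet<i32>,
}

impl HashedGazetteer {
    /// Builds the set by hashing every word once at load time.
    fn from_words<'a, I: IntoIterator<Item = &'a str>>(words: I) -> Self {
        HashedGazetteer {
            keys: words.into_iter().map(hash_key).collect(),
        }
    }

    /// Membership test: hash the query and look up the i32 key.
    fn contains(&self, value: &str) -> bool {
        self.keys.contains(&hash_key(value))
    }
}
```

A lookup then only hashes the query string, for example `HashedGazetteer::from_words(vec!["paris", "london"]).contains("paris")`. The same idea would apply to map-based structures by using the i32 key as the map key.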
Backward compatibility
The new implementation is backward compatible. Old word clusters, which are typically stored as hierarchical binary paths of the form "10001011001", can still be loaded. In this case, clusters will be loaded as strings.
New word clusters, introduced in https://github.com/snipsco/snips-nlu-language-resources/pull/33, will benefit from this improved implementation, as all clusters are u16-like.
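The u16-with-fallback loading described above could look roughly like the following sketch. The `Clusters` enum and the `load_clusters` function are hypothetical names introduced for illustration, and the sketch keeps plain `String` keys for readability rather than hashed i32 keys. It tries to parse every cluster value as `u16` and, as soon as one value fails, stores all values as strings instead.

```rust
use std::collections::HashMap;

/// Cluster values stored either compactly as u16 or as raw strings.
#[derive(Debug, PartialEq)]
enum Clusters {
    U16(HashMap<String, u16>),
    Str(HashMap<String, String>),
}

/// Hypothetical loader: `entries` are (word, cluster) pairs read from a resource file.
fn load_clusters(entries: &[(&str, &str)]) -> Clusters {
    // Collecting into a Result stops at the first value that is not a valid u16.
    let parsed: Result<HashMap<String, u16>, _> = entries
        .iter()
        .map(|(word, cluster)| cluster.parse::<u16>().map(|c| (word.to_string(), c)))
        .collect();

    match parsed {
        Ok(map) => Clusters::U16(map),
        // Fallback: keep every value as a String, including those that did parse.
        Err(_) => Clusters::Str(
            entries
                .iter()
                .map(|(word, cluster)| (word.to_string(), cluster.to_string()))
                .collect(),
        ),
    }
}
```

With this sketch, a value such as "10001011001" does not fit in a `u16`, so a file containing it is loaded entirely as strings, while a file whose values all lie between 0 and 65535 takes the compact path.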