googleinterns / amaranth

Apache License 2.0

Serialize Tokenization process #36

Closed tommylau-exe closed 4 years ago

tommylau-exe commented 4 years ago

Although the ML model itself can easily be serialized using TensorFlow functions, the TextVectorization layer we were using previously cannot be. A new method of serializing the tokenization step must be found, and it must be compatible with JavaScript.
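
For reference, the vocabulary can at least be pulled out of the layer in plain Python. This is a minimal sketch, assuming a standard Keras TextVectorization layer (depending on the TensorFlow version it may live under `tf.keras.layers.experimental.preprocessing`); the variable names are just illustrative:

```python
import tensorflow as tf

# Hypothetical layer; in the real pipeline it would be adapted on the training corpus.
vectorize_layer = tf.keras.layers.TextVectorization(max_tokens=10_000)
vectorize_layer.adapt(["example training text", "another example"])

# get_vocabulary() returns the learned tokens ordered by index, which is
# everything needed to reproduce the token -> index mapping elsewhere.
vocab = vectorize_layer.get_vocabulary()
word_to_index = {word: index for index, word in enumerate(vocab)}
```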

tommylau-exe commented 4 years ago

The two big options I see here are protobufs or JSON/pickling. Protobufs may be more robust and smaller in size, but they're also more difficult to set up. JSON/pickling would be a lot easier to implement, but may result in files that are too large (the entire input dictionary must be serialized). I'm going to start with pickling and see how it goes.
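
A rough sketch of the two serialization options being weighed (the dictionary here is a hypothetical stand-in for the real token-to-index mapping):

```python
import json
import pickle

# Hypothetical token -> index mapping; the real one comes from the tokenizer.
word_to_index = {"[UNK]": 1, "the": 2, "a": 3}

# Option 1: pickle -- binary, Python-only.
with open("vocab.pkl", "wb") as f:
    pickle.dump(word_to_index, f)

# Option 2: JSON -- plain text, readable natively from JavaScript.
with open("vocab.json", "w") as f:
    json.dump(word_to_index, f)
```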

tommylau-exe commented 4 years ago

As it turns out, the JSON file is over 100 kB smaller than the pickled file. Since pickled files are binary, I assumed the pickle would be smaller. Oh well, the JSON file is still relatively small at 491 kB, so JSON it is! It's also better for the Chrome extension, since it doesn't need to load any third-party libraries to read the file.
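
(For reference, the size comparison above can be reproduced with something like the following; the file names are the hypothetical ones from the earlier sketch.)

```python
import os

# Print the on-disk size of each serialized vocabulary in kB.
for path in ("vocab.pkl", "vocab.json"):
    print(f"{path}: {os.path.getsize(path) / 1024:.1f} kB")
```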