split metadata into multiple files

tensorlab / tensorfx

TensorFlow framework for training and serving machine learning models

Apache License 2.0


split metadata into multiple files #16

Open brandondutra opened 7 years ago

brandondutra commented 7 years ago

Having one file with all the vocabs can be a problem for large examples. I think this was a performance problem with a Criteo sample.

It would be nice to have a vocab file for each column. Then, if a "string to int" transform is needed for only a few categorical columns, the vocab for every column does not need to be loaded.
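A rough sketch of the per-column idea, in plain Python rather than TensorFlow. The `vocab_<column>.txt` naming, the one-token-per-line layout, and all function names here are assumptions for illustration, not what tensorfx actually does:

```python
import os
import tempfile


def write_column_vocabs(vocabs, out_dir):
    """Write one vocab file per column (hypothetical layout: one token per line)."""
    paths = {}
    for column, tokens in vocabs.items():
        path = os.path.join(out_dir, "vocab_%s.txt" % column)
        with open(path, "w", encoding="utf-8") as f:
            for token in tokens:
                f.write(token + "\n")
        paths[column] = path
    return paths


def load_column_vocab(out_dir, column):
    """Load only the requested column's vocab as a token -> index dict."""
    path = os.path.join(out_dir, "vocab_%s.txt" % column)
    with open(path, encoding="utf-8") as f:
        # The line number in the file is used as the integer id.
        return {line.rstrip("\n"): index for index, line in enumerate(f)}


def string_to_int(values, vocab, default=-1):
    """Map strings to ids; tokens missing from the vocab map to `default`."""
    return [vocab.get(value, default) for value in values]


# Two categorical columns, but only "color" needs the transform,
# so only its vocab file is ever read back.
out_dir = tempfile.mkdtemp()
write_column_vocabs(
    {"color": ["red", "green", "blue"], "country": ["us", "de"]}, out_dir)
color_vocab = load_column_vocab(out_dir, "color")
color_ids = string_to_int(["green", "purple"], color_vocab)
```

With this layout, the cost of loading scales with the columns that are actually transformed rather than with the total vocab size across all columns.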

nikhilk commented 7 years ago

Yes, agree.

I haven't fully grokked how vocab files work end-to-end, with respect to setting up a hashtable from a file so that it works at both training and prediction time, and how vocabs should be saved within a saved model. Perhaps this can be researched a bit, unless you already know...

brandondutra commented 7 years ago

The structured data package reads the vocab file and embeds it in the graph with index_table_from_tensor (but I think index_to_string_table_from_file would work fine). Because the vocab is embedded, the vocab file does not need to be saved with the exported graph.