ndvbd opened this issue 6 years ago
The TensorFlow Serving application uses gRPC; however, t2t can use the Google SDK to communicate with ML Engine (at least the code is there - I can't vouch that it works yet):
https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/serving/serving_utils.py#L88
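For reference, here is a minimal sketch of the kind of call that SDK path boils down to. This is not the t2t code itself; the project ID, model name, version and the "instances" payload are placeholders that depend on how your model was exported and deployed:

```python
# Minimal sketch (not the t2t code itself) of an online prediction request to
# Cloud ML Engine via the google-api-python-client. The project ID, model
# name, version and the "instances" payload are placeholders; the payload has
# to match the serving signature of the exported SavedModel.
from googleapiclient import discovery

def mlengine_predict(project, model, instances, version=None):
    service = discovery.build("ml", "v1")
    name = "projects/{}/models/{}".format(project, model)
    if version:
        name += "/versions/{}".format(version)
    response = service.projects().predict(
        name=name, body={"instances": instances}).execute()
    if "error" in response:
        raise RuntimeError(response["error"])
    return response["predictions"]

# Placeholder usage:
# predictions = mlengine_predict("my-gcp-project", "my_t2t_model",
#                                [{"inputs": "some input text"}])
```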
I am grateful that they are trying to make my life easier but I understand your confusion.
Looking at the code, it appears you need to set the --cloud_mlengine_model_name= flag when querying for data.
https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/serving/query.py#L43
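As a rough illustration of what that flag ultimately has to map to on the ML Engine side, here is a sketch of building the resource name the predict API expects. Only --cloud_mlengine_model_name comes from query.py; the other flags below are made up for the example:

```python
# Illustrative sketch only: turning the --cloud_mlengine_model_name flag into
# the fully qualified resource name that the ML Engine predict API expects.
# Only cloud_mlengine_model_name comes from query.py; the other two flags
# here are invented for the example.
import tensorflow as tf

flags = tf.flags
FLAGS = flags.FLAGS

flags.DEFINE_string("cloud_mlengine_model_name", "",
                    "Name of the model deployed on Cloud ML Engine.")
flags.DEFINE_string("cloud_mlengine_model_version", "",
                    "Optional model version (illustrative flag).")
flags.DEFINE_string("gcp_project", "",
                    "GCP project ID (illustrative flag).")

def mlengine_resource_name():
    name = "projects/{}/models/{}".format(
        FLAGS.gcp_project, FLAGS.cloud_mlengine_model_name)
    if FLAGS.cloud_mlengine_model_version:
        name += "/versions/{}".format(FLAGS.cloud_mlengine_model_version)
    return name
```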
Well, it works, but now I am getting: "Prediction server is out of memory, possibly because model size is too big." My SavedModel is only 237MB, though. Is there a way to take an already-trained model and either quantize the weights or change int64 to int32?
You can use the t2t-exporter executable to export your snapshot in preparation for serving. It decreased my model size by about 90%.
I already used tensor2tensor.serving.export. It took the model from 684MB down to 237MB (about a 65% decrease), but I need more. Is there a way to configure it to do quantization or to convert int64 to int32?
@ndvbd Hi, have you found a way to do quantization for tensor2tensor models?
Haven't had the time to deal with it unfortunately.
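If anyone wants to try: one option I'm aware of in TF 1.x (untested on t2t exports) is the graph_transforms "quantize_weights" transform applied to a frozen copy of the graph. The export directory, tag set, and input/output node names below are placeholders, and the resulting GraphDef would still have to be re-wrapped as a SavedModel before it can be served:

```python
# Rough sketch, untested on t2t models: shrink weight tensors with the
# "quantize_weights" transform from tensorflow.tools.graph_transforms.
# EXPORT_DIR and the input/output node names are placeholders that depend on
# how the model was exported (check them with saved_model_cli first).
import tensorflow as tf
from tensorflow.tools.graph_transforms import TransformGraph

EXPORT_DIR = "/path/to/exported_saved_model"  # placeholder
INPUT_NODES = ["serialized_example"]          # placeholder input node names
OUTPUT_NODES = ["outputs"]                    # placeholder output node names

# Load the SavedModel and freeze its variables into constants.
with tf.Session(graph=tf.Graph()) as sess:
    tf.saved_model.loader.load(
        sess, [tf.saved_model.tag_constants.SERVING], EXPORT_DIR)
    frozen = tf.graph_util.convert_variables_to_constants(
        sess, sess.graph.as_graph_def(), OUTPUT_NODES)

# Quantize the weights (stored as 8-bit, dequantized when the graph loads).
quantized = TransformGraph(
    frozen, INPUT_NODES, OUTPUT_NODES, ["quantize_weights"])

tf.train.write_graph(quantized, "/tmp", "quantized_graph.pb", as_text=False)
```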
Serving locally using tensorflow_model_server works fine. I've put an exported model/version on Google Cloud ML Engine. The question is: how do I set query.py to use a remote server instead of a local one? The relevant function in query.py defines the local port and host (which can be remote).
I believe it uses gRPC. Can Cloud ML use gRPC? If not, and we must use JSON, where in the t2t code can I set it to send data in JSON format to Cloud ML and parse the JSON response?
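For what it's worth: as far as I know, Cloud ML Engine's online prediction endpoint speaks JSON over HTTPS (see the googleapiclient sketch earlier in the thread), while a local tensorflow_model_server speaks gRPC. Here is a minimal gRPC sketch against a local server; the address, servable name, signature name and input tensor key are placeholders:

```python
# Minimal gRPC sketch against a local tensorflow_model_server. The server
# address, servable name, signature name and input tensor key are
# placeholders; t2t exports typically expect a serialized tf.Example, so
# check the actual signature with saved_model_cli before relying on this.
import grpc
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

channel = grpc.insecure_channel("localhost:9000")        # placeholder address
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

request = predict_pb2.PredictRequest()
request.model_spec.name = "my_t2t_model"                 # placeholder servable name
request.model_spec.signature_name = "serving_default"    # placeholder signature

# Placeholder payload: one serialized tf.Example with an int64 "inputs" feature.
example = tf.train.Example(features=tf.train.Features(feature={
    "inputs": tf.train.Feature(int64_list=tf.train.Int64List(value=[1, 2, 3])),
}))
request.inputs["input"].CopyFrom(
    tf.make_tensor_proto([example.SerializeToString()], dtype=tf.string))

response = stub.Predict(request, 10.0)  # 10-second timeout
print(response.outputs)
```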