riga / tfdeploy

Deploy tensorflow graphs for fast evaluation and export to tensorflow-less environments running numpy.
http://tfdeploy.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Does tfdeploy support multiple outputs in a graph? #20

Closed ugtony closed 7 years ago

ugtony commented 7 years ago

In the example, tfdeploy gets the result of y1 = W1*x1 + b1 via

result1 = y1.eval({x1: batch})

If I have a graph with two outputs, y2 = W2*(W1*x1 + b1) + b2 and y3 = W3*(W1*x1 + b1) + b3, in tensorflow I can use

sess.run([y2, y3])

to get y2 and y3 simultaneously while avoiding the redundant computation (of y1 = W1*x1 + b1).

Is it possible to do the same thing with tfdeploy, or do I have to use two separate calls like below?

result2 = y2.eval({x1: batch})
result3 = y3.eval({x1: batch})
riga commented 7 years ago

Hi @ugtony,

yes, this is also possible in tfdeploy, although the feature is quite hidden, as there's no such thing as a session object that can evaluate multiple tensors simultaneously.

Per eval invocation, the intermediate results of all dependent tensors and ops are cached.

The actual signature of eval is:

eval(feed_dict=None, _uuid=None)

_uuid is used for caching; it's initially set when None and then passed on to all dependent eval calls. So all you have to do is:

from uuid import uuid4

...

# share one uuid between the calls so cached intermediate results (e.g. y1) are reused
uuid = uuid4()
result2 = y2.eval({x1: batch}, uuid)
result3 = y3.eval({x1: batch}, uuid)

I've only tested this feature and never used it in production, so feedback is appreciated ;)

ugtony commented 7 years ago

Thanks! Glad to know there is a caching feature.

In my example, is it correct to simply call add(...) twice to create the tfdeploy model?

# setup tfdeploy (only when creating models)
...
# build your graph
...
y2 = tf.nn.softmax(tf.matmul(x1, W2) + b2, name="output2")
y3 = tf.nn.softmax(tf.matmul(x1, W3) + b3, name="output3")

# use add twice to create tfdeploy model
model = td.Model()
model.add(y2, sess)  # 1st add
model.add(y3, sess)  # 2nd add
model.save("model.pkl")
ugtony commented 7 years ago

I've tested the caching feature and it works.

However, it is much slower than the tensorflow CPU version. In my experiment, a fully convolutional neural network is applied to an image pyramid.

from uuid import uuid4

for scale in scales:
    layer = image_pyramid[scale]  # layer size changes with the scale
    uuid = uuid4()  # fresh uuid per scale, shared by both eval calls below
    o1 = out1.eval({input: layer}, uuid)
    o2 = out2.eval({input: layer}, uuid)
    print(layer.shape)  # shape is an attribute on numpy arrays, not a method
    print(o1.shape)

I guess the speed drop might be caused by 1) the frequent change of input size, or 2) the frequent use of the caching feature.

riga commented 7 years ago

In my example, is it correct to simply use the function add(...) twice to create the tfdeploy model?

Yep, that's correct. Overlaps between the two graphs are found automatically via tensor instance caching, so there's no need to worry about redundant computations.

If the two tensors you add to the model are somehow related (e.g. if y3 requires/depends on y2), it's also possible to only add the most general tensor (e.g. y3).
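
E.g. (a hypothetical sketch, assuming y3 is built on top of y2):

# if y3 depends on y2, adding y3 alone is enough,
# since y2 is part of y3's dependency graph
model = td.Model()
model.add(y3, sess)
model.save("model.pkl")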

I've tested the caching feature and it works. However, it is much slower than the tensorflow CPU version.

Do you mean the caching or the convolution itself?

ugtony commented 7 years ago

Please ignore my guesses.

I measured the running time. The caching feature works; the second eval is fast compared to the first (0.1 s vs. 2.9 s in my case).

But the operations using tfdeploy are about 10 times slower than tensorflow: tfdeploy took 3 seconds on 150x150 images while tensorflow took only 0.3 seconds. I've already turned on the scipy optimization feature while converting the model. The fully convolutional neural network I used is the P-Net described in "Joint Face Detection and Alignment using Multi-task Cascaded Convolutional Networks", without the facial landmark localization component.

riga commented 7 years ago

Yep.

tensorflow has the advantage of being fully backed by a customized & optimized C++ backend that performs all heavy operations.

tfdeploy, on the other hand, essentially relies on bare numpy operations, which sometimes have to be combined to exactly resemble the behavior of tensorflow; conv and pooling ops are good examples. The drawback is that these combinations are implemented and executed in Python. And sometimes even numpy functions aren't completely backed by equivalent C++ functions, but instead use several Python calls to achieve the desired functionality.

Concerning the tfdeploy conv and pooling ops: I have one or two ideas that might improve the performance. And maybe it's worth looking into scipy's convolve, but this will also require some preprocessing, e.g. to ensure the same padding rules.
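
Just to sketch the padding point (my own illustration, not tfdeploy code; conv2d_single_channel and the stride-1, odd-kernel assumptions are mine): tf.nn.conv2d is actually a cross-correlation, so a single-channel case could map to scipy like this:

import numpy as np
from scipy.signal import correlate2d

def conv2d_single_channel(x, k, padding="VALID"):
    # single-channel, stride-1 analogue of tf.nn.conv2d (odd kernel sizes only);
    # tf.nn.conv2d performs cross-correlation, hence correlate2d, not convolve2d
    if padding == "SAME":
        # zero-pad so the "valid" correlation yields the input's spatial size
        ph, pw = k.shape[0] // 2, k.shape[1] // 2
        x = np.pad(x, ((ph, ph), (pw, pw)), mode="constant")
    return correlate2d(x, k, mode="valid")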