Open shoma88 opened 6 years ago
Check #3270 for the discussion.
I have already checked that and searched other forums. My problem is a little different: he gets a small loss of performance (0.06 s instead of 0.03 s), but I get a huge one (1.5 s instead of 0.03 s). I'm pretty new to TensorFlow, so I don't know whether I need to configure something to get good performance with the GPU. My version of TensorFlow supports GPUs, and I can see in the command prompt that TensorFlow is using it. With GPU Shark I can also see that it uses almost all of the GPU's free memory.
@shoma88 Try enabling device placement logging to see where each op actually runs:
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
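A quick, generic TF 1.x check to confirm TensorFlow can see the GPU at all, complementary to the placement log above:
from tensorflow.python.client import device_lib

# A GPU should show up as "/device:GPU:0" in this list.
print(device_lib.list_local_devices())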
I'm experiencing the same thing. Running the object detection tutorial out of the box gives me 2-3 seconds per image instead of the suggested 20-30 ms, so it's a whopping 100x slower. My TensorFlow was built with GPU support, running on a GTX 980.
When I add log_device_placement=True, I can see additional output in the Jupyter console:
2018-06-16 08:28:04.870455: I tensorflow/core/common_runtime/placer.cc:886] SecondStagePostprocessor/BatchMultiClassNonMaxSuppression/map/TensorArrayStack/range/delta: (Const)/job:localhost/replica:0/task:0/device:GPU:0
SecondStagePostprocessor/BatchMultiClassNonMaxSuppression/map/TensorArrayStack_1/range/start: (Const): /job:localhost/replica:0/task:0/device:GPU:0
2018-06-16 08:28:04.870463: I tensorflow/core/common_runtime/placer.cc:886] SecondStagePostprocessor/BatchMultiClassNonMaxSuppression/map/TensorArrayStack_1/range/start: (Const)/job:localhost/replica:0/task:0/device:GPU:0
SecondStagePostprocessor/BatchMultiClassNonMaxSuppression/map/TensorArrayStack_1/range/delta: (Const): /job:localhost/replica:0/task:0/device:GPU:0
2018-06-16 08:28:04.870472: I tensorflow/core/common_runtime/placer.cc:886] SecondStagePostprocessor/BatchMultiClassNonMaxSuppression/map/TensorArrayStack_1/range/delta: (Const)/job:localhost/replica:0/task:0/device:GPU:0
SecondStagePostprocessor/BatchMultiClassNonMaxSuppression/map/TensorArrayStack_2/range/start: (Const): /job:localhost/replica:0/task:0/device:GPU:0
I'm confused about #3270. Is this really the same issue? There, only the first iteration was on the order of seconds. Could it be that the demo code also changed for the worse, now slowing down every iteration instead of only the first one?
This sounds like it could be related to general performance tuning, rather than object detection in particular, and you might have better luck getting help on Stack Overflow. That said, this looks related to #4266 -- @derekjchow , any ideas there or here?
The tutorial was written in a way that creates one new TF session per image inference. Not good.
I wrote a new function that can run inference on multiple images without creating a new TF session. Find it here. On a GTX 1060 I can reach 6~12 FPS, depending on which model I'm using.
But for general tuning tips, I could not find good articles for object detection, so I requested one (issue #4495).
I have the same issue. Training with a batch of 8 on a Faster R-CNN ResNet-50 takes 0.7 s/step, but running inference and timing session.run(...) yields a consistent 2.2 s per image. It does load the graph onto the GPU! This was not an issue before! What is going on?
@dennywangtenk's solution works for me.
I was creating a new session every time, like this:
import time

import numpy as np
import tensorflow as tf
from PIL import Image

graph_global = None
TENSOR = None
NUM_ITERACOES = 10
BATCH = 1
SOMA = 0

def loadModel():
    global graph_global
    # Load the protobuf file from the disk and parse it to retrieve the
    # unserialized graph_def
    with tf.gfile.GFile("PATH_TO_PB_FILE/0123456789.pb", "rb") as f:
        graph_def = tf.GraphDef()
        graph_def.ParseFromString(f.read())
    # Then, we import the graph_def into a new Graph and return it
    with tf.Graph().as_default() as graph:
        # The name var will prefix every op/node in your graph
        # Since we load everything in a new graph, this is not needed
        tf.import_graph_def(graph_def, name="prefix")
    # Make the graph global
    graph_global = graph

# Do the inference
def inferencia():
    global BATCH
    global NUM_ITERACOES
    global SOMA
    for i in range(10):
        # Load a single image
        image = Image.open("PATH_TO_IMGS/" + str(i) + ".png")
        # Reshape and transform the image into a np array
        (im_width, im_height) = image.size
        image = np.array(image.getdata()).reshape((im_height, im_width, 3)).astype(np.uint8)
        # Run the inference and measure the time
        x = graph_global.get_tensor_by_name('prefix/image_tensor:0')
        y = graph_global.get_tensor_by_name('prefix/detection_boxes:0')
        # A new session is created for every image; this is the slow part
        with tf.Session(graph=graph_global) as sess:
            start = time.time()
            res = sess.run(y, feed_dict={x: np.expand_dims(image, 0)})
            end = time.time()
            print(end - start)

if __name__ == '__main__':
    loadModel()
    print("Inference time: ")
    inferencia()
And I got high processing time for each image:
3.3824450969696045
3.4309332370758057
3.384895086288452
3.354207754135132
3.3493900299072266
3.5104100704193115
3.4564030170440674
3.632082939147949
3.402231216430664
3.301783561706543
So I moved the with tf.Session(graph=graph_global) as sess: line out of the loop, like this:
import time

import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
from PIL import Image

graph_global = None
TENSOR = None
NUM_ITERACOES = 10
BATCH = 1
SOMA = 0

def loadModel():
    global graph_global
    # Load the protobuf file from the disk and parse it to retrieve the
    # unserialized graph_def
    with tf.gfile.GFile("PATH_TO_PB_FILE/0123456789.pb", "rb") as f:
        graph_def = tf.GraphDef()
        graph_def.ParseFromString(f.read())
    # Then, we import the graph_def into a new Graph and return it
    with tf.Graph().as_default() as graph:
        # The name var will prefix every op/node in your graph
        # Since we load everything in a new graph, this is not needed
        tf.import_graph_def(graph_def, name="prefix")
    # Make the graph global
    graph_global = graph

# Do the inference, reusing the session passed in
def inferencia(sess):
    global BATCH
    global NUM_ITERACOES
    global SOMA
    for i in range(10):
        # inserirImagens(NUM_IMAGES)
        # Load a single image
        image = Image.open("PATH_TO_IMGS/" + str(i) + ".png")
        # Reshape and transform the image into a np array
        (im_width, im_height) = image.size
        image = np.array(image.getdata()).reshape((im_height, im_width, 3)).astype(np.uint8)
        # Run the inference and measure the time
        start = time.time()
        res = sess.run(y, feed_dict={x: np.expand_dims(image, 0)})
        end = time.time()
        total_time = end - start
        SOMA = SOMA + total_time
        print(end - start)
        plt.imshow(image)
        # plt.show()

if __name__ == '__main__':
    loadModel()
    x = graph_global.get_tensor_by_name('prefix/image_tensor:0')
    y = graph_global.get_tensor_by_name('prefix/detection_boxes:0')
    print("Time required: ")
    # The session is created once, outside the loop, and reused for every image
    with tf.Session(graph=graph_global) as sess:
        print("Inference time: ")
        inferencia(sess)
And finally I got low time values:
3.255561351776123
0.08460521697998047
0.08266162872314453
0.08443999290466309
0.10549187660217285
0.08379626274108887
0.086212158203125
0.08907151222229004
0.10163354873657227
0.08518075942993164
@daviduarte Do you know why only the first run takes so much longer?
It's the time needed for the system to set up resources, in this case initializing the CUDA cores and allocating memory.
Is there a TF trick to initialize the system resources before running the first inference, in something like a model-initialization step, so that the first image performs like the subsequent ones?
Any idea how to keep a session, with the graph already loaded, stored as an instance variable in a class? I implemented a class to wrap the whole inference using this approach:
class ObjectDetector(object):
    def __init__(self, model_name):
        self.model_name = model_name
        self.graph = tf.Graph()
        self.num_class = 1
        self.initialize_graph()
        self.initialize_labels()
        self.session = None

    def __del__(self):
        if self.session is not None:
            self.session.close()

    def run_inference_for_single_image(self, image):
        if self.session is None:
            with self.graph.as_default():
                self.session = tf.Session()
        # ......
        # use the session to predict
        # output_dict = self.session.run(.......
But after the session is used, the graph is not loaded. I had to move the rest of run_inference_for_single_image under the with self.graph.as_default() block to make it work, and as a consequence I have to reload the graph every time, which hurts processing time.
Is there a way around this, or am I misunderstanding how sessions and graphs work?
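A minimal sketch of the instance-variable approach being asked about, assuming the frozen graph has already been imported into a tf.Graph (the tensor names depend on the name= prefix used at import time; no prefix is assumed here). Binding the session to the graph with tf.Session(graph=...) keeps it alive between calls, so nothing has to be reloaded per image:
import numpy as np
import tensorflow as tf

class ObjectDetectorSketch(object):
    def __init__(self, graph):
        # graph: a tf.Graph that already contains the imported frozen model
        self.graph = graph
        # Create the session once, bound to the graph; no as_default() block is needed.
        self.session = tf.Session(graph=self.graph)
        # Tensor names assume the graph was imported with name="" (no prefix).
        self.image_tensor = self.graph.get_tensor_by_name('image_tensor:0')
        self.boxes = self.graph.get_tensor_by_name('detection_boxes:0')

    def run_inference_for_single_image(self, image):
        # Reuses the long-lived session; only the first call pays the start-up cost.
        return self.session.run(
            self.boxes, feed_dict={self.image_tensor: np.expand_dims(image, 0)})

    def close(self):
        self.session.close()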
@daviduarte A simple solution that I had never thought of: moving the session out of the loop. Genius. Now I can run real-time detection. Thank you :)
Hi guys, I am also new to running TF pretrained models and even to ML, but I am experiencing something very strange. Running inference for one image takes up to 30 seconds, so it must be my environment, since everyone is talking about milliseconds. Even though I have a system with a GeForce GTX 1060 and the environment (Windows 10, cuDNN 7.5.0) seems to be set up correctly (name: "/device:GPU:0" device_type: "GPU" memory_limit: 2213819187), it is almost unusable.
Towards the end of the session I see log entries like: Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.05GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. Adding visible gpu devices: 0. The problem is that most of the time is spent before it gets to these. I have attached the logs and I hope they help: TFLogs.txt
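Not a fix for the allocator warning itself, but when other processes already hold part of the GPU's memory, it sometimes helps to let TensorFlow allocate memory on demand instead of reserving almost all of it up front. A hedged TF 1.x sketch (this only changes allocation behaviour, it does not create more memory):
import tensorflow as tf

config = tf.ConfigProto()
# Allocate GPU memory incrementally instead of grabbing nearly all of it at start-up.
config.gpu_options.allow_growth = True

# Pass the config wherever the session is created,
# e.g. tf.Session(graph=detection_graph, config=config).
sess = tf.Session(config=config)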
Any idea how to keep a session, with the graph already loaded, stored as an instance variable in a class? [...] Is there a way around this, or am I misunderstanding how sessions and graphs work?
I managed to do what you asked, thanks to the useful insight of @daviduarte. You have to load the graph as a class variable upon initialization, like so:
def __init__(self):
    self.label_map = label_map_util.load_labelmap(PATH_TO_LABELS)
    self.categories = label_map_util.convert_label_map_to_categories(
        self.label_map, max_num_classes=NUM_CLASSES, use_display_name=True)
    self.category_index = label_map_util.create_category_index(self.categories)
    # Load a (frozen) Tensorflow model into memory.
    print(color.BOLD + color.RED + 'INIT GRAPH' + color.END)
    with tf.gfile.GFile(PATH_TO_CKPT, 'rb') as fid:
        graph_def = tf.GraphDef()
        graph_def.ParseFromString(fid.read())
    with tf.Graph().as_default() as graph:
        tf.import_graph_def(graph_def, name='prefix')
    self.detection_graph = graph
I also defined a method to perform the detection, something like this:
def detect(self, sess, image):
    image_np_expanded = np.expand_dims(image, axis=0)
    # Extract image tensor
    image_tensor = self.detection_graph.get_tensor_by_name('prefix/image_tensor:0')
    # Extract detection boxes
    boxes = self.detection_graph.get_tensor_by_name('prefix/detection_boxes:0')
    # Extract detection scores
    scores = self.detection_graph.get_tensor_by_name('prefix/detection_scores:0')
    # Extract detection classes
    classes = self.detection_graph.get_tensor_by_name('prefix/detection_classes:0')
    # Extract number of detections
    num_detections = self.detection_graph.get_tensor_by_name('prefix/num_detections:0')
    # Actual detection.
    start = time.time()
    (boxes, scores, classes, num_detections) = sess.run(
        [boxes, scores, classes, num_detections],
        feed_dict={image_tensor: image_np_expanded})
    end = time.time()
    # Visualization of the results of a detection.
    vis_util.visualize_boxes_and_labels_on_image_array(
        image,
        np.squeeze(boxes),
        np.squeeze(classes).astype(np.int32),
        np.squeeze(scores),
        self.category_index,
        use_normalized_coordinates=True,
        line_thickness=8)
    print(str(end - start))
    cv2.imshow('object detection', image)
I actually get the frames from a Kinect camera, which is triggered inside the detect() method, but I think it's the same if you pass the image in from outside (whether it is a frame from a camera or a saved image doesn't matter much at this point).
The real trick is performed in main():
def main():
    updater = ObjectDetector()
    with tf.Session(graph=updater.detection_graph) as sess:
        while True:
            # In my setup the Kinect frame is grabbed inside detect();
            # pass an image here instead if your detect() expects one.
            updater.detect(sess)
You have to define the session outside every kind of loop, and perform the checks to quit the program inside the inner while loop.
Hope it helps!
Thank you so much! I will try it very soon!
Best regards,
Adriano!
@daviduarte do you have any idea how to code this with fast inference in C++?
Is there a TF trick to initialize the system resources before running the first inference, in something like a model-initialization step, so that the first image performs like the subsequent ones?
@svencowart Have you found a good solution to this question?
An obvious hack would be to immediately perform a forward pass with an array of zeros right after loading the model. Ugly, but it works.
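A sketch of that warm-up hack, reusing the x/y tensors and the sess from the snippets above; the shape of the dummy batch is arbitrary as long as the exported model includes an image resizer (an assumption about the pipeline):
import numpy as np

# One dummy forward pass right after creating the session, before any real images;
# it triggers CUDA initialization and memory allocation so the first real image is fast.
warmup_image = np.zeros((1, 300, 300, 3), dtype=np.uint8)
sess.run(y, feed_dict={x: warmup_image})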
The first run is slow, the rest are fine.
Inference time @ loop 0: 28.139413833618164 sec
Inference time @ loop 1: 0.45125246047973633 sec
Inference time @ loop 2: 0.4427633285522461 sec
Inference time @ loop 3: 0.436007022857666 sec
Inference time @ loop 4: 0.413358211517334 sec
Inference time @ loop 5: 0.41444969177246094 sec
Inference time @ loop 6: 0.40434837341308594 sec
Inference time @ loop 7: 0.4182426929473877 sec
Inference time @ loop 8: 0.42828917503356934 sec
Inference time @ loop 9: 0.4369192123413086 sec
Total inference time: 31.987273454666138 sec for 10 images (faster_rcnn_inception_resnet_v2_atrous_oid_v4_2018_12_12)
The process of converting a PIL image object to a numpy array with image = np.array(image.getdata()).reshape((im_height, im_width, 3)).astype(np.uint8) may itself be a performance bottleneck of inference. See my answer on Stack Overflow: https://stackoverflow.com/a/57716322/6437725
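For reference, a minimal before/after of that conversion (the exact speed-up depends on the image size): np.asarray reads the pixel buffer directly instead of materialising every pixel through getdata():
import numpy as np
from PIL import Image

image = Image.open("PATH_TO_IMGS/0.png").convert("RGB")
(im_width, im_height) = image.size

# Slow: builds a Python-level sequence of pixels, then reshapes it.
slow_np = np.array(image.getdata()).reshape((im_height, im_width, 3)).astype(np.uint8)

# Faster: reads the underlying pixel buffer directly into an (H, W, 3) uint8 array.
fast_np = np.asarray(image)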
What if the model is deployed with TensorFlow Serving on a k8s cluster? How can the inference time be improved then?
I am using a retrained SSD model, containerized it with the TensorFlow Serving image, and deployed it onto a k8s cluster. The CUDA 9 / K80 GPU environment is set up on the k8s node. The model is reachable at its IP endpoint, and it performs inference for a single image in 2-4 seconds. On the client side, I convert an image to a numpy array, pass it to the model using requests.post(), and print the model response as JSON.
import numpy
import requests
from PIL import Image

image = Image.open('image.jpeg')
image_np = numpy.array(image)
payload = {"instances": [image_np.tolist()]}
res = requests.post("http://MODEL_IP_ENDPOINT:predict", json=payload)
print(res.json())
Since there is no post-processing done, what can be improved here? Any help is much appreciated. Thanks.
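One thing that often helps with the TF Serving REST API is avoiding the huge nested float list. If the SavedModel was exported with an encoded_image_string_tensor input (the object detection exporter supports that input type; whether your export used it is an assumption here), the JPEG bytes can be sent base64-encoded, which shrinks the payload and the JSON parsing cost considerably. A sketch with placeholder host and model name:
import base64
import requests

with open("image.jpeg", "rb") as f:
    jpeg_bytes = f.read()

# TensorFlow Serving's REST API accepts binary inputs as {"b64": "<base64 string>"}.
payload = {"instances": [{"b64": base64.b64encode(jpeg_bytes).decode("utf-8")}]}

res = requests.post("http://MODEL_IP_ENDPOINT:8501/v1/models/MODEL_NAME:predict",
                    json=payload)
print(res.json())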
System information
Exact command to reproduce: object_detection_tutorial.py from models-master\research\object_detection. I added import time at the top and the following lines to print the processing time:
start_time = time.time()
# Run inference
output_dict = sess.run(tensor_dict, feed_dict={image_tensor: np.expand_dims(image, 0)})
print("--- %s seconds ---" % (time.time() - start_time))
You can collect some of this information using our environment capture script:
https://github.com/tensorflow/tensorflow/tree/master/tools/tf_env_collect.sh
You can obtain the TensorFlow version with
python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"
Describe the problem
The processing time is very high. Using the CPU I get 1.5 seconds per image; using the GPU I get an even higher time (1.7 seconds). I get bad performance with the standard model (dogs and people on the beach) and also with a retrained model. I have the same problem with the CIFAR-10 example: performance during training seems fine (7000 samples per second), but the processing time in the eval.py code is huge (9 seconds). Do you have any suggestions? I can find a lot of benchmarks about training time but not many about evaluation time, so I don't know how much time processing a single image should take.
thanks,
Adriano.
Source code / logs