triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Thread safety question about python grpcclient and server #2616

Closed LightToYang closed 3 years ago

LightToYang commented 3 years ago

I ran the grpcclient infer() method in a multi-threaded application (FastAPI), and sometimes the output results are the same when inputting different images. The mistake always occurs between adjacent inputs.

For example:

0001.jpg ==> 0001 result

0002.jpg ==> 0002 result  (same) 

0003.jpg ==> 0003 result

0004.jpg ==> 0002 result  (same)

I read #1856, which says the Python grpcclient infer() is thread-safe, so what is wrong with my application?

LightToYang commented 3 years ago

I use nvcr.io/nvidia/tritonserver:20.10-py3; does it contain the fix from #1427?

CoderHam commented 3 years ago

I use nvcr.io/nvidia/tritonserver:20.10-py3; does it contain the fix from #1427?

Yes, it does contain the fix from that PR. Can you share a minimal example of your client so we can reproduce the issue?

LightToYang commented 3 years ago

@CoderHam I realized I am actually using the Python client, which is unrelated to the C++ client fixed in #1427. Here is a minimal example of my client that reproduces the result.

I use the following code to get the 512-d face feature:

import numpy as np
import tritonclient.grpc

# Module-level client shared by every thread (URL is a placeholder)
triton_client = tritonclient.grpc.InferenceServerClient(url="localhost:8001")

def get_embedding(img_path):
    # Read the encoded image and wrap the raw bytes in a 1 x N uint8 array
    with open(img_path, "rb") as f:
        img = f.read()
    img_bytes = np.frombuffer(img, dtype=np.uint8)[None, :]
    results = pure_feature_infer(img_bytes)

    # L2-normalize the 512-d embedding
    embedding = results['embedding'][0]
    norm_embedding = embedding / np.sqrt(np.dot(embedding, embedding))
    return norm_embedding

def pure_feature_infer(
    image,
    max_length=64000,
    model_name='Feature',
    input_names=['DALI_INPUT'],
    output_names=['embedding']
):
    # Zero-pad each encoded image to a fixed max_length so every request
    # has the same [1, max_length] input shape
    image_post = image.copy()
    image_post = list(map(lambda img, ml=max_length: np.pad(img, (0, ml - img.shape[0])), image_post))
    image_post = np.stack(image_post)

    input_shape = [1, max_length]
    inputs = []
    for input_name in input_names:
        inputs.append(tritonclient.grpc.InferInput(input_name, input_shape, "UINT8"))
    inputs[0].set_data_from_numpy(image_post)
    outputs = []
    for output_name in output_names:
        outputs.append(tritonclient.grpc.InferRequestedOutput(output_name))

    # Synchronous inference on the shared module-level client
    results = triton_client.infer(
        model_name=model_name,
        inputs=inputs,
        outputs=outputs
    )
    output_results = {}
    for output_name in output_names:
        output_results[output_name] = results.as_numpy(output_name)
    return output_results

LightToYang commented 3 years ago

Then I use a thread pool to simulate a high-concurrency situation:

from concurrent.futures import ThreadPoolExecutor, as_completed
import os

thread_pool = ThreadPoolExecutor(20)
all_task = []
embedding_list = []
for img_path in img_path_list:
    filepath, tmpfilename = os.path.split(img_path)
    shotname, extension = os.path.splitext(tmpfilename)
    # print(filepath, tmpfilename, shotname, extension)

    all_task.append(thread_pool.submit(get_embedding, img_path))

for future in as_completed(all_task):
    norm_embedding = future.result()
    embedding_list.append(norm_embedding)

LightToYang commented 3 years ago

Comparing each face feature against all the face features:

def check_all_data(embedding_array):
    def np_cosine(x, y):
        # Embeddings are L2-normalized, so the inner product is the cosine
        # similarity; rescale from [-1, 1] into [0, 1]
        return np.inner(x, y) * 0.5 + 0.5

    total_num = 0
    unmatch_num = 0

    # Every embedding should be most similar to itself; any other argmax
    # means two different images produced (near-)identical embeddings
    for i, embedding in enumerate(embedding_array):
        sim = np_cosine(embedding, embedding_array)
        index = np.argmax(sim)
        total_num += 1
        if i != index:
            unmatch_num += 1
    print(f'{unmatch_num}/{total_num}')

embedding_array = np.array(embedding_list, dtype=np.float32)
check_all_data(embedding_array)

However, I get a lot of repeated 512-d embeddings:

embedding_array: (11190, 512)
unmatch_num/total_num: 233/11190

I think it is related to thread-unsafety somewhere in Triton, because everything is fine when running single-threaded:

import glob

img_path_list = glob.glob(f'{dir_path}/*jpg')
embedding_list = []
for i, img_path in enumerate(img_path_list):
    norm_embedding = get_embedding(img_path)
    embedding_list.append(norm_embedding)

embedding_array: (11190, 512)
unmatch_num/total_num: 0/11190

LightToYang commented 3 years ago

from concurrent.futures import ProcessPoolExecutor, as_completed

embedding_list = []
with ProcessPoolExecutor(max_workers=10) as executor:
    futures = []
    for img_path in img_path_list:
        job = executor.submit(get_embedding, img_path)
        futures.append(job)
    for job in as_completed(futures):
        try:
            norm_embedding = job.result()
            embedding_list.append(norm_embedding)
        except Exception as e:
            print(e)

I replaced the thread pool with a process pool and got results like:

(11190, 512)
69/11190

Does that mean the duplicated return values come from the server rather than the client? @tanmayv25 By the way, with the above process pool code, I sometimes get a Segmentation fault (core dumped) error.
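
One way to rule out fork-related client sharing as the cause of the segfault is to create the gRPC client inside each worker process instead of inheriting the module-level one across fork(). A minimal sketch, assuming the same tritonclient.grpc API as above (init_worker and the URL are placeholder names, not part of my original code):

from concurrent.futures import ProcessPoolExecutor
import tritonclient.grpc

triton_client = None  # re-created inside every worker process

def init_worker(url="localhost:8001"):
    # Runs once in each worker process, so no process inherits a
    # gRPC channel that was created before fork()
    global triton_client
    triton_client = tritonclient.grpc.InferenceServerClient(url=url)

with ProcessPoolExecutor(max_workers=10, initializer=init_worker) as executor:
    futures = [executor.submit(get_embedding, p) for p in img_path_list]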

LightToYang commented 3 years ago

This is my config.pbtxt, using the DALI, TensorRT, and ONNX backends for pre-processing, the network, and post-processing respectively. I wonder whether something is wrong with one of these backends?

name: "Feature"
platform: "ensemble"
max_batch_size: 0
input [
  {
    name: "DALI_INPUT"
    data_type: TYPE_UINT8
    dims: [1, -1]
  }
]
output [
  {
    name: "embedding"
    data_type: TYPE_FP32
    dims: [1, 512]
  }
]
ensemble_scheduling {
  step [
    {
      model_name: "Feature-Preprocess"
      model_version: 1
      input_map {
        key: "DALI_INPUT"
        value: "DALI_INPUT"
      }
      output_map {
        key: "DALI_OUTPUT"
        value: "DALI_OUTPUT"
      }
    },
    {
      model_name: "Feature-Net"
      model_version: 1
      input_map {
        key: "DALI_OUTPUT"
        value: "DALI_OUTPUT"
      }
      output_map {
        key: "fc1"
        value: "fc1"
      }
    },
    {
      model_name: "Feature-Post"
      model_version: 1
      input_map {
        key: "fc1"
        value: "fc1"
      }
      output_map {
        key: "embedding"
        value: "embedding"
      }
    }
  ]
}

LightToYang commented 3 years ago

triton-inference-server/dali_backend#39

banasraf commented 3 years ago

Hello @LightToYang, you mentioned that sometimes you get Segmentation fault. Does it happen on the client side, or the server side? Also, could you try creating a separate triton client instance for each process/thread to make sure that the thread-safety of the grpc client isn't a problem here?
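
For instance, a minimal sketch of a per-thread client, assuming the same tritonclient.grpc API used earlier in this thread (get_client and the URL are placeholder names):

import threading
import tritonclient.grpc

_local = threading.local()

def get_client(url="localhost:8001"):
    # Lazily create one InferenceServerClient per thread instead of
    # sharing a single module-level client across all workers
    if not hasattr(_local, "client"):
        _local.client = tritonclient.grpc.InferenceServerClient(url=url)
    return _local.client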

deadeyegoodwin commented 3 years ago

Closing. Reopen with additional information if issue is not resolved.