pytorch / serve

Serve, optimize and scale PyTorch models in production
https://pytorch.org/serve/
Apache License 2.0

TensorRT example with torch.compile #3203

Closed: agunapal closed this 3 days ago

agunapal commented 6 days ago

Description

Adds an example showing how to serve a model compiled with torch.compile using the TensorRT backend.

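For context, the compilation path the example exercises looks roughly like the minimal standalone sketch below. This is illustrative only and is not the example's handler code; the ResNet-50 model and fp32 precision are chosen to match the res50-trt CompilationSettings shown in the logs further down.

```python
import torch
import torch_tensorrt  # noqa: F401 -- importing registers the "tensorrt" torch.compile backend
import torchvision.models as models

# Illustrative: ResNet-50 in eval mode on GPU, as served by the res50-trt example.
model = models.resnet50(weights="DEFAULT").eval().cuda()

# Compile with the TensorRT backend; fp32 matches the enabled_precisions
# reported in the worker log's CompilationSettings below.
trt_model = torch.compile(
    model,
    backend="tensorrt",
    options={"enabled_precisions": {torch.float32}},
)

with torch.no_grad():
    # The first call triggers the TensorRT engine build (the ~14 s seen in the logs);
    # subsequent calls reuse the compiled engine.
    out = trt_model(torch.randn(1, 3, 224, 224, device="cuda"))
print(out.shape)  # torch.Size([1, 1000])
```
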
Fixes #(issue)

Type of change

New feature (adds an example)

Feature/Issue validation/testing

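The worker logs below were captured while serving the example model as res50-trt and sending it a single prediction request. For reference, a minimal client sketch along these lines produces such a request (the image file name is illustrative; the PUT route, default inference port, and model name follow the access log entry below):

```python
import requests

# Hypothetical client call against a local TorchServe inference endpoint
# (default port 8080) serving the example model under the name "res50-trt".
with open("kitten_small.jpg", "rb") as f:
    resp = requests.put("http://127.0.0.1:8080/predictions/res50-trt", data=f)

print(resp.status_code)  # 200 on success, as in the access log below
print(resp.json())       # top-5 class probabilities, as shown after the logs
```
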
2024-06-22T18:40:15,859 [INFO ] epollEventLoopGroup-3-1 TS_METRICS - ts_inference_requests_total.Count:1.0|#model_name:res50-trt,model_version:default|#hostname:ip-172-31-4-205,timestamp:1719081615
2024-06-22T18:40:15,861 [DEBUG] W-9000-res50-trt_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req.cmd PREDICT repeats 1 to backend at: 1719081615861
2024-06-22T18:40:15,861 [INFO ] W-9000-res50-trt_1.0 org.pytorch.serve.wlm.WorkerThread - Looping backend response at: 1719081615861
2024-06-22T18:40:15,863 [INFO ] W-9000-res50-trt_1.0-stdout MODEL_LOG - Backend received inference at: 1719081615
2024-06-22T18:40:15,873 [INFO ] W-9000-res50-trt_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - result=[METRICS]ts_handler_preprocess.Milliseconds:10.249504089355469|#ModelName:res50-trt,Level:Model|#type:GAUGE|#hostname:ip-172-31-4-205,1719081615,7a5845b4-cbfb-4a97-bbc4-6f7900a1870f, pattern=[METRICS]
2024-06-22T18:40:16,795 [INFO ] W-9000-res50-trt_1.0-stdout MODEL_LOG - Using Default Torch-TRT Runtime (as requested by user)
2024-06-22T18:40:16,796 [INFO ] W-9000-res50-trt_1.0-stdout MODEL_LOG - Device not specified, using Torch default current device - cuda:0. If this is incorrect, please specify an input device, via the device keyword.
2024-06-22T18:40:16,796 [INFO ] W-9000-res50-trt_1.0-stdout MODEL_LOG - Compilation Settings: CompilationSettings(enabled_precisions={<dtype.f32: 7>}, debug=False, workspace_size=0, min_block_size=5, torch_executed_ops=set(), pass_through_build_failures=False, max_aux_streams=None, version_compatible=False, optimization_level=None, use_python_runtime=False, truncate_double=False, use_fast_partitioner=True, enable_experimental_decompositions=False, device=Device(type=DeviceType.GPU, gpu_id=0), require_full_compilation=False, disable_tf32=False, assume_dynamic_shape_support=False, sparse_weights=False, refit=False, engine_capability=<EngineCapability.STANDARD: 1>, num_avg_timing_iters=1, dla_sram_size=1048576, dla_local_dram_size=1073741824, dla_global_dram_size=536870912, dryrun=False, hardware_compatible=False)
2024-06-22T18:40:16,796 [INFO ] W-9000-res50-trt_1.0-stdout MODEL_LOG - 
2024-06-22T18:40:18,108 [INFO ] W-9000-res50-trt_1.0-stdout MODEL_LOG - Node _param_constant0 of op type get_attr does not have metadata. This could sometimes lead to undefined behavior.
2024-06-22T18:40:18,109 [INFO ] W-9000-res50-trt_1.0-stdout MODEL_LOG - Some nodes do not have metadata (shape and dtype information). This could lead to problems sometimes if the graph has PyTorch and TensorRT segments.
2024-06-22T18:40:18,762 [INFO ] W-9000-res50-trt_1.0-stdout MODEL_LOG - [MemUsageChange] Init CUDA: CPU +1, GPU +0, now: CPU 398, GPU 568 (MiB)
2024-06-22T18:40:20,965 [INFO ] W-9000-res50-trt_1.0-stdout MODEL_LOG - [MemUsageChange] Init builder kernel library: CPU +1621, GPU +290, now: CPU 2167, GPU 858 (MiB)
2024-06-22T18:40:22,096 [INFO ] W-9000-res50-trt_1.0-stdout MODEL_LOG - TRT INetwork construction elapsed time: 0:00:01.094557
2024-06-22T18:40:22,113 [INFO ] W-9000-res50-trt_1.0-stdout MODEL_LOG - Global timing cache in use. Profiling results in this builder pass will be stored.

2024-06-22T18:40:35,970 [INFO ] W-9000-res50-trt_1.0-stdout MODEL_LOG - Detected 1 inputs and 1 output network tensors.
2024-06-22T18:40:36,566 [INFO ] W-9000-res50-trt_1.0-stdout MODEL_LOG - Total Host Persistent Memory: 363840
2024-06-22T18:40:36,567 [INFO ] W-9000-res50-trt_1.0-stdout MODEL_LOG - Total Device Persistent Memory: 6656
2024-06-22T18:40:36,567 [INFO ] W-9000-res50-trt_1.0-stdout MODEL_LOG - Total Scratch Memory: 524800
2024-06-22T18:40:36,567 [INFO ] W-9000-res50-trt_1.0-stdout MODEL_LOG - [BlockAssignment] Started assigning block shifts. This will take 97 steps to complete.
2024-06-22T18:40:36,568 [INFO ] W-9000-res50-trt_1.0-stdout MODEL_LOG - [BlockAssignment] Algorithm ShiftNTopDown took 2.07453ms to assign 5 blocks to 97 nodes requiring 7326208 bytes.
2024-06-22T18:40:36,569 [INFO ] W-9000-res50-trt_1.0-stdout MODEL_LOG - Total Activation Memory: 7325696
2024-06-22T18:40:36,569 [INFO ] W-9000-res50-trt_1.0-stdout MODEL_LOG - Total Weights Memory: 112691616
2024-06-22T18:40:36,574 [INFO ] W-9000-res50-trt_1.0-stdout MODEL_LOG - Engine generation completed in 14.4618 seconds.
2024-06-22T18:40:36,575 [INFO ] W-9000-res50-trt_1.0-stdout MODEL_LOG - [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 9 MiB, GPU 115 MiB
2024-06-22T18:40:36,640 [INFO ] W-9000-res50-trt_1.0-stdout MODEL_LOG - [MemUsageStats] Peak memory usage during Engine building and serialization: CPU: 4113 MiB
2024-06-22T18:40:36,644 [INFO ] W-9000-res50-trt_1.0-stdout MODEL_LOG - Serialized 26 bytes of code generator cache.
2024-06-22T18:40:36,645 [INFO ] W-9000-res50-trt_1.0-stdout MODEL_LOG - Serialized 320 timing cache entries
2024-06-22T18:40:36,645 [INFO ] W-9000-res50-trt_1.0-stdout MODEL_LOG - Build TRT engine elapsed time: 0:00:14.548270
2024-06-22T18:40:36,645 [INFO ] W-9000-res50-trt_1.0-stdout MODEL_LOG - TRT Engine uses: 114533172 bytes of Memory
2024-06-22T18:40:37,268 [WARN ] W-9000-res50-trt_1.0-stderr MODEL_LOG - WARNING: [Torch-TensorRT] - Using default stream in enqueueV3() may lead to performance issues due to additional calls to cudaStreamSynchronize() by TensorRT to ensure correct synchronization. Please use non-default stream instead.
2024-06-22T18:40:37,270 [INFO ] W-9000-res50-trt_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - result=[METRICS]ts_handler_inference.Milliseconds:21395.96484375|#ModelName:res50-trt,Level:Model|#type:GAUGE|#hostname:ip-172-31-4-205,1719081637,7a5845b4-cbfb-4a97-bbc4-6f7900a1870f, pattern=[METRICS]
2024-06-22T18:40:37,308 [INFO ] W-9000-res50-trt_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - result=[METRICS]ts_handler_postprocess.Milliseconds:38.030784606933594|#ModelName:res50-trt,Level:Model|#type:GAUGE|#hostname:ip-172-31-4-205,1719081637,7a5845b4-cbfb-4a97-bbc4-6f7900a1870f, pattern=[METRICS]
2024-06-22T18:40:37,308 [INFO ] W-9000-res50-trt_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - result=[METRICS]HandlerTime.Milliseconds:21445.23|#ModelName:res50-trt,Level:Model|#type:GAUGE|#hostname:ip-172-31-4-205,1719081637,7a5845b4-cbfb-4a97-bbc4-6f7900a1870f, pattern=[METRICS]
2024-06-22T18:40:37,309 [INFO ] W-9000-res50-trt_1.0-stdout MODEL_METRICS - HandlerTime.ms:21445.23|#ModelName:res50-trt,Level:Model|#hostname:ip-172-31-4-205,requestID:7a5845b4-cbfb-4a97-bbc4-6f7900a1870f,timestamp:1719081637
2024-06-22T18:40:37,309 [INFO ] W-9000-res50-trt_1.0 org.pytorch.serve.wlm.BatchAggregator - Sending response for jobId 7a5845b4-cbfb-4a97-bbc4-6f7900a1870f
2024-06-22T18:40:37,309 [INFO ] W-9000-res50-trt_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - result=[METRICS]PredictionTime.Milliseconds:21445.32|#ModelName:res50-trt,Level:Model|#type:GAUGE|#hostname:ip-172-31-4-205,1719081637,7a5845b4-cbfb-4a97-bbc4-6f7900a1870f, pattern=[METRICS]
2024-06-22T18:40:37,309 [INFO ] W-9000-res50-trt_1.0-stdout MODEL_METRICS - PredictionTime.ms:21445.32|#ModelName:res50-trt,Level:Model|#hostname:ip-172-31-4-205,requestID:7a5845b4-cbfb-4a97-bbc4-6f7900a1870f,timestamp:1719081637
2024-06-22T18:40:37,310 [INFO ] W-9000-res50-trt_1.0 ACCESS_LOG - /127.0.0.1:49456 "PUT /predictions/res50-trt HTTP/1.1" 200 21453
2024-06-22T18:40:37,311 [INFO ] W-9000-res50-trt_1.0 TS_METRICS - Requests2XX.Count:1.0|#Level:Host|#hostname:ip-172-31-4-205,timestamp:1719081637
2024-06-22T18:40:37,311 [INFO ] W-9000-res50-trt_1.0 TS_METRICS - ts_inference_latency_microseconds.Microseconds:2.1448542294E7|#model_name:res50-trt,model_version:default|#hostname:ip-172-31-4-205,timestamp:1719081637
2024-06-22T18:40:37,311 [INFO ] W-9000-res50-trt_1.0 TS_METRICS - ts_queue_latency_microseconds.Microseconds:181.33|#model_name:res50-trt,model_version:default|#hostname:ip-172-31-4-205,timestamp:1719081637
2024-06-22T18:40:37,312 [DEBUG] W-9000-res50-trt_1.0 org.pytorch.serve.job.RestJob - Waiting time ns: 181330, Backend time ns: 21450549064
2024-06-22T18:40:37,312 [INFO ] W-9000-res50-trt_1.0 TS_METRICS - QueueTime.Milliseconds:0.0|#Level:Host|#hostname:ip-172-31-4-205,timestamp:1719081637
2024-06-22T18:40:37,312 [INFO ] W-9000-res50-trt_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 21447
2024-06-22T18:40:37,312 [INFO ] W-9000-res50-trt_1.0 TS_METRICS - WorkerThreadTime.Milliseconds:4.0|#Level:Host|#hostname:ip-172-31-4-205,timestamp:1719081637
{
  "tabby": 0.27221813797950745,
  "tiger_cat": 0.13754481077194214,
  "Egyptian_cat": 0.04620043560862541,
  "lynx": 0.003195191267877817,
  "lens_cap": 0.00225762533955276

Checklist: