Closed yangyang-nus-lv closed 2 years ago
@yangyang-nus-lv could you run the same test on Torchserve v0.5.3 to see if the issue is still there?
I ran resnet-18 soak test on v0.5.3. The attached picture shows memory usage is stable.
#!/usr/bin/env bash
while true; do
ab -q -c 10 -n 1000000 -k -p examples/image_classifier/kitten.jpg -T application/jpg http://127.0.0.1:8080/predictions/resnet-18
done
Regarding frontend memory, Torchserve frontend is a Java based frontend. Usually Java application memory usage is wave style due to GC. You can set the following parameters in config.properties.
vmargs=-Xmx128m -XX:-UseLargePages -XX:+UseG1GC -XX:MaxMetaspaceSize=32M -XX:MaxDirectMemorySize=10m -XX:+ExitOnOutOfMemoryError
Hi @lxning Thanks for your reply. I also did a test on the official docker docker pull pytorch/torchserve:0.5.3-cpu
with your JVM setting. The total memory usage curve seems to go up when starting load testing. However, it did not drop as the testing stopped. Both increasing trend and non-drop are the same issues as we raised here. May I ask what is the decline meaning on your memory usage graph? Did it happen immediately after stopping query?
Hi @lxning , here are the steps I did the test on a docker container.
.
├── model-store
│ └── resnet-18.mar
├── request.sh
├── resnet
│ ├── config.properties
│ ├── index_to_name.json
│ ├── kitten.jpg
│ ├── model.py
│ └── resnet18-f37072fd.pth
Inside resnet
folder, the model file and others are copied from serve/examples/image_classifier
.
# in /my_workspace/, clone torchserve
git clone https://github.com/pytorch/serve.git
# copy examples
mkdir model_store
mkdir resnet
cp ./serve/examples/image_classifier/resnet_18/model.py resnet/model.py
cp ./serve/examples/image_classifier/index_to_name.json resnet/index_to_name.json
cp ./serve/examples/image_classifier/kitten.jpg resnet/kitten.jpg
wget https://download.pytorch.org/models/resnet18-f37072fd.pth -P ./resnet/resnet18-f37072fd.pth
torch-model-archiver --model-name resnet-18 \
--export-path ./model-store \
--version 1 \
--model-file resnet/model.py \
--serialized-file resnet/resnet18-f37072fd.pth \
--handler image_classifier \
--extra-files resnet/index_to_name.json -f
I run this command to archive the mar file under a python 3.9.5 virtual env:
Package Version
----------------------- ---------
torch 1.10.0
torch-model-archiver 0.5.3
torch-workflow-archiver 0.2.3
torchaudio 0.10.0
torchserve 0.5.3
torchtext 0.11.0
torchvision 0.11.1
# MacBook Pro (16-inch, 2019); Processor 2.6 GHz 6-Core Intel Core i7; Memory 16 GB 2667 MHz DDR4
docker run --rm -it -p 8080:8080 -p 8081:8081 -p 8082:8082 --name resnet -v $(pwd)/model-store:/home/model-server/model-store -v $(pwd)/resnet:/home/model-server/resnet pytorch/torchserve:0.5.3-cpu torchserve --start --model-store model-store --models resnet-18=resnet-18.mar --ts-config resnet/config.properties --ncs
where the config.properties was set as:
inference_address=https://0.0.0.0:8080
management_address=https://0.0.0.0:8081
metrics_address=https://0.0.0.0:8082
default_workers_per_model=1
enable_envvars_config=true
vmargs=-Xmx128m -XX:-UseLargePages -XX:+UseG1GC -XX:MaxMetaspaceSize=32M -XX:MaxDirectMemorySize=10m -XX:+ExitOnOutOfMemoryError
# outside the docker container, monitoring memory usage
while true; do docker stats --no-stream | grep resnet | awk -v date="$(date +%T)" '{print $4, date}' | sed -e 's/MiB//g' >> mem_log.txt; done
# load testing, edited in `/my_workspace/request.sh`
#!/usr/bin/env bash
ab -q -c 5 -n 100000 -k -p resnet/kitten.jpg -T application/jpg https://127.0.0.1:8080/predictions/resnet-18
Then I got the memory usage curve shown in above comment, where red arrow shows the starting and stopping time of load testing request.
@yangyang-nus-lv Thank you for the detail information.
TorchServe allocates a data queue for each model. The queue capacity size is 100 by default. It is configurable (see job_queue_size). The queue can be reclaimed only if the model is unregistered. That's why you can see the memory usage drop in my test case.
I can see the following setting in your test case:
"#workers=1 < ab clients=5" causes the incoming requests to be accumulated in the data queue. So you can see the memory keep increasing until the queue hits the capacity size and more and more data will stay in netty server queue. The data queue will not be reclaimed even though there is no incoming data any more. The memory can be reclaimed when the model is unregistered.
I did some testing on EC2 m6i.4xlarge instance again. It can prove the memory is reclaimed once the model is unregistered.
inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8082
number_of_netty_threads=32
job_queue_size=100
vmargs=-Xmx4g -XX:-UseLargePages -XX:+UseG1GC -XX:MaxMetaspaceSize=32M -XX:MaxDirectMemorySize=10m -XX:+ExitOnOutOfMemoryError
prefer_direct_buffer=True
default_response_timeout=300
unregister_model_timeout=300
install_py_dep_per_model=true
start torchserve: <== MemoryUtilization.Percent:2.0
register resnet-18 with 16 workers: <== MemoryUtilization.Percent:6.5 curl -v -X POST "http://localhost:8081/models?initial_workers=16&synchronous=false&url=resnet-18.mar"
inference request: <== MemoryUtilization.Percent:7.9 ab -q -c 10 -n 100000 -k -p examples/image_classifier/kitten.jpg -T application/jpg http://127.0.0.1:8080/predictions/resnet-1
unregister resnet-18: <== MemoryUtilization.Percent:3.1 curl -X DELETE http://localhost:8081/models/resnet-18
re-register resnet-18 with 16 workers: <== MemoryUtilization.Percent:7.5 curl -v -X POST "http://localhost:8081/models?initial_workers=16&synchronous=false&url=resnet-18.mar"
force gc: <== MemoryUtilization.Percent:6.5 jmap pid GC.RUN
inference request: <== MemoryUtilization.Percent:6.9 ab -q -c 1 -n 10 -k -p examples/image_classifier/kitten.jpg -T application/jpg http://127.0.0.1:8080/predictions/resnet-18
Hi @lxning , can you clarify for me and @yangyang-nus-lv what job_queue_size
does? In the documentation, it says:
Number inference jobs that frontend will queue before backend can serve. Default: 100.
What happens when it hits 100? From the wording above, I interpret it as it will only start reaching the backend/inferencing once it hits 100, but this can't be the case. I would expect that it would start dropping messages if the queue is full.
Also, even if unregistering the model frees up memory as it clears the queue, would you suggest that periodically reregister the model to free up memory?
Hi @lxning I reproduced it by following your procedures. As shown below, I tagged the step for each segment of the memory usage curve.
# config
inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8082
number_of_netty_threads=8
job_queue_size=100
vmargs=-Xmx128m -XX:-UseLargePages -XX:+UseG1GC -XX:MaxMetaspaceSize=32M -XX:MaxDirectMemorySize=10m -XX:+ExitOnOutOfMemoryError
prefer_direct_buffer=True
default_response_timeout=300
unregister_model_timeout=300
install_py_dep_per_model=true
# init workers
curl -v -X POST "http://localhost:8081/models?initial_workers=4&synchronous=false&url=resnet-18.mar"
# first inference
ab -q -c 2 -n 5000 -k -p resnet/kitten.jpg -T application/jpg http://127.0.0.1:8080/predictions/resnet-18
# second inference
ab -q -c 1 -n 5000 -k -p resnet/kitten.jpg -T application/jpg http://127.0.0.1:8080/predictions/resnet-18
I have some observations and questions:
jmap <pid> GC.RUN
command in the official docker. What is correct one in openjdk-11 env to force collect java garbage? I am not familiar with Java stuff.Basically, we would not like to re-register the model in productions, since it interrupts the services. However it seems (manually or scheduled) 're-registering + Java GC' together is the only solution for a memory stable model serving, cause 're-registering without Java GC' still hits a higher level. Is my understanding correct? We sincerely hope there is a smoother way help us out.
Thanks @yangyang-nus-lv and @jlloh for sharing the configs, you should not need to re-register the model, if the root cause of this issue is the in the behavior that the queue can be reclaimed only if the model is unregistered, we would need to revisit it. We are looking into this issue and will update you.
The following graph shows the memory gets released. Detailed information can be found at Here.
@lxning thank you for the investigation. I may still have some questions, could you help me understand about the graphs? First, as we can see from the three memory allocation break down graph, the total allocated memory actually gets doubled from 1st time run to 2nd time. Also the number of KQueue goes up from 4 (both the concurrency and workers are set to be 4) to 8. So what is the total allocation here, is it a temporary value at some time frame? It seems not shown on the heap graph, as the last total memory is around 140Mb while the maximum usage on heap is sub 80Mb. And the heap curve during idle time is increasing linearly, what is reason explains that? Thanks in advance.
@yangyang-nus-lv Here is the reply for your questions
"the total allocated memory actually gets doubled" In visualVM, "total allocated memory" shows how much memory was allocated during your sampling. Basically, this number is keep increasing as long as the sampling is running.
"the number of KQueue goes up" the number of KQueueEventLoopGroup threads is configured by "number_of_netty_threads" and "netty_client_threads" (see config ). In my test, both of them use default value (ie. #core=16). With the incoming message rate, the number of live KQueueEventLoop is dynamic.
"And the heap curve during idle time is increasing linearly" This is triggered by RMI TCP Connection thread which is used to feed visualVM (ie. my profiling tool) with data from TorchServe JVM. It is not continuously linearly increasing. It recovers every a few minutes (see the tailer part in figure1).
Guys this seem to still be a issue
@yangyang-nus-lv Thank you for the detail information.
TorchServe allocates a data queue for each model. The queue capacity size is 100 by default. It is configurable (see job_queue_size). The queue can be reclaimed only if the model is unregistered. That's why you can see the memory usage drop in my test case.
I can see the following setting in your test case:
- "default_workers_per_model=1"
- job_queue_size=100 (default)
- ab client concurrency is 5
"#workers=1 < ab clients=5" causes the incoming requests to be accumulated in the data queue. So you can see the memory keep increasing until the queue hits the capacity size and more and more data will stay in netty server queue. The data queue will not be reclaimed even though there is no incoming data any more. The memory can be reclaimed when the model is unregistered.
I did some testing on EC2 m6i.4xlarge instance again. It can prove the memory is reclaimed once the model is unregistered.
- config.properties
inference_address=http://0.0.0.0:8080 management_address=http://0.0.0.0:8081 metrics_address=http://0.0.0.0:8082 number_of_netty_threads=32 job_queue_size=100 vmargs=-Xmx4g -XX:-UseLargePages -XX:+UseG1GC -XX:MaxMetaspaceSize=32M -XX:MaxDirectMemorySize=10m -XX:+ExitOnOutOfMemoryError prefer_direct_buffer=True default_response_timeout=300 unregister_model_timeout=300 install_py_dep_per_model=true
- steps:
- export OMP_THREAD_LIMIT=1
- start torchserve: <== MemoryUtilization.Percent:2.0
- register resnet-18 with 16 workers: <== MemoryUtilization.Percent:6.5 curl -v -X POST "http://localhost:8081/models?initial_workers=16&synchronous=false&url=resnet-18.mar"
- inference request: <== MemoryUtilization.Percent:7.9 ab -q -c 10 -n 100000 -k -p examples/image_classifier/kitten.jpg -T application/jpg http://127.0.0.1:8080/predictions/resnet-1
- unregister resnet-18: <== MemoryUtilization.Percent:3.1 curl -X DELETE http://localhost:8081/models/resnet-18
- re-register resnet-18 with 16 workers: <== MemoryUtilization.Percent:7.5 curl -v -X POST "http://localhost:8081/models?initial_workers=16&synchronous=false&url=resnet-18.mar"
- force gc: <== MemoryUtilization.Percent:6.5 jmap pid GC.RUN
- inference request: <== MemoryUtilization.Percent:6.9 ab -q -c 1 -n 10 -k -p examples/image_classifier/kitten.jpg -T application/jpg http://127.0.0.1:8080/predictions/resnet-18
When the queue is full, then memory reclaims automatically?
Context
I deployed the Resnet-18 eager mode model from the examples on local linux CPU machine. By monitoring the memory usage, I found that it is increasing as sending requests. And the memory usage of both frontend and service worker never get released even if no request incomes for a while.
Your Environment
Installed using source? [yes/no]: yes
Are you planning to deploy it using docker container? [yes/no]: yes
Is it a CPU or GPU environment?: CPU
Using a default/custom handler? [If possible upload/share custom handler/model]: Default
What kind of model is it e.g. vision, text, audio?: Vision
Are you planning to use local models from model-store or public url being used e.g. from S3 bucket etc.? [If public url then provide link.]: Resnet-18
Provide config.properties, logs [ts.log] and parameters used for model registration/update APIs:
Load Testing and Memory Usage Monitoring
I did the load testing in this way: As serving the resnet example on one worker by specifying
default_workers_per_model=1
inconfig.properties
, there are two related processes can be found byps aux
frontend and service worker. Then I monitor these two's memory usage as continuously sending request in a for loop (as my understanding it is not concurrent). The frontend memory is increasing and never shows a sign to decline.While I did a search on raised issues, after setting LRU and vmargs as suggested in other relevant issues. The memory usage seems to get capped by vmargs setting, but still increases very slowly when doing long term load testing shown at the second graph below. The worker memory shows a release manner. As exactly same as the example, the input image is always same and fixed size.
Expected Behavior vs Current Behavior
There are two things I may get confused about:
Possible Solution
For LRU, I set it in handler initialise(),
os.environ["LRU_CACHE_CAPACITY"] = "1"
For vmargs, I set it inconfig.properties
,vmargs=-Xmx256m