pytorch / serve

Serve, optimize and scale PyTorch models in production
https://pytorch.org/serve/
Apache License 2.0

Memory keeps increasing as sending requests - CPU Inference #1528

Closed. yangyang-nus-lv closed this issue 2 years ago.

yangyang-nus-lv commented 2 years ago

Context

I deployed the ResNet-18 eager-mode model from the examples on a local Linux CPU machine. By monitoring the memory usage, I found that it increases as requests are sent, and the memory of both the frontend and the service worker never gets released, even when no requests come in for a while.

Your Environment

Load Testing and Memory Usage Monitoring

I did the load testing this way: while serving the resnet example with one worker (default_workers_per_model=1 in config.properties), two related processes can be found via ps aux: the frontend and the service worker. I then monitored the memory usage of these two processes while continuously sending requests in a for loop (to my understanding this is not concurrent). The frontend memory keeps increasing and never shows any sign of declining. (See the first memory-usage graph.)

I also searched the existing issues and applied the LRU and vmargs settings suggested in other relevant threads. The memory usage then seems to get capped by the vmargs setting, but it still increases very slowly during long-term load testing, as shown in the second graph. The worker memory does show a release pattern. Exactly as in the example, the input image is always the same and of fixed size. (See the second and third memory-usage graphs.)
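For reference, a minimal sketch of the kind of sampling loop used for this monitoring (the pgrep patterns, interval, and log file name are illustrative, not the exact script):

#!/usr/bin/env bash
# Sketch: sample resident memory (RSS, in KiB) of the TorchServe frontend (Java)
# and the Python backend worker once per second. The pgrep patterns are
# illustrative; adjust them to whatever `ps aux` shows on your machine.
frontend_pid=$(pgrep -f 'org.pytorch.serve' | head -n1)
worker_pid=$(pgrep -f 'model_service_worker' | head -n1)
while true; do
    echo "$(date +%T) frontend_kb=$(ps -o rss= -p "$frontend_pid") worker_kb=$(ps -o rss= -p "$worker_pid")" >> mem_log.txt
    sleep 1
done

# In another shell, the sequential (non-concurrent) request loop:
for i in $(seq 1 100000); do
    curl -s -o /dev/null http://127.0.0.1:8080/predictions/resnet-18 -T kitten.jpg
done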

Expected Behavior vs Current Behavior

There are two things that confuse me:

  1. The frontend memory usage increases with incoming requests.
  2. During idle time, the memory usage does not decrease or get flushed.

Possible Solution

For LRU, I set it in the handler's initialize(): os.environ["LRU_CACHE_CAPACITY"] = "1". For vmargs, I set it in config.properties: vmargs=-Xmx256m.
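A minimal sketch of how these two settings fit together (assuming the environment variable is exported before TorchServe starts so the worker processes inherit it; setting it inside the handler's initialize() as described above targets the same knob):

#!/usr/bin/env bash
# Cap libtorch's LRU cache for the Python workers; exported before start-up,
# the variable is inherited by the worker processes (same knob as setting
# os.environ["LRU_CACHE_CAPACITY"] = "1" in the handler's initialize()).
export LRU_CACHE_CAPACITY=1

# The JVM heap cap lives in config.properties, e.g. vmargs=-Xmx256m
torchserve --start --ncs \
    --model-store model-store \
    --models resnet-18=resnet-18.mar \
    --ts-config config.properties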

lxning commented 2 years ago

@yangyang-nus-lv could you run the same test on Torchserve v0.5.3 to see if the issue is still there?

I ran a resnet-18 soak test on v0.5.3. The attached picture shows that memory usage is stable.

#!/usr/bin/env bash
while true; do
    ab -q -c 10 -n 1000000 -k -p examples/image_classifier/kitten.jpg -T application/jpg http://127.0.0.1:8080/predictions/resnet-18
done

Regarding frontend memory: the TorchServe frontend is Java-based, and Java application memory usage typically follows a wave pattern because of GC. You can set the following parameters in config.properties.

vmargs=-Xmx128m -XX:-UseLargePages -XX:+UseG1GC -XX:MaxMetaspaceSize=32M -XX:MaxDirectMemorySize=10m -XX:+ExitOnOutOfMemoryError
Screen Shot 2022-03-25 at 7 23 59 PM
yangyang-nus-lv commented 2 years ago

Hi @lxning, thanks for your reply. I also ran a test on the official Docker image (docker pull pytorch/torchserve:0.5.3-cpu) with your JVM settings. The total memory-usage curve still goes up once load testing starts, and it did not drop after the testing stopped. Both the increasing trend and the lack of a drop are the same issues we raised here. May I ask what the decline in your memory-usage graph corresponds to? Did it happen immediately after you stopped sending queries?

memory_usage_0122_3003

yangyang-nus-lv commented 2 years ago

Hi @lxning, here are the steps I used to run the test in a Docker container.

  1. Prepare model files. To keep paths short, I created the following layout in my workspace directory:
    .
    ├── model-store
    │   └── resnet-18.mar
    ├── request.sh
    ├── resnet
    │   ├── config.properties
    │   ├── index_to_name.json
    │   ├── kitten.jpg
    │   ├── model.py
    │   └── resnet18-f37072fd.pth

    Inside the resnet folder, the model file and the other files are copied from serve/examples/image_classifier.

    # in /my_workspace/, clone torchserve
    git clone https://github.com/pytorch/serve.git
    # copy examples
    mkdir model-store
    mkdir resnet
    cp ./serve/examples/image_classifier/resnet_18/model.py resnet/model.py
    cp ./serve/examples/image_classifier/index_to_name.json resnet/index_to_name.json
    cp ./serve/examples/image_classifier/kitten.jpg resnet/kitten.jpg
    wget https://download.pytorch.org/models/resnet18-f37072fd.pth -P ./resnet
  2. Archive Mar file
    torch-model-archiver --model-name resnet-18 \
    --export-path ./model-store \
    --version 1 \
    --model-file resnet/model.py  \
    --serialized-file resnet/resnet18-f37072fd.pth \
    --handler image_classifier \
    --extra-files resnet/index_to_name.json -f

    I ran this command to archive the .mar file in a Python 3.9.5 virtual environment with the following packages:

    Package                 Version
    ----------------------- ---------
    torch                   1.10.0
    torch-model-archiver    0.5.3
    torch-workflow-archiver 0.2.3
    torchaudio              0.10.0
    torchserve              0.5.3
    torchtext               0.11.0
    torchvision             0.11.1
  3. Start a Docker container to run model serving. The hardware I used is an Intel-CPU MacBook Pro (2019). I then ran this from the workspace above:
    # MacBook Pro (16-inch, 2019); Processor 2.6 GHz 6-Core Intel Core i7; Memory 16 GB 2667 MHz DDR4
    docker run --rm -it -p 8080:8080 -p 8081:8081 -p 8082:8082 --name resnet -v $(pwd)/model-store:/home/model-server/model-store -v $(pwd)/resnet:/home/model-server/resnet  pytorch/torchserve:0.5.3-cpu torchserve --start --model-store model-store --models resnet-18=resnet-18.mar --ts-config resnet/config.properties --ncs

    where the config.properties was set as:

    inference_address=https://0.0.0.0:8080
    management_address=https://0.0.0.0:8081
    metrics_address=https://0.0.0.0:8082
    default_workers_per_model=1
    enable_envvars_config=true
    vmargs=-Xmx128m -XX:-UseLargePages -XX:+UseG1GC -XX:MaxMetaspaceSize=32M -XX:MaxDirectMemorySize=10m -XX:+ExitOnOutOfMemoryError
  4. Load testing and Memory Usage Monitoring
    # outside the docker container, monitoring memory usage
    while true; do docker stats --no-stream | grep resnet | awk -v date="$(date +%T)" '{print $4, date}' | sed -e 's/MiB//g' >> mem_log.txt; done
    # load testing, edited in `/my_workspace/request.sh`
    #!/usr/bin/env bash
    ab -q -c 5 -n 100000 -k -p resnet/kitten.jpg -T application/jpg https://127.0.0.1:8080/predictions/resnet-18

    Then I got the memory-usage curve shown in the comment above, where the red arrows mark the start and stop times of the load-testing requests.

lxning commented 2 years ago

@yangyang-nus-lv Thank you for the detailed information.

TorchServe allocates a data queue for each model. The queue capacity is 100 by default and is configurable (see job_queue_size). The queue can be reclaimed only when the model is unregistered; that is why you can see the memory usage drop in my test case.
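For reference, the knob lives in config.properties; the value shown below is just the documented default:

# config.properties (sketch) -- capacity of the frontend job queue per model
job_queue_size=100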

I can see the following settings in your test case:

  • default_workers_per_model=1
  • job_queue_size=100 (default)
  • ab client concurrency is 5

"#workers=1 < ab clients=5" causes incoming requests to accumulate in the data queue, so the memory keeps increasing until the queue hits its capacity, after which more and more data stays in the Netty server queue. The data queue is not reclaimed even when there is no incoming data any more; the memory can be reclaimed when the model is unregistered.

I did some testing on an EC2 m6i.4xlarge instance again. It shows that the memory is reclaimed once the model is unregistered.

  1. Start torchserve: <== MemoryUtilization.Percent:2.0

  2. Register resnet-18 with 16 workers: <== MemoryUtilization.Percent:6.5
     curl -v -X POST "http://localhost:8081/models?initial_workers=16&synchronous=false&url=resnet-18.mar"

  3. Inference request: <== MemoryUtilization.Percent:7.9
     ab -q -c 10 -n 100000 -k -p examples/image_classifier/kitten.jpg -T application/jpg http://127.0.0.1:8080/predictions/resnet-18

  4. Unregister resnet-18: <== MemoryUtilization.Percent:3.1
     curl -X DELETE http://localhost:8081/models/resnet-18

  5. Re-register resnet-18 with 16 workers: <== MemoryUtilization.Percent:7.5
     curl -v -X POST "http://localhost:8081/models?initial_workers=16&synchronous=false&url=resnet-18.mar"

  6. Force GC: <== MemoryUtilization.Percent:6.5
     jmap pid GC.RUN

  7. Inference request: <== MemoryUtilization.Percent:6.9
     ab -q -c 1 -n 10 -k -p examples/image_classifier/kitten.jpg -T application/jpg http://127.0.0.1:8080/predictions/resnet-18

jlloh commented 2 years ago

Hi @lxning , can you clarify for me and @yangyang-nus-lv what job_queue_size does? In the documentation, it says:

Number inference jobs that frontend will queue before backend can serve. Default: 100.

What happens when it hits 100? From the wording above, I would read it as requests only starting to reach the backend/inference once the queue hits 100, but that can't be the case. I would expect it to start dropping requests once the queue is full.

Also, even if unregistering the model frees up memory because it clears the queue, would you suggest periodically re-registering the model to free up memory?

yangyang-nus-lv commented 2 years ago

Hi @lxning, I reproduced this by following your procedure. As shown below, I tagged the step corresponding to each segment of the memory-usage curve.

memory_usage_0122_3003 copy

# config
inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8082
number_of_netty_threads=8
job_queue_size=100
vmargs=-Xmx128m -XX:-UseLargePages -XX:+UseG1GC -XX:MaxMetaspaceSize=32M -XX:MaxDirectMemorySize=10m -XX:+ExitOnOutOfMemoryError
prefer_direct_buffer=True
default_response_timeout=300
unregister_model_timeout=300
install_py_dep_per_model=true
# init workers
curl -v -X POST "http://localhost:8081/models?initial_workers=4&synchronous=false&url=resnet-18.mar"
# first inference
ab -q -c 2 -n 5000 -k -p resnet/kitten.jpg -T application/jpg http://127.0.0.1:8080/predictions/resnet-18
# second inference
ab -q -c 1 -n 5000 -k -p resnet/kitten.jpg -T application/jpg http://127.0.0.1:8080/predictions/resnet-18

I have some observations and questions:

Basically, we would not want to re-register the model in production, since it interrupts the service. However, it seems that (manual or scheduled) re-registering plus a Java GC together is the only way to get memory-stable model serving, because re-registering without a Java GC still settles at a higher level. Is my understanding correct? We sincerely hope there is a smoother way out of this.
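For what it is worth, the scheduled workaround being discussed here would amount to something like the sketch below, reusing the management API calls shown earlier in this thread (the interval and worker count are arbitrary, and serving is interrupted while the model is unregistered):

#!/usr/bin/env bash
# Sketch of the periodic unregister/re-register workaround (not a recommendation):
# unregistering frees the model's job queue, re-registering brings the workers back.
while true; do
    sleep 3600
    curl -s -X DELETE "http://localhost:8081/models/resnet-18"
    curl -s -X POST "http://localhost:8081/models?initial_workers=4&synchronous=true&url=resnet-18.mar"
done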

HamidShojanazeri commented 2 years ago

Thanks @yangyang-nus-lv and @jlloh for sharing the configs. You should not need to re-register the model; if the root cause of this issue is the behavior where the queue can be reclaimed only when the model is unregistered, we will need to revisit it. We are looking into this issue and will update you.

lxning commented 2 years ago

The following graph shows the memory being released. Detailed information can be found here.

Screen Shot 2022-04-07 at 9 21 53 PM
yangyang-nus-lv commented 2 years ago

@lxning thank you for the investigation. I still have some questions; could you help me understand the graphs? First, as we can see from the three memory-allocation breakdown graphs, the total allocated memory actually doubles from the 1st run to the 2nd. Also, the number of KQueue threads goes up from 4 (both the concurrency and the number of workers are set to 4) to 8. So what does the total allocation represent here; is it a momentary value for some time frame? It does not seem to match the heap graph, since the final total memory is around 140 MB while the maximum heap usage is under 80 MB. And the heap curve keeps increasing linearly during idle time; what explains that? Thanks in advance.

lxning commented 2 years ago

@yangyang-nus-lv Here are the replies to your questions:

  1. "the total allocated memory actually gets doubled" In visualVM, "total allocated memory" shows how much memory was allocated during your sampling. Basically, this number is keep increasing as long as the sampling is running.

  2. "the number of KQueue goes up" the number of KQueueEventLoopGroup threads is configured by "number_of_netty_threads" and "netty_client_threads" (see config ). In my test, both of them use default value (ie. #core=16). With the incoming message rate, the number of live KQueueEventLoop is dynamic.

  3. "And the heap curve during idle time is increasing linearly" This is triggered by RMI TCP Connection thread which is used to feed visualVM (ie. my profiling tool) with data from TorchServe JVM. It is not continuously linearly increasing. It recovers every a few minutes (see the tailer part in figure1).

arnavmehta7 commented 1 year ago

This still seems to be an issue.

arnavmehta7 commented 1 year ago

Quoting @lxning's earlier reply:

@yangyang-nus-lv Thank you for the detailed information.

TorchServe allocates a data queue for each model. The queue capacity size is 100 by default. It is configurable (see job_queue_size). The queue can be reclaimed only if the model is unregistered. That's why you can see the memory usage drop in my test case.

I can see the following setting in your test case:

  • "default_workers_per_model=1"
  • job_queue_size=100 (default)
  • ab client concurrency is 5

"#workers=1 < ab clients=5" causes the incoming requests to be accumulated in the data queue. So you can see the memory keep increasing until the queue hits the capacity size and more and more data will stay in netty server queue. The data queue will not be reclaimed even though there is no incoming data any more. The memory can be reclaimed when the model is unregistered.

I did some testing on an EC2 m6i.4xlarge instance again. It shows that the memory is reclaimed once the model is unregistered.

  • config.properties
inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8082
number_of_netty_threads=32
job_queue_size=100
vmargs=-Xmx4g -XX:-UseLargePages -XX:+UseG1GC -XX:MaxMetaspaceSize=32M -XX:MaxDirectMemorySize=10m -XX:+ExitOnOutOfMemoryError
prefer_direct_buffer=True
default_response_timeout=300
unregister_model_timeout=300
install_py_dep_per_model=true
  • steps:
  1. export OMP_THREAD_LIMIT=1
  2. start torchserve: <== MemoryUtilization.Percent:2.0
  3. register resnet-18 with 16 workers: <== MemoryUtilization.Percent:6.5 curl -v -X POST "http://localhost:8081/models?initial_workers=16&synchronous=false&url=resnet-18.mar"
  4. inference request: <== MemoryUtilization.Percent:7.9 ab -q -c 10 -n 100000 -k -p examples/image_classifier/kitten.jpg -T application/jpg http://127.0.0.1:8080/predictions/resnet-18
  5. unregister resnet-18: <== MemoryUtilization.Percent:3.1 curl -X DELETE http://localhost:8081/models/resnet-18
  6. re-register resnet-18 with 16 workers: <== MemoryUtilization.Percent:7.5 curl -v -X POST "http://localhost:8081/models?initial_workers=16&synchronous=false&url=resnet-18.mar"
  7. force gc: <== MemoryUtilization.Percent:6.5 jmap pid GC.RUN
  8. inference request: <== MemoryUtilization.Percent:6.9 ab -q -c 1 -n 10 -k -p examples/image_classifier/kitten.jpg -T application/jpg http://127.0.0.1:8080/predictions/resnet-18

When the queue is full, is the memory then reclaimed automatically?