jina-ai / clip-as-service

🏄 Scalable embedding, reasoning, ranking for images and sentences with CLIP
https://clip-as-service.jina.ai

Memory leak on some cloud providers but not others, likely from handling failed requests #424

Open csis0247 opened 4 years ago

csis0247 commented 4 years ago


System information

cat /etc/os-release

NAME="Ubuntu"
VERSION="16.04.6 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.6 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
VERSION_CODENAME=xenial
UBUNTU_CODENAME=xenial

http://[......]/status/server?_=[......]

{
   "ckpt_name":"bert_model.ckpt",
   "client":"bbcb77e6-1b05-428e-9829-1781a7bf279b",
   "config_name":"bert_config.json",
   "cors":"*",
   "cpu":false,
   "device_map":[

   ],
   "do_lower_case":true,
   "fixed_embed_length":false,
   "fp16":false,
   "gpu_memory_fraction":0.5,
   "graph_tmp_dir":null,
   "http_max_connect":10,
   "http_port":8001,
   "mask_cls_sep":false,
   "max_batch_size":256,
   "max_seq_len":256,
   "model_dir":"/app/temp/models/bert/chinese_L-12_H-768_A-12",
   "num_concurrent_socket":8,
   "num_process":3,
   "num_worker":1,
   "pooling_layer":[
      -2
   ],
   "pooling_strategy":2,
   "port":5555,
   "port_out":5556,
   "prefetch_size":10,
   "priority_batch_size":16,
   "python_version":"3.5.2 (default, Nov 12 2018, 13:43:14) \n[GCC 5.4.0 20160609]",
   "pyzmq_version":"17.1.2",
   "server_current_time":"2019-07-23 05:18:01.333383",
   "server_start_time":"2019-07-23 04:59:12.723367",
   "server_version":"1.9.6",
   "show_tokens_to_client":false,
   "statistic":{
      "avg_request_per_client":2.0,
      "max_request_per_client":2,
      "min_request_per_client":2,
      "num_active_client":0,
      "num_data_request":0,
      "num_max_request_per_client":1,
      "num_min_request_per_client":1,
      "num_sys_request":2,
      "num_total_client":1,
      "num_total_request":2,
      "num_total_seq":0
   },
   "status":200,
   "tensorflow_version":[
      "1",
      "12",
      "3"
   ],
   "tuned_model_dir":null,
   "ventilator -> worker":[
      "ipc://tmpSgGec6/socket",
      "ipc://tmpSmd58b/socket",
      "ipc://tmpM93W5h/socket",
      "ipc://tmp0a6P2n/socket",
      "ipc://tmpm8EKZt/socket",
      "ipc://tmpE1xGWz/socket",
      "ipc://tmpsZBDTF/socket",
      "ipc://tmpA3TBQL/socket"
   ],
   "ventilator <-> sink":"ipc://tmpCoApf0/socket",
   "verbose":false,
   "worker -> sink":"ipc://tmp4XOlmP/socket",
   "xla":false,
   "zmq_version":"4.2.5"
}
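
For reference, the statistics above come from the server's built-in HTTP status endpoint (served on http_port, 8001 here). Below is a minimal polling sketch that logs the request counters over time while watching memory; the host address is a placeholder for the actual pod address, not taken from the issue.

# Minimal sketch: poll the bert-serving HTTP status endpoint and log the
# request counters over time. The host below is a placeholder; the port
# matches the "http_port" value shown above.
import time
import requests

STATUS_URL = "http://127.0.0.1:8001/status/server"  # hypothetical address

while True:
    stats = requests.get(STATUS_URL, timeout=5).json()["statistic"]
    print(time.strftime("%H:%M:%S"),
          "total requests:", stats["num_total_request"],
          "data requests:", stats["num_data_request"],
          "active clients:", stats["num_active_client"])
    time.sleep(60)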

Description


I'm using the following command to start three BERT models on Kubernetes, in a Docker image based on this example:

bert-serving-start -num_worker=1 -port 5555 -port_out 5556 -http_port 8000 -max_seq_len 256 -model_dir /app/bert/chinese_L-12_H-768_A-12 & bert-serving-start [AnotherModelInAnotherPort] & bert-serving-start [AnotherModelInAnotherPort]

and calling the server via:

from bert_serving.client import BertClient

bc = BertClient(ip='127.0.0.1', port=5555, port_out=5556, check_version=False)
bc.encode(['hello world'])  # encode() expects a list of sentences

^ Not that it mattered. NO ONE was calling the server.

Then this issue shows up:

Video showing memory leak

I started 3 BERT models with bert-serving-start on Kubernetes in the same pod. Each server spawned 4 clusters of processes: one at about 1.2GB, two at about 100MB, and one that kept leaking memory. No client was connected to the server. The 3 processes highlighted in the video came from the 3 models respectively, each leaking at roughly the same rate and eventually running the server out of memory.

What was odd is that this severe memory leak was observed on AliCloud but not on AWS, despite the same codebase and the same Docker image. The only differences were the hardware and the memory size: the AliCloud machine had 8GB of memory versus 16GB on AWS. In both cases, the servers came up fine and were able to handle requests without memory issues.
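
To compare the leak rate on the two clouds, one option is to log the resident memory of every bert-serving process over time. A minimal sketch, assuming psutil is available inside the container; the process-matching heuristic is an assumption, not part of the original report:

# Minimal sketch: print the RSS of every bert-serving process once a minute,
# so the growth rate can be compared between the AliCloud and AWS deployments.
# Assumes psutil is installed; the cmdline match is a rough heuristic.
import time
import psutil

def bert_processes():
    for p in psutil.process_iter(attrs=["pid", "cmdline", "memory_info"]):
        cmdline = " ".join(p.info["cmdline"] or [])
        if "bert" in cmdline and "serving" in cmdline:
            yield p

while True:
    for p in bert_processes():
        rss_mb = p.info["memory_info"].rss / (1024 * 1024)
        print(time.strftime("%H:%M:%S"), "pid", p.info["pid"],
              "{:.0f} MiB".format(rss_mb))
    time.sleep(60)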

Hardware on AliCloud:

    description: Computer
    width: 64 bits
    capabilities: vsyscall32
  *-core
       description: Motherboard
       physical id: 0
     *-memory
          description: System memory
          physical id: 0
          size: 7821MiB
     *-cpu
          product: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
          vendor: Intel Corp.
          physical id: 1
          bus info: cpu@0
          width: 64 bits
          capabilities: fpu fpu_exception wp vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp x86-64 constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm ibrs ibpb stibp fsgsbase bmi1 avx2 smep bmi2 erms invpcid xsaveopt spec_ctrl intel_stibp

Hardware on AWS:

    description: Computer
    width: 64 bits
    capabilities: vsyscall32
  *-core
       description: Motherboard
       physical id: 0
     *-memory
          description: System memory
          physical id: 0
          size: 15GiB
     *-cpu
          product: Intel(R) Xeon(R) Platinum 8175M CPU @ 2.50GHz
          vendor: Intel Corp.
          physical id: 1
          bus info: cpu@0
          width: 64 bits
          capabilities: fpu fpu_exception wp vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp x86-64 constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke
csis0247 commented 4 years ago

I removed the HTTP component, and the leak is still occurring. There are now 3 clusters of processes per model.

Changes:

csis0247 commented 4 years ago

I tried using the vanilla Docker image to start only one model; the memory still leaked.

However, once we removed the external IP from the server, i.e. no access from the outside internet, the leak was gone.

My conclusion is therefore that someone was probing the server, probably at port 5555, and failing (CPU usage stayed quite low). Yet the memory leak was real: even though the probes never triggered the server to encode anything, memory usage kept increasing. Maybe the leak arises from handling failed requests?
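
To test this hypothesis, one way is to mimic such a probe: repeatedly open raw TCP connections to port 5555, send bytes that are not valid ZeroMQ traffic, and drop the connection while watching server memory. A minimal sketch, where the target address and payload are assumptions for illustration, not a confirmed reproduction:

# Minimal sketch: simulate an internet scanner hitting the ventilator port
# with garbage (non-ZeroMQ) data, then disconnecting. If server memory grows
# while this runs, failed/aborted requests are a plausible source of the leak.
# Host and payload are assumptions for illustration only.
import socket
import time

HOST, PORT = "127.0.0.1", 5555  # hypothetical target

for _ in range(1000):
    try:
        with socket.create_connection((HOST, PORT), timeout=2) as s:
            s.sendall(b"GET / HTTP/1.1\r\nHost: probe\r\n\r\n")  # garbage to ZMQ
    except OSError:
        pass  # refused/reset connections are fine; we only watch server memory
    time.sleep(0.1)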

boxabirds commented 4 years ago

Sorry I can't help with this, but I couldn't help noticing that your memory footprints are quite low compared to what they look like on my local dev machine. For English BERT-Large, it was a startup size of 16GB, stable at around 6-7GB. Is the Chinese model much smaller?

csis0247 commented 4 years ago

Sorry I can't help with this, but I couldn't help noticing that your memory footprints are quite low compared to what they look like on my local dev machine. For English BERT-Large, it was a startup size of 16GB, stable at around 6-7GB. Is the Chinese model much smaller?

I am using the BERT-Base model, which has only about a third of the parameters. So yes, the Chinese model is much smaller, because BERT-Large is not available for Chinese.