leo-project / leofs

The LeoFS Storage System
https://leo-project.net/leofs/
Apache License 2.0

[leo_gateway] PUT/POST a large file causes higher load and memory consumption #984

Open mocchira opened 6 years ago

mocchira commented 6 years ago

Background shared by @vstax

The POST/PUT operations for large objects (which are stored as multipart objects) create a higher load on leo_gateway than (client-assisted) multipart uploads in S3 mode do. Memory requirements increase significantly, especially when uploading very large objects.

Solution

This is probably caused by long-running Erlang processes (spawned by ranch and leo_large_object_put_handler) holding on to the data received from a client. Invoking garbage_collect on both processes manually at regular intervals could mitigate the high memory consumption.
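
As a rough sketch of that idea (a hypothetical helper, not the actual LeoFS change), a timer can force a GC on a long-running process at a fixed interval; the 10-second default here is just an example value.

-module(periodic_gc_sketch).
-export([start/1, start/2]).

%% Schedule a periodic erlang:garbage_collect/1 on the given process.
%% Returns {ok, TRef}; cancel with timer:cancel/1.
start(Pid) ->
    start(Pid, timer:seconds(10)).

start(Pid, IntervalMs) ->
    timer:apply_interval(IntervalMs, erlang, garbage_collect, [Pid]).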

vstax commented 6 years ago

A good way to see this is to create a large file, e.g.

dd if=/dev/zero of=500mb bs=1M count=1 seek=499

Then upload it in parallel with

for a in `seq 1 10`; do curl -v -X PUT --data-binary @500mb http://192.168.3.52:8080/test/500mb-$a & done

Near the end of each file's upload the gateway will be consuming gigabytes of memory (this is easier to see if the connection speed to the gateway is limited, e.g. on a 100 Mbit network or with a tool like trickle).
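
While the uploads are in flight, the growth can be watched from the gateway's Erlang console; the calls below are illustrative (erlang:memory/1 is OTP, the recon helpers come from the recon library used later in this thread).

erlang:memory(binary).        %% total bytes currently held in binaries
recon:proc_count(memory, 5).  %% top 5 processes by memory usage
recon:bin_leak(5).            %% processes releasing the most binaries after a forced GC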

mocchira commented 6 years ago

@vstax Thanks for sharing. (trickle looks pretty handy compared to other similar tools.)

mocchira commented 6 years ago

WIP. The simple hack below doesn't solve the problem.

diff --git a/apps/leo_gateway/src/leo_gateway_http_commons.erl b/apps/leo_gateway/src/leo_gateway_http_commons.erl
index 47b631f..e170b49 100644
--- a/apps/leo_gateway/src/leo_gateway_http_commons.erl
+++ b/apps/leo_gateway/src/leo_gateway_http_commons.erl
@@ -894,6 +894,7 @@ put_large_object_1({more, Data, Req},
                                   transfer_decode_fun = TransferDecodeFun,
                                   transfer_decode_state = TransferDecodeState
                                  } = ReqLargeObj) ->
+    erlang:garbage_collect(),
     case catch leo_large_object_put_handler:put(Handler, Data) of
         ok ->
             BodyOpts = [{length, ReadingChunkedSize},
diff --git a/apps/leo_gateway/src/leo_large_object_put_handler.erl b/apps/leo_gateway/src/leo_large_object_put_handler.erl
index 4862048..88736a5 100644
--- a/apps/leo_gateway/src/leo_large_object_put_handler.erl
+++ b/apps/leo_gateway/src/leo_large_object_put_handler.erl
@@ -173,6 +173,7 @@ handle_call({put, Bin}, _From, #state{bucket_info = BucketInfo,
                                                      total_len = TotalLen_1,
                                                      monitor_set = MonitorSet}}
                 end,
+            erlang:garbage_collect(),
             {reply, Ret, State_1};
         false ->
             {reply, ok, State#state{stacked_bin = Bin_1,

I will look into it further with recon later.

mocchira commented 6 years ago

With recon, it turned out there are memory fragmentation problems while handling PUTs of large objects, as shown below (vCPU: 2):

(gateway_0@127.0.0.1)32> recon_alloc:fragmentation(current).
[{{binary_alloc,1},
  [{sbcs_usage,0.9523951492030105},
   {mbcs_usage,0.07784016927083333},
   {sbcs_block_size,68158456},
   {sbcs_carriers_size,71565312},
   {mbcs_block_size,168344},
   {mbcs_carriers_size,2162688}]},
 {{binary_alloc,2},
  [{sbcs_usage,0.9523882184709821},
   {mbcs_usage,0.2059326171875},
   {sbcs_block_size,104858400},
   {sbcs_carriers_size,110100480},
   {mbcs_block_size,13496},
   {mbcs_carriers_size,65536}]},
(gateway_0@127.0.0.1)6> recon_alloc:fragmentation(current).
[{{binary_alloc,1},
  [{sbcs_usage,0.9523882184709821},
   {mbcs_usage,0.273681640625},
   {sbcs_block_size,115344240},
   {sbcs_carriers_size,121110528},
   {mbcs_block_size,161424},
   {mbcs_carriers_size,589824}]},
 {{binary_alloc,2},
  [{sbcs_usage,0.9523970831008185},
   {mbcs_usage,0.1814236111111111},
   {sbcs_block_size,104859376},
   {sbcs_carriers_size,110100480},
   {mbcs_block_size,107008},
   {mbcs_carriers_size,589824}]},

As shown above, the low mbcs_usage indicates a typical fragmentation symptom. In addition, leo_gateway holds many large binaries under high load, so the default sbct (single block carrier threshold: 512 KB) pushes many binaries into single block carriers (sbcs), which is not preferable: mbcs are preferable to sbcs because mbcs basically represent pre-allocated memory, whereas sbcs map to calls to sys_alloc or mseg_alloc, which are more expensive than redistributing data obtained for multiblock carriers.

That being said, this issue might be more or less a memory tuning problem.
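
For reference, the allocator state and the sbct setting can be inspected from the console with calls like these (recon_alloc ships with recon; erlang:system_info/1 is OTP).

recon_alloc:fragmentation(current).            %% per-instance sbcs/mbcs usage, as shown above
recon_alloc:memory(usage).                     %% ratio of used to allocated memory
recon_alloc:sbcs_to_mbcs(current).             %% allocators ranked by sbcs-to-mbcs ratio
erlang:system_info({allocator, binary_alloc}). %% raw binary_alloc options, including sbct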

mocchira commented 6 years ago

I applied the patch below to leo_gateway.schema:

%% tunes for memory fragmentation
{mapping,
 "erlang.memory.binary.alloc_strategy",
 "vm_args.+MBas",
 [
  {datatype, {enum, [bf, aobf, aoff, aoffcbf, aoffcaobf, gf, af]}},
  {default, 'aobf'}
 ]}.

%% tunes for handling large sized binaries
{mapping,
 "erlang.memory.binary.sbct",
 "vm_args.+MBsbct",
 [
  {datatype, integer},
  {default, "2147483648"}
 ]}.

{mapping,
 "erlang.memory.binary.lmbcs",
 "vm_args.+MBlmbcs",
 [
  {datatype, integer},
  {default, "20480"}
 ]}.

{mapping,
 "erlang.memory.binary.smbcs",
 "vm_args.+MBsmbcs",
 [
  {datatype, integer},
  {default, "1024"}
 ]}.
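
Assuming cuttlefish translates these mappings one-to-one, the defaults above should end up in vm.args roughly as the following emulator flags: the aobf allocation strategy, an sbct so large that binaries effectively never go into single block carriers, and 20 MB / 1 MB largest/smallest multiblock carrier sizes.

+MBas aobf
+MBsbct 2147483648
+MBlmbcs 20480
+MBsmbcs 1024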

With the patch applied, recon reports the following:

(gateway_0@127.0.0.1)5> recon_alloc:fragmentation(current).
[{{binary_alloc,1},
  [{sbcs_usage,1.0},
   {mbcs_usage,0.46640138992469443},
   {sbcs_block_size,0},
   {sbcs_carriers_size,0},
   {mbcs_block_size,78768792},
   {mbcs_carriers_size,168886272}]},
 {{binary_alloc,2},
  [{sbcs_usage,1.0},
   {mbcs_usage,0.6530034128528812},
   {sbcs_block_size,0},
   {sbcs_carriers_size,0},
   {mbcs_block_size,110283312},
   {mbcs_carriers_size,168886272}]},

The fragmentation problem is gone and every binary is now stored in mbcs as I intended; however, memory usage stays high during the PUT workload, so I will keep digging.

mocchira commented 6 years ago

Reference (MUST read):

mocchira commented 6 years ago

Since a PUT of a large object is handled in parallel on leo_gateway (with as many processes as possible spawned through leo_pod), this might affect the maximum memory usage, so I'll run benchmarks comparing the current implementation (parallel) with a sequential one in terms of maximum memory usage, throughput, ops and latency. Depending on the result, we may decide to make it configurable whether a PUT of a large object is handled in parallel or sequentially.

windkit commented 6 years ago

First of all, I want to ask how bad it was. We did solve a similar problem with the large object write handler in 1.3.1: https://github.com/leo-project/leofs/issues/570

At the same time, I will try to re-do the test to check for degradation in this aspect.

mocchira commented 6 years ago

@windkit

First of all, I want to ask how bad it was.

Try vstax's procedure (https://github.com/leo-project/leofs/issues/984#issuecomment-362363741) using trickle (to keep memory usage high for a long time) on any Linux box with the OOM killer enabled. leo_gateway gets killed, at least on my dev box.

We did solve a similar problem with the large object write handler in 1.3.1 (#570)

It seems you fixed that problem at the time, great. However, the problem still remains (I also tried setting the number of leo_pod workers very low, but no luck).

At the same time, I will try to re-do the test to check for degradation in this aspect.

Great, please do it.

mocchira commented 6 years ago

Note: Test environment

windkit commented 6 years ago

I have confirmed the issue exists when the GW bandwidth is limited to 10 MB/s (at 100 MB/s it works as before).

windkit commented 6 years ago

While I am still checking the root cause, there is something that doesn't feel right in leo_large_object_put_handler: https://github.com/leo-project/leofs/blob/master/apps/leo_gateway/src/leo_large_object_put_handler.erl#L309

When all the leo_pod workers are checked out, processes wait 30 seconds before timing out. This seems to open up a way to consume a huge amount of resources.

(In general, I don't think waiting with a timeout in front of a work queue is a good idea; an immediate rejection is needed to limit resource usage.)

mocchira commented 6 years ago

@windkit Great catch. I'm sure this must be the culprit. To limit memory usage as we expect, we have to apply the restriction at the point where the large shared binary is first generated.

We should be able to achieve our goal if we use leo_pod to restrict the calls to cowboy_req:body. Could you spare some time for this?

mocchira commented 6 years ago

@windkit I discussed this with @yosukehara; your highest-priority task is #114, so please work on this task ONLY if it doesn't affect your higher-priority one.

mocchira commented 6 years ago

After the discussion with @yosukehara, we've decided to postpone this issue to 1.4.1 as it is not critical (there is a workaround: multipart upload with S3 mode enabled), so please don't spend time on this during the 1.4.0 dev-cycle.

Anyway, for the record, I'll add a comment below on how to solve this issue.

We should not use leo_pod to restrict memory usage, because the large binary is allocated in a process spawned by cowboy while it is freed (to be precise, becomes a GC target) in a process managed by leo_pod (that is, binary allocations to leo_pod workers have an N:1 relation). So leo_pod should be used ONLY for process pooling to handle parallel PUT workloads, and we should solve the memory usage problem with another mechanism. I think a simple semaphore-like construct or a simple counter (whose INC/DEC operations can be called from any process) would suit this problem better.
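
A minimal sketch of the counting idea, assuming an ETS-based counter (module and function names are made up for illustration; this is not the actual implementation): a caller would acquire before reading a body chunk and release once the chunk is no longer referenced.

-module(simple_throttle_sketch).
-export([init/1, acquire/1, release/1]).

-define(TAB, ?MODULE).
-define(KEY, in_flight_bytes).

%% Limit is the maximum number of bytes allowed to be in flight at once.
init(Limit) ->
    ets:new(?TAB, [named_table, public, set, {write_concurrency, true}]),
    ets:insert(?TAB, [{?KEY, 0}, {limit, Limit}]),
    ok.

%% INC before reading a body chunk; returns ok or {error, overload}.
acquire(Bytes) ->
    [{limit, Limit}] = ets:lookup(?TAB, limit),
    case ets:update_counter(?TAB, ?KEY, {2, Bytes}) of
        N when N =< Limit ->
            ok;
        _Over ->
            ets:update_counter(?TAB, ?KEY, {2, -Bytes}),
            {error, overload}
    end.

%% DEC once the chunk has been handed off and is no longer referenced.
release(Bytes) ->
    ets:update_counter(?TAB, ?KEY, {2, -Bytes}),
    ok.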

mocchira commented 6 years ago

So leo_pod should be used ONLY for process pooling to handle parallel PUT workloads, and we should solve the memory usage problem with another mechanism. I think a simple semaphore-like construct or a simple counter (whose INC/DEC operations can be called from any process) would suit this problem better.

WIP: https://github.com/mocchira/leofs/commit/15513c47e158bbd93f15048053abba47ec07b02a

The new module, leo_throttle (real-time resource throttling), works as expected; however, memory consumption still goes over the limit. Maybe there are still some places where an explicit erlang:garbage_collect should be called.

mocchira commented 6 years ago

It turns out that simple web applications using only cowboy_req:body exhibit the same problem under concurrent connections uploading a large file as a single-part upload. That being said, there is nothing LeoFS can do for now to solve this problem. Since our home-grown cowboy was based on a very old version of the original, we will confirm whether this problem is fixed with the latest cowboy once https://github.com/leo-project/leofs/issues/1007 gets fixed.
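
For reference, a handler as simple as the one below reproduces the behaviour; it is written against the cowboy 2.x read_body API, so the function names differ from the old cowboy bundled with LeoFS.

-module(drain_body_handler).
-export([init/2]).

%% Reads and discards the request body in chunks, then replies 204.
%% Under many concurrent large single-part uploads, the per-connection
%% processes hold on to large binaries and node memory usage climbs.
init(Req0, State) ->
    Req1 = drain(Req0),
    Req2 = cowboy_req:reply(204, #{}, <<>>, Req1),
    {ok, Req2, State}.

drain(Req0) ->
    case cowboy_req:read_body(Req0, #{length => 5 * 1024 * 1024}) of
        {more, _Chunk, Req} -> drain(Req);
        {ok, _Chunk, Req} -> Req
    end.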