ocsigen / lwt

OCaml promises and concurrent I/O
https://ocsigen.org/lwt
MIT License

Improve performance in CPU-bound programs #622

Open aantron opened 5 years ago

aantron commented 5 years ago

Some CPU-bound Lwt programs call Lwt.pause () a lot, in order to yield to any potential I/O that may be pending.

However, doing this results in Lwt calling select/kevent/epoll at the rate that Lwt.pause () is called. This forces the Lwt user to worry about Lwt implementation details, and to think about how often they are calling Lwt.pause ().

We should probably have Lwt adapt automatically: check whether the last I/O poll actually resolved any promises and, if not, skip calls to select, etc., on future scheduler iterations, with some kind of limited exponential backoff or a similar scheme.
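A minimal sketch of what such a heuristic could look like (names are illustrative, not part of Lwt's API or internals):

```ocaml
(* Hypothetical backoff heuristic. While polls keep resolving nothing,
   skip exponentially more scheduler iterations between polls, up to a
   cap; a single useful poll resets the backoff. *)
let max_skips = 64

let skip_budget = ref 0  (* current backoff level *)
let remaining = ref 0    (* iterations left before the next poll *)

let should_poll () =
  if !remaining > 0 then (decr remaining; false)
  else true

let after_poll ~resolved_any =
  if resolved_any then begin
    (* I/O is active again: go back to polling every iteration. *)
    skip_budget := 0;
    remaining := 0
  end
  else begin
    skip_budget := min max_skips (max 1 (!skip_budget * 2));
    remaining := !skip_budget
  end
```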

See https://discuss.ocaml.org/t/2567/5. This will also improve performance of CPU-bound Repromise programs.

cc @kennetpostigo

Lupus commented 4 years ago

We've also faced this problem. While testing fairness of large data stream processing with httpaf we observed that one stream hogs all of the processing power and all other streams just time out.

This is probably because data arrives at a high rate: the read loop keeps reading from the socket without ever blocking (and thus never invokes the event loop).

We just added Lwt_main.yield () between read loop steps, but that resulted in an insane rate of calls to epoll and, as a result, a tremendous slowdown.

We're thinking about yielding once per certain number of bytes read, but solving that at the application level looks a bit weird. This issue has the libuv milestone; does that mean that implementing some heuristic within current Lwt is not considered viable? Does an application-level workaround have any drawbacks compared to a heuristic within Lwt itself?

aantron commented 4 years ago

You may also find Lwt_unix.auto_yield useful, with current Lwt:

https://github.com/ocsigen/lwt/blob/336566dd63d7f948234a4f30919df329f25b0e3c/src/unix/lwt_unix.cppo.mli#L58-L65
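For illustration, a minimal sketch of how auto_yield might be dropped into a read loop (the loop, its names, and the 0.05 s interval are assumptions for this example, not code from Lwt or httpaf):

```ocaml
open Lwt.Infix

(* [maybe_yield ()] resolves immediately unless at least 0.05 s have
   elapsed since the last actual yield, in which case it yields. *)
let maybe_yield = Lwt_unix.auto_yield 0.05

(* A hypothetical read loop; [fd] and [buffer] are illustrative. *)
let rec read_loop fd buffer =
  maybe_yield () >>= fun () ->
  Lwt_unix.read fd buffer 0 (Bytes.length buffer) >>= fun n ->
  if n = 0 then Lwt.return_unit (* EOF *)
  else
    (* ... process the [n] bytes read ... *)
    read_loop fd buffer
```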

This is indeed best solved inside the scheduler. The only reason for the libuv milestone is that, until now, the only places where I had observed this issue that definitely required an in-library fix were related to some libuv work I was doing (in repromise_lwt and luv).

Do you have time to work on this in Lwt? If not, could you share your test/benchmark so I can use it to measure effects of various approaches, when I work on this (slightly later)?

Lupus commented 4 years ago

I'll try using auto_yield; it looks like it does not depend on any Lwt internals, so I can just embed it in my service directly. So far it looks like it's going to solve the issue with unfair streams.

As for the benchmark, we basically replicate the lwt_echo_post.ml example from httpaf in our service. The httpaf example itself should be sufficient to illustrate the issue. I recommend modifying it as below to avoid excessive buffering (see httpaf/139 for more context):

--- a/examples/lib/httpaf_examples.ml
+++ b/examples/lib/httpaf_examples.ml
@@ -39,8 +39,9 @@ module Server = struct
       let request_body  = Reqd.request_body reqd in
       let response_body = Reqd.respond_with_streaming reqd response in
       let rec on_read buffer ~off ~len =
-        Body.write_bigstring response_body buffer ~off ~len;
-        Body.schedule_read request_body ~on_eof ~on_read;
+        Body.schedule_bigstring response_body buffer ~off ~len;
+        Body.flush response_body (fun () ->
+        Body.schedule_read request_body ~on_eof ~on_read);
       and on_eof () =
         Body.close_writer response_body
       in

Clients are simple curl invocations like this:

dd if=/dev/zero bs=1M | curl -XPOST -T - http://127.0.0.1:8080/ -o /dev/null

You might need to add a header specifying a chunked-encoding response, but it should work without it as well.

I might find some time in the future to try implementing this in Lwt, depending on how well auto_yield mitigates the issue :)

Lupus commented 4 years ago

Looks like auto_yield 0.05 does not cut it so far... I'll try lower values, but it is already starting to hurt performance.

[kolkhovskiy@home ~]$ dd if=/dev/zero bs=1M | curl -XPOST -T - http://127.0.0.1:8080/echo -o /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 12.1G    0 6230M    0 6242M   115M   115M --:--:--  0:00:54 --:--:--  176M

[kolkhovskiy@home ~]$ dd if=/dev/zero bs=1M | curl -XPOST -T - http://127.0.0.1:8080/echo -o /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 9370M    0 4679M    0 4690M  75.6M  75.7M --:--:--  0:01:01 --:--:--  103M

[kolkhovskiy@home ~]$ dd if=/dev/zero bs=1M | curl -XPOST -T - http://127.0.0.1:8080/echo -o /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 26.4G    0 13.2G    0 13.2G   111M   111M --:--:--  0:02:01 --:--:-- 34.0M

[kolkhovskiy@home ~]$ dd if=/dev/zero bs=1M | curl -XPOST -T - http://127.0.0.1:8080/echo -o /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 45.8G    0 22.9G    0 22.9G   153M   153M --:--:--  0:02:32 --:--:-- 6608k

Lupus commented 4 years ago

But changing the yield interval to once every X bytes read/written works better! Yielding every megabyte gives nearly the same performance as no yields at all, and multiple streams share the bandwidth fairly.
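For reference, a minimal sketch of this byte-counting workaround (the helper and its names are illustrative, not an Lwt API):

```ocaml
(* Hypothetical helper: yield to the scheduler once per [threshold]
   bytes processed, instead of once per elapsed time interval. *)
let yield_every ~threshold =
  let counter = ref 0 in
  fun n_bytes ->
    counter := !counter + n_bytes;
    if !counter >= threshold then begin
      counter := 0;
      Lwt_main.yield ()
    end
    else Lwt.return_unit

(* Usage in a read loop: yield once per megabyte read. *)
let maybe_yield = yield_every ~threshold:(1024 * 1024)
```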

aantron commented 4 years ago

Ok that's good :)

aantron commented 4 years ago

@Lupus, I guess another library solution to your case would be to add a variant of yield that yields only to other callbacks (CPU) and not I/O. But before adding something like that, I would want to see if there is a generic solution that addresses all these cases, some variant of what I described in the main comment of this issue.
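Conceptually (this is only a sketch of the idea, not Lwt's actual scheduler internals), such a variant would queue its continuation on a ready list that the scheduler drains without entering select/epoll:

```ocaml
(* Conceptual sketch: a "CPU-only" yield queues its continuation on a
   ready list; a scheduler iteration would first drain this list and
   only then decide whether it needs to poll for I/O at all. *)
let ready : (unit -> unit) Queue.t = Queue.create ()

let yield_cpu () =
  let p, r = Lwt.wait () in
  Queue.push (fun () -> Lwt.wakeup r ()) ready;
  p

let drain_ready () =
  while not (Queue.is_empty ready) do
    (Queue.pop ready) ()
  done
```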

Lupus commented 4 years ago

> a variant of yield that yields only to other callbacks (CPU) and not I/O

Yeah, that should probably work as well. When all of your sockets always have data to read, you don't need an event loop iteration :) On the other hand, when a slow connection is active alongside a fast one, there won't be any fairness in this scenario: the fast one will hog the CPU if it uses the "yield only to other CPU guys" strategy.