ut-osa / nightcore

Nightcore: Efficient and Scalable Serverless Computing for Latency-Sensitive, Interactive Microservices [ASPLOS '21]
Apache License 2.0

Measuring invocation latency #2

Closed: mcopik closed this issue 3 years ago

mcopik commented 3 years ago

Hi!

I've been trying to execute simple functions with Nightcore. To that end, I followed the installation instructions, and I modified the example function to an even simpler one that just echoes its input:

int faas_func_call(void* worker_handle, const char* input, size_t input_length) {
    struct worker_context* context = (struct worker_context*)worker_handle;
    // Echo the input straight back as the function's output.
    context->append_output_fn(context->caller_context, input, input_length);
    return 0;
}
The function is registered with the following config:

[
    { "funcName": "Foo", "funcId": 1, "minWorkers": 1, "maxWorkers": 1 }
]

When deploying the controller and engine on one server and sending POST requests with curl from another server, I measure latencies around ~100ms. The TCP/IP RTT between the two machines is ~12us. When I increase the size of the input JSON, latency grows quickly: an invocation with a 1 kB message takes over a second. On the same cluster of machines, we've been able to deploy OpenWhisk and execute functions with a 1 kB payload at a latency of around ~100ms.
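
For reference, this is roughly the request we send (gateway host and port are placeholders for our deployment, "..." stands for the payload, and the function is exposed under /function/<name>):

curl -X POST -H "Content-Type: application/json" \
     -d '{"in": "..."}' \
     http://<gateway-host>:<gateway-port>/function/Foo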

Is there any other way to invoke Nightcore functions that doesn't involve such high latencies? Is there a potential problem in our deployment that might cause them? Is there documentation on optimizing deployments?

zhipeng-jia commented 3 years ago

Could you try adding "--v=1" to the command lines of the gateway, engine, and launcher? This enables verbose logging.

Verbose logging might provide some insight into what contributes to the 100ms latency.

mcopik commented 3 years ago

@zhipeng-jia Thanks for the swift reply!

(edited) I noticed that the default run_stack.sh script already had verbose logging enabled, and I also found a mistake in how we interpreted our results. After disabling logging, the latency of an invocation with just a few bytes of input dropped to ~60-90us (not milliseconds) - is that what you would expect? It looks correct and aligns with the RTT, given the close proximity of the servers. However, when sending an input JSON of ca. 1 kB, the latency is still around one second. This seems high compared to OpenWhisk (~100-200ms even on 100 kB input).

Here's the test code we use; is there anything in particular that might cause an unnecessary slowdown on your gateway?

#include <curl/curl.h>

#include <chrono>
#include <iostream>
#include <string>

static size_t WriteCallback(void *contents, size_t size, size_t nmemb, void *userp)
{
    // libcurl hands the response body to this callback in chunks;
    // append each chunk to the std::string passed via CURLOPT_WRITEDATA.
    ((std::string*)userp)->append((char*)contents, size * nmemb);
    return size * nmemb;
}

// part of test routine: curl, res, url, test, repetitions, and the
// base64_encode helper are defined elsewhere

  std::string readBuffer;
  // generate input of given size
  std::string out = base64_encode(test);

  curl = curl_easy_init();
  std::string post = "{\"in\": \"" + out + "\"}";
  int len;
  char *str = curl_easy_unescape(curl, url.c_str(), url.length(), &len);
  curl_easy_setopt(curl, CURLOPT_URL, str);

  struct curl_slist *list = NULL;
  list = curl_slist_append(list, "Content-Type: application/json");
  curl_easy_setopt(curl, CURLOPT_HTTPHEADER, list);
  curl_easy_setopt(curl, CURLOPT_POST, 1L);
  curl_easy_setopt(curl, CURLOPT_POSTFIELDS, post.c_str());
  curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, WriteCallback);
  curl_easy_setopt(curl, CURLOPT_WRITEDATA, &readBuffer);
  // the first iteration is a warm-up and is excluded from the reported timings
  for(int i = 0; i < repetitions + 1; ++i) {
    readBuffer.clear();
    auto start = std::chrono::high_resolution_clock::now();
    res = curl_easy_perform(curl);
    auto end = std::chrono::high_resolution_clock::now();
    if(i > 0)
      std::cout << std::chrono::duration_cast<std::chrono::microseconds>(end - start).count() / 1000.0 << '\n';
  }
  curl_free(str);                // free the string from curl_easy_unescape
  curl_slist_free_all(list);
  curl_easy_cleanup(curl);

Please find attached the logs for 10 executions with verbose logging (again, latency ~1s).

input_1.txt engine.log Foo_worker_0.stderr.log gateway.log launcher_foo.log

zhipeng-jia commented 3 years ago

Thanks for the log. It provides some useful information.

Looks like the problem is here:

I0523 14:00:55.934223 IO-0/EL  http_connection.cpp:283] HttpConnection[0]: Parse new HTTP header: Expect = 100-continue
I0523 14:00:56.935250 IO-0/EL  http_connection.cpp:254] HttpConnection[0]: Start parsing URL: /function/Foo

These two lines are 1 second apart. The first line marks the moment the last HTTP header is parsed, while the second marks the moment the full HTTP request has been received (see https://github.com/ut-osa/nightcore/blob/asplos-release/src/gateway/http_connection.cpp#L254). Once the full request is received, the gateway sends the function call to the engine.

I am not sure what consumes the 1 second here. I am also not familiar with curl's C API, but I do not see any obvious problems in the client code you show.

zhipeng-jia commented 3 years ago

After searching around, I suspect the "Expect: 100-continue" header (https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Expect) may be the cause. I did not implement the logic required to handle the Expect header, i.e. sending a response before receiving the HTTP body.

Here is what I found on how to disable it in libcurl: https://stackoverflow.com/questions/49670008/how-to-disable-expect-100-continue-in-libcurl
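
Applied to the header list in your test code, the workaround boils down to listing the header with an empty value (a sketch based on that answer; libcurl drops a default header when you append it with no value):

// Appending "Expect:" with an empty value tells libcurl to suppress the
// "Expect: 100-continue" header it adds by default for larger POST bodies,
// so the full request body is sent immediately.
struct curl_slist *list = NULL;
list = curl_slist_append(list, "Content-Type: application/json");
list = curl_slist_append(list, "Expect:");
curl_easy_setopt(curl, CURLOPT_HTTPHEADER, list);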

mcopik commented 3 years ago

@zhipeng-jia Thanks for your help! It works now: we get ~150us for 1 kB messages and ~20ms for 5 MB messages.

I'm closing the issue as this has been resolved.

zhipeng-jia commented 3 years ago

Good to know!

mcopik commented 3 years ago

@zhipeng-jia Hi! I have one more question - I'm trying to measure the speedups achieved by integrating Nightcore, and I want to dispatch between 8 and 32 concurrent invocations to the same instance.

Besides setting the minimum and maximum number of workers in the JSON configuration file, is there anything else I should pay attention to? I'm going to run C/C++ functions.

zhipeng-jia commented 3 years ago

If you're going to measure under a fixed concurrency, setting both minWorkers and maxWorkers to the desired concurrency will work.
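
For example, with your Foo function and a fixed concurrency of 16 (the value 16 is just an illustration), the config becomes:

[
    { "funcName": "Foo", "funcId": 1, "minWorkers": 16, "maxWorkers": 16 }
]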

mcopik commented 3 years ago

@zhipeng-jia Thanks so much! And just to verify: are POST requests the only way of invoking Nightcore functions externally? I'm asking because we have to transmit some floating-point numbers, and maybe there's a way to skip the float->string->float serialization.

zhipeng-jia commented 3 years ago

I have also implemented a gRPC transport, but it needs more setup and has not been tested extensively.

If HTTP does not cause real problems for you, I'd suggest using HTTP POST.
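
For the floats: the POST body is opaque to the gateway, so you can avoid the float->string->float round-trip over plain HTTP. A sketch, reusing the base64_encode helper from your test code above (assuming it takes a std::string):

#include <string>
#include <vector>

// Pack the floats as raw bytes and base64-encode the byte string for the
// JSON payload; the function worker base64-decodes and reinterprets the
// bytes as floats, so no float -> decimal string -> float conversion occurs.
std::string encode_floats(const std::vector<float>& values) {
    std::string raw(reinterpret_cast<const char*>(values.data()),
                    values.size() * sizeof(float));
    return "{\"in\": \"" + base64_encode(raw) + "\"}";
}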