princeton-sns / firecracker-tools


Workload generator, measurement infrastructure and utilization plotting #22

Closed LedgeDash closed 5 years ago

LedgeDash commented 5 years ago

workload generator (updated)

Generate workload by specifying

  1. average inter-arrival time (mu) in ms
  2. spike window (expressed in start time and end time)

for each function in a workload characteristic YAML file.

Call it with: python3 generator.py <workload.yaml> <request.json>. <workload.yaml> is the input; <request.json> is the output request file.

Here's an example example_workload.yaml

- name: "loremjs"
  mu: 2 # average inter-arrival time in ms. 1000/mu=arrival rate per sec
  start_time: 0 # ms
  end_time: 5000 # ms                                                   
- name: "lorempy2"
  mu: 2
  start_time: 5000
  end_time: 10000

This YAML generates a workload with 2 functions, loremjs and lorempy2. loremjs's spike begins at timestamp=0ms and ends at timestamp=5000ms; lorempy2's spike begins at timestamp=5000ms and ends at timestamp=10000ms. During non-spike periods (that is, [5000, 10000] for loremjs and [0, 5000] for lorempy2), functions have a default average inter-arrival time of 1000ms (i.e., an average of 1 req/sec), which is currently hard-coded.
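The scheme above can be sketched in a few lines. generator.py itself isn't shown in this PR description, so this is only a guess at its shape, assuming inter-arrival gaps are drawn from an exponential distribution (i.e., Poisson arrivals) with mean mu inside the spike window and the hard-coded 1000 ms mean outside it:

```python
import random

# Assumed default: 1000 ms mean inter-arrival time outside the spike window.
DEFAULT_MU_MS = 1000.0

def generate_arrivals(name, mu, start_time, end_time, horizon_ms):
    """Return sorted (timestamp_ms, function_name) arrivals for one function.

    Gaps are exponentially distributed with mean `mu` ms while the current
    time falls inside [start_time, end_time), and DEFAULT_MU_MS otherwise.
    """
    arrivals = []
    t = 0.0
    while t < horizon_ms:
        mean = mu if start_time <= t < end_time else DEFAULT_MU_MS
        t += random.expovariate(1.0 / mean)  # exponential gap with that mean
        if t < horizon_ms:
            arrivals.append((t, name))
    return arrivals
```

The real generator would run this per entry in the YAML, merge and sort the arrivals across functions, and serialize them into <request.json>.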

If you want a function to have multiple spikes, you would for example do this:

- name: "loremjs"
  mu: 2 # average inter-arrival time in ms. 1000/mu=arrival rate per sec
  start_time: 0 # ms
  end_time: 5000 # ms                                                   
- name: "lorempy2"
  mu: 2
  start_time: 5000
  end_time: 10000
- name: "loremjs"
  mu: 2 
  start_time: 7000 # ms
  end_time: 15000 # ms 

It is desirable to list functions in chronological order, so that the workload YAML file essentially outlines the timeline of the workload.

Time measurement

In order to measure utilization over time (think of a plot of utilization where the x-axis is time), I added code in the controller that outputs timestamps when certain events happen. These events are: VM boot start (calling the .run() function on VmAppConfig), the controller receiving the tty ready message from a VM, request sent, response received, VM eviction start, and VM eviction finished.

All results are output as a JSON string to a ./measurement/measurement-<start_time>-<end_time>.json file, where <start_time> is the timestamp of experiment start (i.e., right before the first request is scheduled) and <end_time> is the timestamp of experiment finish.

In the json string, you will see something like this:

{
  "boot timestamps": {
    "10": [
      14283896112730,
      14284027647863
    ],
    "11": [
      14283907501412,
      14283983962431
    ],
    ...
  },
  "eviction timestamps": {
    "7": [
      14283953935355,
      14283991563487
    ]
  },
  "request/response timestamps": {
    "10": [
      14283897402698,
      14284037372778,
      14284042663022,
      14284045126842,
      14284052737579,
      14284055072850
    ],
    ...
  }
}

(The ... just means there are more elements, omitted to save space.) For example, the "request/response timestamps" object holds the timestamps of all requests and responses to and from all VMs. "10" is a VM id; the 6 numbers are 3 sequential pairs of (request_send_timestamp, response_received_timestamp).
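Given that pairing convention, per-request latencies fall out directly. A minimal sketch of reading one measurement file (load_latencies is a hypothetical helper, not part of the PR; the field name matches the example above):

```python
import json

def load_latencies(path):
    """Return {vm_id: [latency, ...]} from a measurement JSON file.

    Entries in each VM's list alternate request-sent / response-received,
    so pairing even-indexed with odd-indexed entries recovers the pairs.
    """
    with open(path) as f:
        data = json.load(f)
    latencies = {}
    for vm_id, stamps in data["request/response timestamps"].items():
        pairs = zip(stamps[0::2], stamps[1::2])
        latencies[vm_id] = [resp - req for req, resp in pairs]
    return latencies
```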

With all this information, we know exactly what happened throughout the entire duration of a workload. The plot.py script takes this information to calculate and plot utilization.
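One plausible core of that calculation (plot.py itself isn't shown, so this is a sketch, not its actual code): treat a VM as live from its tty-ready timestamp until its eviction starts, then sweep the event boundaries to get a step curve of utilization. NUM_CORES is a placeholder for the current core-count-based capacity:

```python
NUM_CORES = 8  # placeholder capacity; the real script may derive this differently

def utilization_curve(boot_ts, evict_ts, end_time):
    """Return [(timestamp, utilization), ...] step points.

    boot_ts:  {vm_id: (boot_start, tty_ready)}
    evict_ts: {vm_id: (eviction_start, eviction_finished)}
    A VM counts as live from tty_ready until eviction starts (or end_time).
    """
    events = []
    for vm_id, (_boot_start, tty_ready) in boot_ts.items():
        live_to = evict_ts.get(vm_id, (end_time,))[0]
        events.append((tty_ready, +1))
        events.append((live_to, -1))
    events.sort()
    curve, live = [], 0
    for t, delta in events:
        live += delta
        curve.append((t, live / NUM_CORES))
    return curve
```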

Next steps (these need the cgroup CPU share feature completed first):

  1. plot.py needs to also take in the function config YAML file to know each function's memory requirement
  2. the utilization calculation needs to be based on total memory size, not core count
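The planned memory-based variant could look like the sketch below. All names here (TOTAL_MEMORY_MB, memory_utilization, the argument shapes) are hypothetical, since this step is future work: each live VM contributes its function's memory requirement from the function config YAML rather than counting as one core.

```python
TOTAL_MEMORY_MB = 4096  # placeholder for the machine's memory budget

def memory_utilization(live_vms, vm_function, function_mem_mb):
    """Fraction of total memory held by the currently live VMs.

    live_vms:        iterable of VM ids live at this instant
    vm_function:     {vm_id: function_name}
    function_mem_mb: {function_name: memory requirement in MB}
    """
    used = sum(function_mem_mb[vm_function[vm]] for vm in live_vms)
    return used / TOTAL_MEMORY_MB
```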
alevy commented 5 years ago

A lot of the changes in this PR are now unrelated to the title, so it's hard to review. I have some substantive comments about the workload generator, and I will mark up the rest as I see things, but I think we should merge this ASAP.

LedgeDash commented 5 years ago

Yes, I think it'd be great to merge this ASAP. I've updated the title since the PR content has changed quite significantly. Nevertheless, the workload generator code is separate enough from the measurement and plotting code that we can treat this PR as 2 PRs combined.

I'm updating the PR description (the first comment) to include my new changes.

LedgeDash commented 5 years ago

I've updated the workload generator to incorporate @alevy 's comments. Inspired by @alevy 's insight that we could get rid of num_invocations altogether, I changed the interface to specify a spike window for each function. Details are in my updated PR description (first comment). I believe this makes the generator more flexible, and it nicely turns the workload YAML file into a chronological outline of a workload, which is more intuitive and better suited to our experiments.

I also got rid of arrival rates, as they were a source of confusion, and switched directly to mu, the average inter-arrival time. All time values are now in ms.