twitter / fatcache

Memcache on SSD
Apache License 2.0
1.3k stars, 178 forks

Using fatcache #12

Open annief opened 10 years ago

annief commented 10 years ago

Hi, is fatcache currently in production use? I'm interested in contributing to optimizing memcache for SSDs, and would like to start with performance analysis to expose optimization possibilities. Should I start with fatcache or is there another deployment you would recommend? Thanks.

thinkingfish commented 10 years ago

We are currently not running fatcache in production. Given the work at hand (a data insight project), my guess is we will start looking at multiple cache solutions at the beginning of next year, and make a case for using fatcache on the long-tail portion as we gain more insight into the data.

annief commented 10 years ago

Hi Yue,

Thank you. I am an SSD architect at Intel. I’d like to determine the performance of memcache on our future SSDs. Would this be useful to fatcache? Perhaps a chance for collaboration?

Annie


manjuraj commented 10 years ago

@annief the current perf numbers with a commodity Intel 320 SSD (setup: https://github.com/twitter/fatcache/blob/master/notes/spec.md) can be found here: https://github.com/twitter/fatcache/blob/master/notes/performance.md

One design decision in fatcache that limits how much it gets out of the SSD is the fact that IO to disk is synchronous. Given that fatcache is single threaded and uses asynchronous IO for the network, the synchronous IO to the SSD limits its performance (throughput, to be more precise). I believe we can make fatcache's disk IO async by combining epoll with the Linux AIO framework -- https://code.google.com/p/kernel/wiki/AIOUserGuide. If we can fix this, fatcache would become a viable alternative to in-memory caches.
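
To illustrate the suggestion, here is a minimal sketch (assumptions: libaio is installed, and the file path and 4KB sizing are made up for the example; this is not fatcache's actual code) of how the Linux AIO syscalls can share one epoll loop with network IO by signaling completions through an eventfd:

```c
#define _GNU_SOURCE          /* for O_DIRECT */
#include <libaio.h>          /* link with -laio */
#include <sys/eventfd.h>
#include <sys/epoll.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    io_context_t ctx = 0;
    if (io_setup(128, &ctx) < 0) { perror("io_setup"); return 1; }

    /* AIO completions are signaled on this eventfd, which sits in the
     * same epoll set that the network sockets would be registered in. */
    int efd = eventfd(0, EFD_NONBLOCK);
    int epfd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = efd };
    epoll_ctl(epfd, EPOLL_CTL_ADD, efd, &ev);

    /* O_DIRECT requires aligned buffers; 4KB matches an SSD page. */
    int fd = open("/tmp/slab", O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }
    void *buf;
    posix_memalign(&buf, 4096, 4096);

    struct iocb cb, *cbs[1] = { &cb };
    io_prep_pread(&cb, fd, buf, 4096, 0);  /* 4KB read at offset 0 */
    io_set_eventfd(&cb, efd);              /* completion tickles efd */
    io_submit(ctx, 1, cbs);                /* returns without blocking */

    /* The single-threaded event loop: disk completions and network
     * readiness both arrive through the same epoll_wait(). */
    struct epoll_event out;
    epoll_wait(epfd, &out, 1, -1);

    uint64_t ndone;
    read(efd, &ndone, sizeof(ndone));      /* number of completed AIOs */
    struct io_event done[1];
    io_getevents(ctx, 1, 1, done, NULL);   /* reap; won't block now */
    printf("reaped %llu completion(s), res=%lld\n",
           (unsigned long long)ndone, (long long)done[0].res);

    io_destroy(ctx);
    return 0;
}
```

The appeal is that the one thread keeps issuing reads and writes without ever blocking on the SSD, which is exactly what drives up the device queue depth discussed below.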

thinkingfish commented 10 years ago

We have a pull request adding async support to fatcache and I should look at it real soon.

A collaboration would be fantastic.

annief commented 10 years ago

I see - being both single threaded and synchronous can be a limit. I looked at the published performance numbers, and it seems that currently you are at the SSD limit, not quite at the sync IO limit yet.

Just so you know, I am very new to fatcache. I appreciate you helping me understand fatcache better.

  1. Would making fatcache multithreaded (vs. async) be a better longer-term scaling strategy?
  2. Is fatcache an extension of twemcache or memcache with SSD considerations? Does it not inherit memcache’s existing multithreaded framework?
  3. What in fatcache inherently requires serialization? I assume that’s why you chose a single-threaded implementation?

Would it be useful if I did a performance analysis to determine which performance bottlenecks are worth fixing in fatcache?


manjuraj commented 10 years ago

Happy to help with any questions you have about fatcache and/or its architecture. Please feel free to ask anything.

Would making fatcache multithreaded (vs. async) be a better longer-term scaling strategy?

I'm betting on async disk IO. If you look at the perf numbers in performance.md, you will notice that the trick is in increasing the average queue size to get 100% utilization (see the %util iostat numbers). In order to do that I had to run 8 instances of fatcache on a single SSD. It would be nice to get 100% util with just a single fatcache instance.

Furthermore, as we evolve the fatcache architecture, it would be nice to incorporate the fact that a single fatcache instance can talk to multiple SSDs at the same time (think of LVM), which would essentially allow us to use commodity SSDs that have limited parallelism (queue length).

Is fatcache an extension of twemcache or memcache with SSD considerations? Does it not inherit memcache’s existing multithreaded framework?

No, it doesn't. It implements the memcache ASCII protocol (https://github.com/twitter/fatcache/blob/master/notes/memcache.txt). But instead of being multithreaded, its architecture is single threaded, a model popularized by key-value stores like Redis.

Single threaded makes sense because you can always run multiple instances of fatcache on one machine, or on multiple machines, to scale horizontally, and have some kind of sharding layer on the client to route traffic to one of the fatcache instances.
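
As a hypothetical illustration of such a client-side sharding layer (fatcache itself ships no client; the addresses below are made up, and a production setup would more likely use consistent hashing, e.g. via twemproxy, than plain modulo):

```c
/* Hypothetical client-side sharding: hash the key, pick an instance.
 * FNV-1a is used here for brevity; consistent hashing (e.g. ketama)
 * is preferable in practice, since adding or removing an instance
 * then only remaps a fraction of the keyspace.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static const char *instances[] = {   /* made-up addresses */
    "10.0.0.1:11211", "10.0.0.2:11211", "10.0.0.3:11211",
};

static uint32_t fnv1a(const char *key, size_t len)
{
    uint32_t h = 2166136261u;         /* FNV offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= (uint8_t)key[i];
        h *= 16777619u;               /* FNV prime */
    }
    return h;
}

static const char *route(const char *key)
{
    size_t n = sizeof(instances) / sizeof(instances[0]);
    return instances[fnv1a(key, strlen(key)) % n];
}

int main(void)
{
    printf("key 'user:42' -> %s\n", route("user:42"));
    return 0;
}
```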

What in fatcache inherently requires serialization? I assume that’s why you chose a single-threaded implementation?

Single threaded makes reasoning simple :) There is nothing in fatcache that is CPU intensive enough to warrant multithreading. Most of the work in fatcache is handling network IO and disk IO. Currently network IO is async, but disk IO is not, and sync disk IO is limiting fatcache's performance.

Would it be useful if I did a performance analysis to determine which performance bottlenecks are worth fixing in fatcache?

absolutely!

annief commented 10 years ago

I understand what you mean now. You are counting on multiple instances (processes) to scale, not necessarily multiple threads. Makes good sense.

Are the default settings optimal for a fatcache/memcache deployment? I'm looking for help from folks who run or oversee a memcache deployment to get the right settings. E.g., do we set a higher MTU if the median value size is larger than the 1500-byte TCP payload? I.e., for the common size that a specific memcache pool sees, would we expect a get/set to be split across multiple TCP packets?

Thanks.


thinkingfish commented 10 years ago

Do we set a higher MTU if the median value size is larger than the 1500-byte TCP payload? I.e., for the common size that a specific memcache pool sees, would we expect a get/set to be split across multiple TCP packets?

We do use jumbo frames when possible. A single get/set can be split across multiple packets, but for most workloads we observe, simple key/value requests are usually smaller than one MTU, even without the jumbo frame config.

archier commented 10 years ago

This is a great discussion.

I have a couple of questions:

Are there any other performance numbers available? I'm curious about how things work out when storing larger values.

Would it make sense to use several SSDs in a box instead of one?

annief, have you been testing this on your newer hardware?

Thanks!

annief commented 10 years ago

Hi Archier,

I am behind on what I said I would do. No performance numbers yet. My apologies. I will get back into that mode.

Per your question on size, the way I think about it is: assuming that the 4KB page size maps directly to an SSD block, storing larger values should scale accordingly.

Per your question on the # of SSDs: if your goal is to achieve the maximum possible throughput, more SSDs are better. However, I believe a better but more difficult goal is to design the best cost/performance memcache/fatcache node. If that's the case, we may need more SSDs for capacity or bandwidth, as bounded by the cost of the system. We need to turn to folks who run a production memcache workload to help answer:

  1. What is the dataset size of a typical workload?
  2. What is the total networking BW used in a typical memcache deployment? We only need to match the SSD's BW to the NIC's BW.
  3. What reduction in TCO is possible when we replace DRAM with SSDs?

Can someone who has experience with production memcache help guide us on these questions?

Thanks!

archier commented 10 years ago

Hi Annie, looks like you closed the issue?

We run memcache in production, get in touch.

annief commented 10 years ago

Hi Archie,

I clicked the wrong button! Let me reopen the issue.

What is a good next step to incorporate your workload insights?


archier commented 10 years ago

Hi Annie, just drop me a message -- my email's on my GitHub page.


manjuraj commented 10 years ago

Page 2 of this deck has a few numbers -- https://speakerdeck.com/manj/caching-at-twitter-with-twemcache-twitter-open-source-summit; it is a year old, though.

@thinkingfish can possibly answer your questions

thinkingfish commented 10 years ago

Are there any other performance numbers available? I'm curious about how things work out when storing larger values.

Regarding latencies, here is a profile of Twemcache by Rao Fu: https://github.com/twitter/twemcache/wiki/Impact-of-Lock-Contention

When it comes to throughput, because of the existence of multiget and the lock pushdown in later memcached trunk, the numbers will differ between memcached and Twemcache. However, to give you a ballpark estimate, on an up-to-date datacenter frontend server it should be fairly easy to get 100k qps on all commands with Twemcache.

Would it make sense to use several SSDs in a box instead of one?

This depends on the bottleneck. The overly simplified answer is "yes". If a single Gigalink is used, there is a chance that the network proves to be the bottleneck before the disk, CPU, or memory. In the case of 10G NICs, as long as we have enough memory for indexing, the load on the CPU is generally low, so we can support a higher request rate with a better hit rate by having more data available on the same box. Some issues like logistics within the DC (at what point do multiple SSDs start taking more space? etc.) will affect the decision too, though I'm not clear about how.

-Yao (@thinkingfish)


thinkingfish commented 10 years ago

  1. What is the dataset size of a typical workload?

There is no single "typical workload", but a few rules hold for most datasets:

a. Small objects (smaller than a single MTU) dominate in number in many cases, partly because the use of cache for counters, flags, and soft locks prevails, and partly because sending very large objects over the network hurts latencies and stresses the network backbone, and hence is usually mitigated by local caching.

b. Wide range of sizes. Even though we established (a), it is also true that we have seen data sizes from a single integer to 1MiB (the default slab size). Even though a particular workload doesn't cover the whole spectrum, it nonetheless usually covers several slab classes if you go with the default exponential growth with f=1.4, and max/min is often more than an order of magnitude. This is especially true if multitenancy is adopted.

c. In terms of total data cached, it is very application dependent, from a few gigs to tens of terabytes. A more important parameter would be the "request to size" ratio, because it tells you whether CPU/bandwidth or memory/storage is likely to be the bottleneck. So far, the observation is that we generally fall on the side of allocating for data size (in contrast to for throughput), indicating that switching to a cheaper storage medium will save money.
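
To make the slab-class point in (b) concrete, a quick sketch (the 88-byte starting chunk is an assumed value for illustration, not a documented default; the 1MiB cap and f=1.4 are from the comment above):

```c
#include <stdio.h>

int main(void)
{
    double chunk = 88.0;            /* assumed smallest chunk size */
    const double f = 1.4;           /* growth factor from the comment */
    const double slab = 1048576.0;  /* 1MiB default slab size */

    /* Chunk sizes grow geometrically, so ~28 classes span 88B..1MiB;
     * any workload whose max/min size ratio exceeds 1.4 already
     * straddles more than one class. */
    for (int c = 0; chunk <= slab; c++, chunk *= f)
        printf("class %2d: chunk <= %7.0f bytes\n", c, chunk);
    return 0;
}
```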

  2. What is the total networking BW used in a typical memcache deployment? We only need to match the SSD's BW to the NIC's BW.

1G is typical today; we will migrate to 10G at some point (no concrete date of adoption). A 10G NIC can be requested, though, if it proves necessary/economical overall.

  3. What reduction in TCO is possible when we replace DRAM with SSDs?

This will be workload dependent (see my answer in 1c). The RAM:SSD price ratio is over 10, but that's not the whole cost of running a server. If we look at the whole system, very roughly, storing a dataset of 240GB will probably require 2 hosts with 128GB of memory each (limited by the cost-effective per-DIMM capacity) for RAM, versus one host with 32GB of memory and a 250GB disk for SSD. Assuming the SSD upgrade (from magnetic disk) and the memory expansion (from 32GB to 128GB) are comparable in cost, we save 50% of the cost for workloads that have a low "request to size" ratio.
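
Back-of-envelope, the 50% comes from halving the host count for the same 240GB dataset (assuming, as stated, roughly comparable per-host costs):

```latex
\underbrace{\lceil 240/128 \rceil = 2\ \text{hosts}}_{\text{RAM option}}
\quad\text{vs.}\quad
\underbrace{\lceil 240/250 \rceil = 1\ \text{host}}_{\text{SSD option}}
\;\Rightarrow\; \text{roughly } 50\%\ \text{savings}
```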

-Yao (@thinkingfish)

annief commented 10 years ago

Hi,

  1. When you note smaller than MTU, are we talking about a ~1500B MTU?
  2. I like the notion of the request-to-size ratio. Do you mean, more specifically, a requests/s-to-size ratio?

a. The challenge is knowing the size. Is the size obtained via trial and error to meet a given cache hit ratio? Thanks.


thinkingfish commented 10 years ago

  1. Yes, 1500B commonly. This line is somewhat arbitrary when it comes to storage, but it simplifies the reasoning about performance, as cache latencies are usually networking heavy.
  2. For now, yes. Usually the user comes up with an estimate of how quickly data is generated and the total dataset size (as in permanent storage), and decides on a baseline. Then we iterate based on the observed hit rate. We are moving toward a more informed procedure of observing live traffic and projecting the capacity needed for a given hit rate based on those numbers, so we can get closer to the ideal allocation in fewer iterations (ideally, the second time should get it right).

-Yao (@thinkingfish)


annief commented 10 years ago

In your setup, what request/s-to-size ratio would motivate moving to higher-capacity, lower-cost SSDs?


thinkingfish commented 10 years ago

Hmmm, there are many variables it depends on. On the service side: query rate and key/value size. On the HW side: # of cores, network bandwidth, and available memory.

Let's fix the key/value size first. Assume a 100-byte key+value and 100k qps single-core (HW thread) performance. Now assuming a baseline platform (the lowest model one can get in a DC) of 8 cores and a 1 Gigalink, the total throughput would be 800k qps and 80MB/s, and the service is CPU (locking) bound. A 32GB heap (the amount of memory coming with the baseline platform; actual usable capacity will be smaller) sets the request-to-size ratio at .25% per second (the throughput-to-dataset-size ratio is .25%). Dataset size is decided independently of the request rate: let's assume hit rate doesn't affect users' cache behavior in a significant way, so we set a target hit rate and decide the working dataset size S based on the observed traffic pattern. Then we can say that if S pushes the request-to-size ratio below .25% (e.g. we need a 64GB dataset to achieve a 90% target hit rate), we should go with SSD.
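
Spelling out that arithmetic (all inputs are the assumptions just stated):

```latex
8\ \text{cores} \times 100\text{k qps/core} = 800\text{k qps}, \qquad
800\text{k qps} \times 100\,\text{B} = 80\,\text{MB/s}, \qquad
\frac{80\,\text{MB/s}}{32\,\text{GB}} = 0.25\%\ \text{of the dataset per second}
```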

Now we can vary the key/value size. If we go down in size, the bottleneck remains the same and we use less storage, so the baseline model serves it as-is. If we go up in size (to beyond ~150 bytes), the bottleneck shifts to the NIC, and we are forced to partition the dataset further to accommodate the traffic, which means we have a lower request-to-size cutoff for when to switch to SSD (switching when below the cutoff).
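
The ~150-byte figure is where the baseline 1 Gigalink saturates at that request rate:

```latex
1\,\text{Gb/s} \approx 125\,\text{MB/s}, \qquad
\frac{125\,\text{MB/s}}{800\text{k qps}} \approx 156\,\text{B/request}
```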

At a high level, the incentive comes from datasets with a long tail (large datasets needed to meet a certain hit rate goal) and relatively little traffic hitting the dataset (making in-memory solutions expensive when judged by CPU/network resource utilization). Since long tails are prevalent in many workloads, we expect to start with the ones that will obviously benefit from the transition.

All this calculation would be better presented with a chart. I haven't had the time to think this through thoroughly yet, but if it's helpful I can plug in more numbers later.

-Yao (@thinkingfish)
