minio / sidekick

High Performance HTTP Sidecar Load Balancer
GNU Affero General Public License v3.0

[faq] question about the sidekick cache feature, is it a distributed client-side cache? #35

Closed: gwnet closed this issue 4 years ago

gwnet commented 4 years ago

One question to clarify how the sidekick MinIO cache works. For example, there are two clients: ClientA sets up a sidekick MinIO cache in front of the remote MinIO server, and ClientB sets up another sidekick MinIO cache in front of the same remote MinIO server. How are the caches replicated between ClientA and ClientB?

Can you clarify?


harshavardhana commented 4 years ago

> ClientA sets up a sidekick MinIO cache in front of the remote MinIO server, and ClientB sets up another sidekick MinIO cache in front of the same remote MinIO server. How are the caches replicated between ClientA and ClientB?

Caches are not replicated; the cache is a centralized cache shared between clients.
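
In practice that means every sidekick instance is configured with the same cache endpoint, so there is nothing to replicate between clients. A minimal sketch, assuming the `SIDEKICK_CACHE_*` environment variables documented in the README of that era (names and values are illustrative and may differ by sidekick version):

```sh
# Identical settings on ClientA and ClientB: both sidekicks share one cache cluster.
# "cache-minio" is a placeholder host name for the dedicated cache MinIO deployment.
export SIDEKICK_CACHE_ENDPOINT="http://cache-minio:9000"
export SIDEKICK_CACHE_ACCESS_KEY="minio"
export SIDEKICK_CACHE_SECRET_KEY="minio123"
export SIDEKICK_CACHE_BUCKET="shared-cache"
```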

gwnet commented 4 years ago

@harshavardhana can you clarify in a little more detail? From the project introduction, the diagram shows the client app and sidekick deployed on the client machine, which saves the network overhead between the client app and the MinIO cache inside sidekick; that is my understanding of the diagram. But if it is a centralized shared cache, then it is not a client-side cache but a server-side cache, and all clients would need to access the cache over the network.

gwnet commented 4 years ago

@harshavardhana the project main page mentions that a sidekick is deployed on each client. For example, if ClientA modifies ObjectA, how does ClientB get notified that it needs to invalidate its cache?

harshavardhana commented 4 years ago

> @harshavardhana the project main page mentions that a sidekick is deployed on each client. For example, if ClientA modifies ObjectA, how does ClientB get notified that it needs to invalidate its cache?

The cache is centralized @gwnet, invalidation happens automatically.

Clients are not caching things independently.

harshavardhana commented 4 years ago

> @harshavardhana can you clarify in a little more detail? From the project introduction, the diagram shows the client app and sidekick deployed on the client machine, which saves the network overhead between the client app and the MinIO cache inside sidekick; that is my understanding of the diagram. But if it is a centralized shared cache, then it is not a client-side cache but a server-side cache, and all clients would need to access the cache over the network.

It is never described as a client-side cache; it is a shared cache.

gwnet commented 4 years ago

@harshavardhana thank you so much. I think I get it now. The cache MinIO server is deployed alongside the client app cluster, for example the Spark cluster, in a distributed fashion across the Spark nodes. So when a Spark worker reads from the cache MinIO server via sidekick, the MinIO server may fetch the contents from other nodes inside the cache MinIO cluster and then reply to Spark, so this cannot save all of the network overhead. The cache MinIO server is deployed distributed, and the remote MinIO server is distributed too. Is this correct? So sidekick needs to be deployed on each node of the Spark cluster, and sidekick will always send requests to the cache MinIO server on its own localhost, right?

harshavardhana commented 4 years ago

> @harshavardhana thank you so much. I think I get it now. The cache MinIO server is deployed alongside the client app cluster, for example the Spark cluster, in a distributed fashion across the Spark nodes. So when a Spark worker reads from the cache MinIO server via sidekick, the MinIO server may fetch the contents from other nodes inside the cache MinIO cluster and then reply to Spark, so this cannot save all of the network overhead. The cache MinIO server is deployed distributed, and the remote MinIO server is distributed too. Is this correct? So sidekick needs to be deployed on each node of the Spark cluster, and sidekick will always send requests to the cache MinIO server on its own localhost, right?

The cache server is different from the one you are using for your actual data @gwnet. The cache server is a higher-performance server, perhaps backed by something like Optane SSDs, which can perform high-speed reads/writes.

You shouldn't re-purpose your existing distributed MinIO cluster to cache its own content again using sidekick; that is not an ideal architectural choice and wouldn't give you the performance gain you would get from caching.

If you do not have fast Optane-like SSDs, it is not worth using caching; the MinIO distributed cluster will deliver the necessary performance for the hardware that you have, and sidekick will efficiently load balance the incoming requests.
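
In the no-cache case sidekick is only the load balancer; a minimal sketch of that invocation in the README's style (host names are placeholders, and the flags should be checked against your sidekick version):

```sh
# Load balance S3 requests across a 4-node MinIO data cluster; no cache layer involved.
sidekick --health-path=/minio/health/ready --address :8080 http://minio{1...4}:9000
```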

gwnet commented 4 years ago

@harshavardhana these comments confuse me. What is the purpose of sidekick on the distributed cache MinIO server? From your main page I understood the cache to be inside sidekick. Could you please help me clarify?

> You shouldn't re-purpose your existing distributed MinIO cluster to cache its own content again using sidekick; that is not an ideal architectural choice and wouldn't give you the performance gain you would get from caching.

I do have Optane and many QLC drives, and I plan to use the Optane and QLC for the cache MinIO server. What would the deployment and IO path be? I also want the remote MinIO server on HDDs as tier-2 storage. Can you give me a detailed IO lifecycle between the client, the HTTP cache layer, the MinIO cache, sidekick, and the remote MinIO?

harshavardhana commented 4 years ago

> @harshavardhana these comments confuse me. What is the purpose of sidekick on the distributed cache MinIO server? From your main page I understood the cache to be inside sidekick. Could you please help me clarify?

The cache is not inside sidekick; sidekick uses an S3 backend as a shared cache. This S3 backend, preferably MinIO, runs on Optane SSDs. Sidekick is just a smart load balancer in front of your actual large-scale data cluster, to be used as a sidecar application alongside the application. For example, the Spark examples provided in the README explain this.
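
A rough sketch of that sidecar layout, loosely modelled on the README's Spark example; the `SIDEKICK_CACHE_*` variables and the `fs.s3a.endpoint` setting are assumptions to verify against your sidekick and Spark versions, and the host names are placeholders:

```sh
# On every Spark worker node: sidekick runs as a sidecar, load balancing to the
# data cluster while consulting a separate shared cache MinIO on fast SSDs.
export SIDEKICK_CACHE_ENDPOINT="http://cache-minio:9000"   # dedicated cache cluster (placeholder)
export SIDEKICK_CACHE_ACCESS_KEY="minio"
export SIDEKICK_CACHE_SECRET_KEY="minio123"
export SIDEKICK_CACHE_BUCKET="spark-cache"
sidekick --health-path=/minio/health/ready --address :8080 http://data-minio{1...16}:9000

# The application then talks to its local sidekick, e.g. in Spark:
#   spark.hadoop.fs.s3a.endpoint  http://127.0.0.1:8080
```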

> You shouldn't re-purpose your existing distributed MinIO cluster to cache its own content again using sidekick; that is not an ideal architectural choice and wouldn't give you the performance gain you would get from caching.

> I do have Optane and many QLC drives, and I plan to use the Optane and QLC for the cache MinIO server. What would the deployment and IO path be? I also want the remote MinIO server on HDDs as tier-2 storage. Can you give me a detailed IO lifecycle between the client, the HTTP cache layer, the MinIO cache, sidekick, and the remote MinIO?

For detailed architecture guidance we recommend a commercial engagement. Reach out to us for more hands-on guidance via our website: https://min.io/pricing

gwnet commented 4 years ago

Let us take an example. I have 4 nodes as a Spark cluster, called clusterSpark, and another 4 nodes as the remote MinIO server with HDDs, called clusterMinio:

  1. Install the MinIO server in distributed mode on clusterMinio; this is where the real data lives.
  2. Install the MinIO server in distributed mode on clusterSpark, using Optane + QLC; this is the cache MinIO server.
  3. Install sidekick on each node of clusterSpark and configure sidekick's cache to point to the MinIO server set up in step 2. To save network, each node's sidekick can be configured with the local IP address of the cache MinIO server. Is this correct? Then, when IO from Spark comes in:
  4. The IO goes into sidekick first.
  5. Sidekick will try to get it from the cache MinIO server deployed inside the Spark cluster.
  6. On a miss, sidekick passes the IO through its internal load balancer to the remote MinIO server. So the load balancing happens after the cache MinIO server; the cache MinIO server sits in front of the load balancer in the IO stack. Is this a correct understanding?

gwnet commented 4 years ago

@harshavardhana could you please comment on my understanding above?

harshavardhana commented 4 years ago

> Let us take an example. I have 4 nodes as a Spark cluster, called clusterSpark, and another 4 nodes as the remote MinIO server with HDDs, called clusterMinio:
>
>   1. Install the MinIO server in distributed mode on clusterMinio; this is where the real data lives.
>   2. Install the MinIO server in distributed mode on clusterSpark, using Optane + QLC; this is the cache MinIO server.
>   3. Install sidekick on each node of clusterSpark and configure sidekick's cache to point to the MinIO server set up in step 2. To save network, each node's sidekick can be configured with the local IP address of the cache MinIO server. Is this correct? Then, when IO from Spark comes in:
>   4. The IO goes into sidekick first.
>   5. Sidekick will try to get it from the cache MinIO server deployed inside the Spark cluster.
>   6. On a miss, sidekick passes the IO through its internal load balancer to the remote MinIO server. So the load balancing happens after the cache MinIO server; the cache MinIO server sits in front of the load balancer in the IO stack. Is this a correct understanding?

:+1:
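
Spelled out for the clusterSpark/clusterMinio example above, a hedged sketch: the `SIDEKICK_CACHE_*` variables are assumed from the README of that time, and the host names and ports are placeholders. Each clusterSpark node points its cache at the node-local cache MinIO and load balances misses to clusterMinio.

```sh
# Run on each clusterSpark node (steps 3-6 above): cache lookups hit the local
# Optane/QLC MinIO, misses are load balanced to the HDD-backed clusterMinio.
export SIDEKICK_CACHE_ENDPOINT="http://127.0.0.1:9001"     # node-local cache MinIO (placeholder port)
export SIDEKICK_CACHE_ACCESS_KEY="minio"
export SIDEKICK_CACHE_SECRET_KEY="minio123"
export SIDEKICK_CACHE_BUCKET="spark-cache"
sidekick --health-path=/minio/health/ready --address :8080 http://clusterminio{1...4}:9000
```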

gwnet commented 4 years ago

:) @harshavardhana thank you so much man!~

gwnet commented 4 years ago

@harshavardhana hello, the sidekick cache is a read cache only, right? All writes pass through to the backend directly? If so, for an app like Spark or machine learning that needs low-latency writes, how is that handled?

harshavardhana commented 4 years ago

> @harshavardhana hello, the sidekick cache is a read cache only, right? All writes pass through to the backend directly? If so, for an app like Spark or machine learning that needs low-latency writes, how is that handled?

Spark is more read-heavy than write-heavy @gwnet.