uber / RemoteShuffleService

Remote shuffle service for Apache Spark to store shuffle data on remote servers.

How to evaluate rss cluster size? #92

Open Lobo2008 opened 1 year ago

Lobo2008 commented 1 year ago

Hi, we have about 20,000 daily Spark applications. Together these apps produce about 100 TB of shuffle write/read data per day. A peak app produces 6 TB, but most produce less than 100 GB.

We ran some of our production Spark apps on a test RSS cluster and found that RSS nodes consume noticeably more memory than CPU/disk/etc., which even caused some RSS nodes to break down.

Now we would like to run all apps on RSS. Any suggestions on RSS cluster size and machine selection with replicas=2? E.g. memory, CPU, number of nodes, disks, memory-to-CPU ratio, etc. Any suggestion will help.

One more question: mappers send data to the StreamServer, the StreamServer stores it in memory and then flushes it to disk, with little computation involved, so RSS consumes far more memory than CPU. Is that right?

hiboyang commented 1 year ago

Hi @Lobo2008, how much memory usage do you see on the RSS server? The RSS server receives shuffle data in small blocks from Spark mappers and writes each block to disk. It does not cache a large amount of data in memory, so I am curious how much memory you see the RSS server using.
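To make the point above concrete, here is a minimal, purely illustrative sketch (not the actual RSS/StreamServer code, and all names are hypothetical): a writer that appends each small incoming block straight to a per-partition file, so peak memory is bounded by the size of one block rather than the total shuffle volume.

```python
# Illustrative sketch only; hypothetical stand-in for the RSS write path,
# not the real StreamServer implementation.
import os
import tempfile


class BlockWriter:
    """Appends small shuffle blocks to per-partition files on disk."""

    def __init__(self, root):
        self.root = root
        self.bytes_written = 0

    def write_block(self, partition_id, block):
        # Only the current `block` is ever held in memory; it is
        # appended to disk immediately instead of being cached.
        path = os.path.join(self.root, f"partition-{partition_id}.data")
        with open(path, "ab") as f:
            f.write(block)
        self.bytes_written += len(block)


root = tempfile.mkdtemp()
writer = BlockWriter(root)
for i in range(1000):
    writer.write_block(i % 4, b"x" * 1024)  # 1 KB blocks from mappers
print(writer.bytes_written)  # total bytes landed on disk
```

Under this model, memory stays flat as shuffle volume grows, which is why sustained high memory usage on an RSS node would be surprising and worth profiling.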

Normally the bottleneck of the RSS server will be disk I/O and network bandwidth, since Spark applications write/read a lot of data through it. You could start with 10 to 50 Spark executors mapped to one RSS server, then observe disk/network metrics on the RSS server and adjust accordingly.
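If disk I/O is the bottleneck, the advice above can be turned into a rough back-of-envelope estimate. The sketch below is an assumption-laden illustration, not guidance from this thread: the per-server disk bandwidth, peak-to-average ratio, and target utilization are all placeholders you would replace with your own measurements.

```python
# Back-of-envelope RSS cluster sizing from disk bandwidth alone.
# All inputs are assumptions to be tuned from real metrics.
def estimate_rss_servers(daily_shuffle_tb, replicas, server_disk_mb_s,
                         peak_to_avg=3.0, utilization=0.7):
    """Estimate the number of RSS servers needed.

    daily_shuffle_tb : total shuffle written per day, in TB
                       (reads are assumed to add a similar load)
    replicas         : each block is written `replicas` times
    server_disk_mb_s : sustained disk bandwidth per server, MB/s
    peak_to_avg      : assumed peak-hour load vs. daily average
    utilization      : target fraction of disk bandwidth to use
    """
    # Average write rate in MB/s: TB/day -> MB/s, doubled to account
    # for reads, multiplied by the replication factor.
    avg_mb_s = daily_shuffle_tb * 1024 * 1024 / 86400 * 2 * replicas
    peak_mb_s = avg_mb_s * peak_to_avg
    servers = peak_mb_s / (server_disk_mb_s * utilization)
    return max(1, round(servers))


# Using the numbers from this issue (100 TB/day, replicas=2) and an
# assumed 500 MB/s of sustained disk bandwidth per server:
print(estimate_rss_servers(100, 2, 500))  # -> 42
```

Network bandwidth deserves the same treatment, and whichever resource yields the larger server count is the one to size for; then validate against the observed disk/network metrics as suggested above.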