strapdata / elassandra

Elassandra = Elasticsearch + Apache Cassandra
http://www.elassandra.io
Apache License 2.0

Problems with Cluster Data Environment #373

Open Coder-Qian opened 3 years ago

Coder-Qian commented 3 years ago

Question 1: We know that 2^31 (about 2.1 billion) is the Lucene maximum number of documents per index, so by default one Cassandra node can only index about 2.1 billion rows. For a single node, is there any other way to go beyond 2.1 billion without using virtual indexes? I ask because I don't know whether virtual indexes have performance issues.

Question 2: I have about 50 billion rows of time-series data. If I use 3 replicas, how many nodes should I set up to get query latencies on the order of seconds? How many CPU cores and how much memory are recommended for each node? And how much off-heap memory should I leave on each node? In the production environment we have 7 servers, each with 256 GB of RAM and 32 CPU cores. Can they meet the requirements when running in Docker?

Please forgive my poor English. Thank you for your advice.

vroyer commented 3 years ago

Question 1: Partitioned indices allow you to create many indices on top of a single Cassandra table, to go beyond the Lucene limit of 2^31 documents. Virtual indices allow the Elasticsearch mapping to be stored only once and shared by many partitioned indices. Yes, there is a performance penalty if you create a lot of small partitioned indices (like Elasticsearch over-sharding).
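For reference, a minimal sketch of how a partitioned index can be created through the REST API, assuming the `index.partition_function` and `index.virtual_index` settings described in the Elassandra documentation; the keyspace `ks`, table `metrics`, column `day`, and function name `toYearIndex` are placeholders, and the exact setting syntax should be checked against the docs for your Elassandra version:

```python
import requests

ES = "http://localhost:9200"

# Hypothetical partitioned index: documents from the Cassandra table ks.metrics
# are routed to per-year indices (ks_2019, ks_2020, ...) based on the "day"
# column, so no single Lucene index has to hold more than 2^31 documents.
body = {
    "settings": {
        "keyspace": "ks",
        # "<function name> <MessageFormat pattern> <column>": the target index
        # name is derived from the value of the "day" column.
        "index.partition_function": "toYearIndex ks_{0,date,yyyy} day",
        # All partitioned indices share the mapping stored once in the
        # virtual index "ks".
        "index.virtual_index": "ks",
    },
    "mappings": {
        "metrics": {"discover": ".*"}  # discover the CQL columns automatically
    },
}

resp = requests.put(f"{ES}/ks_2020", json=body)
print(resp.status_code, resp.json())
```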

Question 2: It really depends on your data; 50 billion rows with one column or with 200 columns is not the same thing. A JVM 8 heap should not exceed about 30 GB (see https://www.elastic.co/blog/a-heap-of-trouble). You should test your application on a 3-node datacenter, add more data until latency/throughput reaches the limit that is acceptable to you, then scale horizontally to store your whole dataset.
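As a back-of-the-envelope way to turn such a test into a node count, here is a hedged sketch; `rows_per_node` is a placeholder for whatever per-node limit your 3-node latency test actually shows:

```python
# Rough capacity extrapolation from a small test datacenter.
rows_total    = 50_000_000_000   # target dataset size (rows)
replicas      = 3                # Cassandra replication factor
rows_per_node = 2_000_000_000    # example: rows one node handled within your latency budget

nodes_needed = -(-rows_total * replicas // rows_per_node)  # ceiling division
print(f"~{nodes_needed} nodes needed at RF={replicas}")    # ~75 with these example numbers
```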

Coder-Qian commented 3 years ago

Our production environment has three tables. Each table has about 70 fields and needs to store about 20 billion rows, with yyyyMMdd as the partition key. Since each Cassandra node should have at most a 30 GB heap, one server with 256 GB of memory can host 5 Cassandra nodes, using a 2:1 ratio of heap to off-heap memory (256 / (30 + 15) = 5.6). We want to size the cluster well when building the production environment, because migrating data after adding nodes to Elassandra is too slow. Thank you all for helping me. Please don't let this post sink.
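For what it is worth, the 256 / (30 + 15) estimate can be sanity-checked like this; the 30 GB heap and 15 GB off-heap figures are the assumptions from the comment above, and any RAM left over would go to the OS page cache shared by the co-located nodes:

```python
# Back-of-the-envelope check of the nodes-per-server estimate (2:1 heap/off-heap).
ram_gb     = 256   # RAM per physical server
heap_gb    = 30    # JVM heap per Elassandra node
offheap_gb = 15    # assumed off-heap / overhead per node

nodes_per_server = ram_gb // (heap_gb + offheap_gb)   # 256 // 45 = 5
print(nodes_per_server, "Elassandra nodes per 256 GB server")
```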