lucidworks / hive-solr

Code to index Hive tables to Solr and Solr indexes to Hive
Apache License 2.0

Indexing more than 250M rows from Hive to Solr #23

Open disoardi opened 7 years ago

disoardi commented 7 years ago

Hi all,

We are trying to index more than 250 million rows from a Hive table (ORC format), but we have noticed that the indexing is too slow.

We have 9 Solr nodes (9 shards, 2 replicas per shard), and we have set the maxIndexingThreads parameter to 128 and ramBufferSizeMB to 60 MB.

While running the INSERT INTO on the external table (where the hive-serde is used), the servers' CPUs are idle and the indexing throughput is below 1 million documents per hour.

Since the servers are idle, how can we make this faster? We have plenty of CPU and RAM, but the indexing process is not using them. Any suggestions? Is there any client-side parameter we could configure to use all the threads? Thanks in advance.

PS: We have set the commit intervals (autoCommit and autoSoftCommit) to 10 minutes or 1 million documents.
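For reference, commit settings like those described above would normally live in solrconfig.xml. A minimal sketch for Solr 5.x, using only the values stated above (the element names are standard Solr; the exact file in this setup is not shown in the thread):

```xml
<!-- Hypothetical solrconfig.xml fragment matching the described settings:
     hard commit every 10 minutes or 1M docs, soft commit every 10 minutes. -->
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxTime>600000</maxTime>   <!-- 10 minutes, in milliseconds -->
    <maxDocs>1000000</maxDocs>  <!-- 1 million documents -->
    <openSearcher>false</openSearcher>
  </autoCommit>
  <autoSoftCommit>
    <maxTime>600000</maxTime>   <!-- 10 minutes -->
  </autoSoftCommit>
</updateHandler>
```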

acesar commented 7 years ago

@disoardi Please share your Solr table configuration. Please also share the versions of Solr, Hive/Tez, and Yarn (how many Yarn nodes do you have?)

disoardi commented 7 years ago

Hive external table with serde:

```sql
ADD JAR /home/my_comp/solr-hive-serde-2.2.1.jar;

CREATE EXTERNAL TABLE IF NOT EXISTS my_comp_solr.my_comp_user_number (
  id STRING,
  cod_1 STRING,
  cod_2 STRING,
  cod_tipo STRING,
  flg_delete INT)
STORED BY 'com.lucidworks.hadoop.hive.LWStorageHandler'
LOCATION '/user/solr/my_comp_user_number'
TBLPROPERTIES ('solr.zkhost' = 'xxx.xxx.xxx.xxx:2181/solr',
               'solr.collection' = 'my_comp_user_number',
               'solr.query' = '*:*');
```
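The indexing itself is then driven by an INSERT INTO on this external table, as mentioned earlier in the thread. A minimal sketch; the source table name `my_comp_user_number_orc` is an assumption, since only the Solr-backed target table is shown here:

```sql
-- Hypothetical indexing statement: the ORC source table name is illustrative;
-- writing into the LWStorageHandler-backed table pushes the rows to Solr.
INSERT INTO TABLE my_comp_solr.my_comp_user_number
SELECT id, cod_1, cod_2, cod_tipo, flg_delete
FROM my_comp_solr.my_comp_user_number_orc;
```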

Hortonworks Data Platform 2.3.2:
- Solr 5.5.0, with 20 GB of Xmx per node
- Hive 1.2.1
- Tez 0.7.0
- YARN 2.7.1

We have 9 Yarn nodes with 96 GB per node (864 GB total in the YARN queue).

Thanks in advance

acesar commented 7 years ago

Do you have only one Zookeeper node? The recommended minimum is 3 Zookeeper nodes. The zk string should look something like:

   'solr.zkhost' = 'host1:2181,host2:2181,host3:2181/solr'

Can you share the output of the indexing? Are there any errors in the yarn/hive logs?

You can try increasing the Solr buffer size lww.buffer.docs.size; the default is 500 documents.

The lww.buffer.docs.size can be set as a global Hive property or in TBLPROPERTIES:

hive> set lww.buffer.docs.size=5000;
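Both forms can be sketched as follows; the table name reuses the one from the DDL above, and the value 5000 is illustrative:

```sql
-- Globally, for the current Hive session:
SET lww.buffer.docs.size=5000;

-- Or per table, via TBLPROPERTIES (illustrative value):
ALTER TABLE my_comp_solr.my_comp_user_number
  SET TBLPROPERTIES ('lww.buffer.docs.size' = '5000');
```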

Some tests with 3 Solr/Yarn nodes (Solr and Yarn installed on the same nodes), indexing 1,000,000 Hive records:

- 8 shards -> 101.096 seconds
- 2 shards -> 229.545 seconds

disoardi commented 7 years ago

Sorry for the delay, but I found the solution: I set solr.client.threads. The default is 1.
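In case anyone else hits the same bottleneck, a sketch of how this could be set; only the property name and its default of 1 come from this thread, and the value 16 is illustrative:

```sql
-- Illustrative: raise the number of client-side indexing threads
-- (solr.client.threads defaults to 1; 16 is an example value).
SET solr.client.threads=16;
```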

Do you have any documentation about this option?

Thanks in advance

NethajiRajamanickam commented 5 years ago


Hi, please share the jar file.