lucidworks / hive-solr

Code to index Hive tables to Solr and Solr indexes to Hive
Apache License 2.0

Indexing to a TRA (Time Routed Alias) is not supported? #63

Open nashornjohn opened 5 years ago

nashornjohn commented 5 years ago

I've been using hive-solr to index my data to Solr collections, and it has worked well. Last week I wanted to try Solr's Time Routed Alias (TRA) feature, which behaved as expected when I manually indexed a few test documents. However, when I prepared some data to index into a TRA through hive-solr, it didn't seem to see the alias; it behaved as if I were indexing into a non-existent collection. Does hive-solr support the TRA feature at all?

ctargett commented 5 years ago

Can you share what you see for errors or output on either the Hive or Solr side? I don't know of any reason why TRA wouldn't be supported, but I do know it's never been tested and there could be some hidden limitation there.

nashornjohn commented 5 years ago

I have uploaded a log file (file.log). This is what happens if I use the TRA name as my external table's target.

By the way, as a workaround I'm currently using the first collection created by the TRA as the target of my external table. This way I'm able to index my data, and some of my documents are distributed to their correct collections, but not all of them. That's actually a second problem, which I reported on the Solr users' mailing list.

gerlowskija commented 5 years ago

Hey John, thanks for the logs and thanks for bringing this to our attention.

I'm gonna take a look at reproducing and will post here if/when I find something. In the meantime, do you have the commands you used to set up your TRA collection group? Any other details you could provide would also help me check my work, so to speak.

nashornjohn commented 5 years ago

Hi. The commands used are:

CREATE EXTERNAL TABLE solr.data_tra_test(
  myfield1 string,
  myfield2 int,
  myfield3 int,
  myfield4 string,
  myfield5 string
)
ROW FORMAT SERDE
  'com.lucidworks.hadoop.hive.LWSerDe'
STORED BY
  'com.lucidworks.hadoop.hive.LWStorageHandler'
WITH SERDEPROPERTIES (
  'serialization.format'='1')
LOCATION
  'hdfs://hivesolr1.mydomain.com:8020/user/admin/solr/data_tra'
TBLPROPERTIES (
  'solr.collection'='data_tra_test',
  'solr.query'='*:*',
  'solr.zkhost'='zk1.mydomain.com:2181,zk2.mydomain.com:2181,zk3.mydomain.com:2181');

Thank you for your time.

gerlowskija commented 5 years ago

Hey John, quick update on my progress:

I'm not able to reproduce the particular error you're seeing. Here are the steps I'm trying:

  1. Create a "normal" table in Hive: CREATE TABLE time_data1 (id INT, date_dt STRING) LOCATION '/user/hive/time_data1';
  2. Fill the table with some time-based data. In this example I create a document for each hour so far today, and make "+1HOUR" the interval for my TRA below. (so each collection should get 1 doc)
    INSERT INTO time_data1 VALUES (0, '2018-11-29 00:15:00');
    INSERT INTO time_data1 VALUES (1, '2018-11-29 01:15:00');
    INSERT INTO time_data1 VALUES (2, '2018-11-29 02:15:00');
    ...
  3. Create a TRA in Solr: curl -ilk -X GET "http://$HOST:8983/solr/admin/collections?action=CREATEALIAS&name=timedata&router.start=NOW/DAY&router.field=date_dt&router.name=time&router.interval=%2B1HOUR&router.maxFutureMs=3600000&create-collection.collection.configName=_default&create-collection.numShards=1"
  4. Create an external Hive table linked to this collection:
    CREATE EXTERNAL TABLE time_data2 (id_i INT, date_dt STRING)
    ROW FORMAT SERDE 'com.lucidworks.hadoop.hive.LWSerDe'
    STORED BY 'com.lucidworks.hadoop.hive.LWStorageHandler'
    LOCATION '/user/hive/time_data2'
    TBLPROPERTIES( 'solr.collection' = 'timedata', 'solr.query' = '*:*', 'solr.zkhost' = 'host1:2181,host2:2181,host3:2181/solr');
  5. Index your dummy data to Solr: INSERT INTO time_data2 SELECT id, date_dt FROM time_data1;
  6. Check your results (curl the alias itself, and the underlying collections individually). In my example, queries to the alias return all documents and queries to individual collections each return a single document.

Am I doing something different in my reproduction here that you're not doing on your end? If so, let me know and maybe we can understand what's going on.
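To double-check the "each collection should get 1 doc" expectation from the steps above, here's a small sketch that buckets the test rows the way an hourly TRA would. This is purely illustrative: the `timedata_yyyy-MM-dd_HH` collection-naming pattern is an assumption about Solr's TRA convention for sub-day intervals, not anything hive-solr exposes.

```python
from datetime import datetime

# Hypothetical sketch: bucket documents the way a +1HOUR TRA would.
# The collection-name format is assumed for illustration only.
def tra_collection(alias, date_str):
    ts = datetime.strptime(date_str, "%Y-%m-%d %H:%M:%S")
    bucket = ts.replace(minute=0, second=0)  # +1HOUR interval -> truncate to the hour
    return f"{alias}_{bucket:%Y-%m-%d_%H}"

docs = [(0, "2018-11-29 00:15:00"),
        (1, "2018-11-29 01:15:00"),
        (2, "2018-11-29 02:15:00")]

for doc_id, date_dt in docs:
    print(doc_id, tra_collection("timedata", date_dt))  # one doc per hourly bucket
```

With the three rows above, every document lands in a distinct hourly bucket, which matches the one-document-per-collection result described in step 6.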


While I didn't find the particular error you mentioned, I did see a few other things that might be bugs. Specifically:

  1. If I store the date in Hive as a TIMESTAMP, indexing to Solr fails because SolrJ doesn't understand how to parse Timestamp objects. As far as I can tell this has nothing to do with TRAs and is more of a datatype-mapping problem.
  2. The steps I outlined above allow indexing into a Solr TRA, but querying TRAs doesn't seem to work as expected. (That is, a SELECT * FROM time_data2 only returns a single document, suggesting that it's only looking at a single collection.)
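For the first item, one possible workaround (a sketch, not tested against this exact setup) is to keep the column as a STRING and reformat Hive's TIMESTAMP rendering ("yyyy-MM-dd HH:mm:ss") into the ISO-8601 form Solr's date fields expect, before the row ever reaches SolrJ. The conversion itself, illustrated in Python:

```python
from datetime import datetime, timezone

# Sketch: Hive renders TIMESTAMP values as "yyyy-MM-dd HH:mm:ss", while Solr
# date fields expect ISO-8601 UTC ("yyyy-MM-dd'T'HH:mm:ss'Z'"). Converting the
# string form before indexing sidesteps the Timestamp-object mapping problem.
def hive_ts_to_solr_date(ts: str) -> str:
    dt = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)
    return dt.strftime("%Y-%m-%dT%H:%M:%SZ")

print(hive_ts_to_solr_date("2018-11-29 00:15:00"))  # -> 2018-11-29T00:15:00Z
```

Note this assumes the Hive timestamps are already in UTC; if they aren't, a timezone shift would be needed before formatting.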

gerlowskija commented 5 years ago

querying TRAs doesn't seem to work... it's only looking at a single collection

Curious, I tried hive-solr out with a standard alias and I see the same behavior. If I set up an alias 'A' to point at two collections (C1 and C2) which both have data, querying an external table linked with 'A' only returns the data from C2. I suspect this is a deficiency in how we turn a collection name into map-reduce splits. At first glance the code that does this here doesn't appear to be alias-aware. Still testing though.

EDIT: found a bug in our querying logic and proposed a fix here. I'm still not able to reproduce the issue you reported though John, so if you've got any more specific instructions for reproducing, please let me know.
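The shape of the problem and its fix can be modeled abstractly: before generating map-reduce splits, resolve the name through the alias table (if one applies) and emit splits for every member collection, rather than treating the name as a single collection. This is an illustrative model only; the `ALIASES`/`SHARDS` tables and the split representation below are made up for the sketch and are not the actual hive-solr code.

```python
# Illustrative model of alias-aware split generation (not the real hive-solr
# code): resolve an alias to its member collections first, then emit one
# split per shard of every member, instead of covering only one collection.
ALIASES = {"A": ["C1", "C2"]}                        # hypothetical alias table
SHARDS = {"C1": ["C1_shard1"], "C2": ["C2_shard1"]}  # hypothetical topology

def resolve(name):
    # An alias maps to N member collections; a plain collection maps to itself.
    return ALIASES.get(name, [name])

def build_splits(name):
    return [shard for coll in resolve(name) for shard in SHARDS[coll]]

print(build_splits("A"))   # splits now cover both member collections
print(build_splits("C1"))  # plain collections behave as before
```

Without the `resolve` step, a query against alias 'A' would build splits for only one collection, which matches the only-C2-data symptom described above.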

nashornjohn commented 5 years ago

Hi Jason, thank you for your attention. I'm still getting the NPE with steps similar to those above. Could this be specific to the versions in my stack?

HDP version: 2.6.1.0
hive-solr version: 3.0.0
Solr version: 7.5.0

Here are my steps and output:

  1. curl -s "http://solr1.mydomain.com:8981/solr/admin/collections?action=CREATEALIAS&name=timedata&router.start=2018-11-29T00:00:00Z&router.field=date_dt&router.name=time&router.interval=%2B1HOUR&router.maxFutureMs=3600000&create-collection.collection.configName=_default&create-collection.numShards=1"
  2. CREATE TABLE solr.time_data1 (id INT, date_dt STRING) LOCATION '/user/hive/time_data1'
  3. INSERT INTO solr.time_data1 VALUES (0, '2018-11-29 00:15:00');
    INSERT INTO solr.time_data1 VALUES (1, '2018-11-29 01:15:00');
    INSERT INTO solr.time_data1 VALUES (2, '2018-11-29 02:15:00');
  4. CREATE EXTERNAL TABLE time_data2 (id_i INT, date_dt STRING)
    ROW FORMAT SERDE 'com.lucidworks.hadoop.hive.LWSerDe'
    STORED BY 'com.lucidworks.hadoop.hive.LWStorageHandler'
    LOCATION '/user/hive/time_data2'
    TBLPROPERTIES( 'solr.collection' = 'timedata', 'solr.query' = '*:*', 'solr.zkhost' = 'zk1.mydomain.com:2181,zk2.mydomain.com:2181,zk3.mydomain.com:2181');
  5. And finally: INSERT INTO time_data2 SELECT id, date_dt FROM solr.time_data1;

Excerpt of output related to the error:

...
2018-11-30 07:39:07,367 [INFO] [TezChild] |io.LucidWorksWriter|: Sending 3 documents
2018-11-30 07:39:07,378 [ERROR] [TezChild] |impl.CloudSolrClient|: Request to collection timedata failed due to (0) java.lang.NullPointerException, retry? 0
2018-11-30 07:39:07,379 [INFO] [TezChild] |io.LucidWorksWriter|: Enter retry logic with Exception
shaded.org.apache.solr.client.solrj.SolrServerException: java.lang.NullPointerException
    at shaded.org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:1246)
    at shaded.org.apache.solr.client.solrj.impl.CloudSolrClient.request(CloudSolrClient.java:1073)
    at shaded.org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:160)
    at shaded.org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:106)
    at shaded.org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:71)
    at shaded.org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:85)
    at com.lucidworks.hadoop.io.LucidWorksWriter.sendBuffer(LucidWorksWriter.java:245)
    at com.lucidworks.hadoop.io.LucidWorksWriter.close(LucidWorksWriter.java:265)
    at com.lucidworks.hadoop.hive.LWHiveOutputFormat$1.close(LWHiveOutputFormat.java:50)
...
Caused by: java.lang.NullPointerException
    at shaded.org.apache.solr.client.solrj.impl.CloudSolrClient.directUpdate(CloudSolrClient.java:760)
    at shaded.org.apache.solr.client.solrj.impl.CloudSolrClient.sendRequest(CloudSolrClient.java:1263)
    at shaded.org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:1134)
...

gerlowskija commented 5 years ago

Hey John, no problem.

I think you're right about the versions being relevant here. Thanks for mentioning that, I should have asked earlier.

hive-solr 3.0.0 is aimed at Solr 6.x. It was tested primarily against Solr 6.6.2, and bundles Solr 6.6.2 internally for making requests. Since Solr itself doesn't guarantee client/server compatibility across major releases, we can't guarantee that hive-solr 3.0.0 will work with Solr 7.5 in all situations, especially for features like TRAs, which were introduced in 7.x.

hive-solr 4.0.0 is our release targeting Solr 7.x, and I expect you'll see the correct behavior if you use that instead. If you can, give that a shot and let me know what happens. (I'll give 3.0.0 a shot as well, to double-check that this is version-specific behavior we're seeing.)