twitter / summingbird

Streaming MapReduce with Scalding and Storm
https://twitter.com/summingbird
Apache License 2.0
2.14k stars 267 forks

Can Summingbird be integrated with Impala for real-time queries? #694

Closed leesf closed 7 years ago

leesf commented 7 years ago

In the Summingbird client, can we use Impala to query the online k-v store and the offline k-v store for real-time queries?

johnynek commented 7 years ago

I don't know how to write plugins for Impala but it may be possible.

Summingbird pre-aggregates all the data, so you generally don't need a system like Impala to read results.

leesf commented 7 years ago

It seems Summingbird can return the count for a given key (in a key-value pair) to the client, but what can we do for more complex requirements that are not counting? Thanks for your answer, @johnynek.

johnynek commented 7 years ago

You can do joins inside Summingbird, and you can do aggregations other than counting inside Summingbird, but they need to have a common associative structure. All the algorithms in Algebird can be used (for instance HyperLogLog, CountMinSketch, QTree and more):

https://github.com/twitter/algebird

but I don't know what your use case is exactly.
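To make "associative structure" concrete, here is a minimal sketch in plain Python. It is not Algebird or Summingbird code; the `Stats` type and the `plus` function are hypothetical names chosen for illustration. The point is that when the combine operation is associative, partial aggregates computed separately (for example by a batch job and by a streaming job) can be merged and give the same answer as aggregating everything in one pass.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class Stats:
    """A toy aggregate (hypothetical): count, sum, and max all combine associatively."""
    count: int
    total: float
    maximum: float


def stats_of(x: float) -> Stats:
    """Lift a single value into the aggregate."""
    return Stats(1, x, x)


def plus(a: Stats, b: Stats) -> Stats:
    # Associative: plus(plus(a, b), c) == plus(a, plus(b, c))
    return Stats(a.count + b.count, a.total + b.total, max(a.maximum, b.maximum))


values = [3.0, 9.0, 4.0, 1.0]

# Aggregate two chunks independently, then merge the partial results.
left = reduce(plus, map(stats_of, values[:2]))
right = reduce(plus, map(stats_of, values[2:]))
merged = plus(left, right)

# Aggregating everything in one pass gives the same result.
whole = reduce(plus, map(stats_of, values))
```

Sketches such as HyperLogLog follow the same pattern: each one defines its own associative merge, which is what lets partial results be combined.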

leesf commented 7 years ago

@johnynek thanks for your timely reply, the query case is below:

SELECT R.MKTSEGMENT, COUNT(S.ORDERKEY) FROM CUSTOMER R JOIN ORDERS S ON R.CUSTKEY = S.CUSTKEY GROUP BY R.MKTSEGMENT HAVING COUNT(*) > 3

The tables are in HBase (used for both the online and the offline store). You can ignore the specific tables and fields. Which Algebird structure is appropriate for this? I don't know whether I made it clear. Thanks again for your answer, @johnynek.

johnynek commented 7 years ago

This could be written with Summingbird (basic SQL like this has pretty much a 1-to-1 correspondence with Summingbird code), but if your data is small enough, a periodic poll of some Presto or Redshift data would probably be easier.
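As a rough illustration of that correspondence, here is a sketch in plain Python rather than Summingbird's API. The sample data and the function names `segment_counts` and `having_more_than` are hypothetical. The join becomes a lookup from CUSTKEY to MKTSEGMENT, GROUP BY plus COUNT becomes a sum of per-order contributions keyed by segment, and HAVING is a filter applied when the aggregated counts are read.

```python
from collections import Counter

# Hypothetical sample data standing in for the CUSTOMER and ORDERS tables.
customers = {1: "AUTOMOBILE", 2: "BUILDING", 3: "AUTOMOBILE"}  # CUSTKEY -> MKTSEGMENT
orders = [  # (ORDERKEY, CUSTKEY)
    (100, 1), (101, 1), (102, 3), (103, 3), (104, 2), (105, 9),
]


def segment_counts(order_rows, customer_segment):
    """JOIN + GROUP BY + COUNT as a sum of per-order (segment -> 1) contributions."""
    counts = Counter()
    for _order_key, cust_key in order_rows:
        segment = customer_segment.get(cust_key)
        if segment is not None:  # inner join: orders without a customer are skipped
            counts[segment] += 1
    return counts


def having_more_than(counts, threshold):
    """HAVING COUNT(*) > threshold, applied when reading the aggregated counts."""
    return {seg: n for seg, n in counts.items() if n > threshold}


# Counters add associatively, so counts from two sources can be merged.
historical = segment_counts(orders[:3], customers)
recent = segment_counts(orders[3:], customers)
result = having_more_than(historical + recent, 3)
```

Because the per-segment counts simply add, the historical and recent halves can be computed separately and merged, which is the property a hybrid batch and streaming setup relies on.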

The problem with summingbird is that it is more for "data-engineering". Just one query will be a pain because you need to set up a storm or heron cluster as well as have a hadoop cluster, and finally, you need to have some realtime storage.

Frankly, there is a lot to set-up. I would only recommend summingbird to someone that:

  1. has a number of data-engineering problems to solve involving real-time queries
  2. is already using Hadoop
  3. is willing to do the steps to set up the pieces.

It is used in production at Twitter fairly heavily, and also at Stripe, but probably not many other places because we really have not written nice guides on how to set it up.

I would look at: https://github.com/twitter/summingbird/tree/develop/summingbird-example/src/main/scala/com/twitter/summingbird/example and try to compile that and run on your local machine (the README has some hints). If it seems like it makes sense for your use case, join the mailing list and ask questions as they come up.

leesf commented 7 years ago

Following your advice, I compiled and ran the example successfully on my local machine, and I have run the counting jobs in hybrid mode. I need to do real-time queries in hybrid mode because I want to combine historical data with up-to-date data, but my question now is how to implement the SQL query mentioned above with Summingbird core.

One more question: in some documents about Summingbird, I learned that Summingbird uses a cache to avoid frequent access to the online k-v store, and that once the cache is full, its contents are written to the k-v store. In my view this causes a problem: client queries read both the online and the offline k-v store, but the online k-v store may not yet contain a result because it is still in the cache (the cache is not full). How does Summingbird deal with this?

johnynek commented 7 years ago

About the cache, you can tune how often you flush (both in time, and in size of the cache). If you don't have the high write-rate that Twitter does, you could probably disable caching entirely for the freshest results.

To deal with somewhat "hot" keys, you might use limited caching (say, 1-second flushes). At Twitter, in cases where latency is not the utmost issue, you might use a 10-second cache to dramatically reduce the write rate.
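The trade-off can be sketched with a small write buffer in plain Python. This is not Summingbird's cache implementation; `SummingCache` and its parameters `max_keys` and `max_age_seconds` are hypothetical names. The buffer sums deltas per key and writes them out when either a size limit or an age limit is reached, so a hot key produces one store write per flush instead of one per event.

```python
import time
from collections import defaultdict


class SummingCache:
    """Buffers per-key deltas and flushes them when a size or age limit is hit."""

    def __init__(self, flush, max_keys=1000, max_age_seconds=1.0, clock=time.monotonic):
        self._flush = flush              # callback receiving a dict of key -> summed delta
        self._max_keys = max_keys
        self._max_age = max_age_seconds
        self._clock = clock              # injectable so the example is deterministic
        self._pending = defaultdict(int)
        self._last_flush = clock()

    def add(self, key, delta):
        self._pending[key] += delta      # pre-sum writes to the same (hot) key
        too_big = len(self._pending) >= self._max_keys
        too_old = self._clock() - self._last_flush >= self._max_age
        if too_big or too_old:
            self.flush()

    def flush(self):
        if self._pending:
            self._flush(dict(self._pending))
            self._pending.clear()
        self._last_flush = self._clock()


# Usage with a fake clock: a 10-second flush interval.
writes = []
now = [0.0]
cache = SummingCache(writes.append, max_keys=100, max_age_seconds=10.0, clock=lambda: now[0])
for _ in range(5):
    cache.add("hot-key", 1)   # five increments, no store write yet
now[0] = 10.0
cache.add("hot-key", 1)       # age limit reached: one write carrying the summed delta
```

In this sketch the age check only runs inside `add`, so an idle buffer is not flushed by time alone. Setting `max_keys=1` makes every `add` write through immediately, which corresponds to disabling caching for the freshest results.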

leesf commented 7 years ago

Thanks for your timely reply, that makes it clear. I hope everything goes well for you, @johnynek. However, as I learn more about Summingbird, I think it is not efficient enough with the online k-v store. When we make a query, it first fetches the BatchID and V from the batch store, then computes the BatchIDs up to the current time, and finally queries the online store with K and each BatchID. That is to say, for every query, Summingbird needs to query the online store many times, which increases the load on the database. So I want to design my own online k-v store with Storehaus that needs only one lookup to get the total result. Is this optimization feasible?

johnynek commented 7 years ago

Generally you will only need to hit the offline store once and the online store once. If your offline jobs get very delayed you may need to hit more online keys, but this should be rare. If you use larger batch sizes, this can be very rare. At Stripe we use 1-day batch sizes, so we almost always hit at most 2 online keys, but generally 1 except during a short window while the offline jobs run each day.

Lastly, you can cache the online values that do not correspond to the "now" batch. We have never actually needed to do this at Twitter or Stripe, but if you have higher query rates, you may need to (though I seriously doubt it would be an issue).
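The read path described in this thread can be sketched as follows, in plain Python rather than the Summingbird client's API. The store layouts and the name `merged_read` are assumptions for illustration: the offline store is taken to return the first batch it has not yet covered together with the value aggregated so far, and the online store is keyed by (key, batch). The sketch also returns the number of online lookups, to show why it is usually one and grows only when the offline job lags.

```python
def merged_read(key, offline_store, online_store, current_batch):
    """Combine one offline read with the online values for later batches.

    offline_store: dict key -> (first_uncovered_batch, value)   (hypothetical layout)
    online_store:  dict (key, batch_id) -> value                (hypothetical layout)
    Returns (total, number_of_online_lookups).
    """
    first_uncovered_batch, total = offline_store.get(key, (0, 0))
    lookups = 0
    for batch_id in range(first_uncovered_batch, current_batch + 1):
        total += online_store.get((key, batch_id), 0)
        lookups += 1
    return total, lookups


offline = {"clicks": (17, 1000)}   # batches before 17 are already summarized offline
online = {("clicks", 16): 40, ("clicks", 17): 25, ("clicks", 18): 7}

# Offline job is up to date: only the current batch is read online.
up_to_date = merged_read("clicks", offline, online, current_batch=17)

# Offline job is one batch behind: two online keys are read.
job_lagging = merged_read("clicks", offline, online, current_batch=18)
```

Batch 16 is ignored in both reads because the offline value already covers it. With larger batches the window in which the offline job is behind becomes a smaller fraction of the time, so the second case becomes rarer.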

leesf commented 7 years ago

OK, Thank you so much @johnynek