opensrp / opensrp-reveal-datawarehouse

Scripts that build the Reveal data warehouse
0 stars 1 forks source link

Investigate solutions to stream into materialized views #56

Open pld opened 3 years ago

pld commented 3 years ago

Questions from @moshthepitt

I am generally on board. But these things are unclear to me:

  • is this a SAAS offering, or something that we can self-host if we wanted to?
  • how do we "deploy" to is? our migrations will work?

Tasks:

moshthepitt commented 3 years ago

Posting a comment from slack: if we are going in Pinot's direction we might as well think about something like cassandra which is in the same class of tools imho and arguably more well-known.

moshthepitt commented 3 years ago

I also think that in comparing these different tools we need to make them produce the same report from the same database and then we can objectively weigh them against each other and against postgres. Luckily (ha!) we can use one of the already existing and challenging Zambia IRS reports.

gstuder-ona commented 3 years ago

@ukanga pinot's arch at first glance seems fairly similar to Druid, but isn't Superset-compatible. Is there a killer feature it has that you're most interested in? I ask because we've got a little experience in Druid clusters, so wondering how that went for our analytics - I think presto is also compatible with Druid?

gstuder-ona commented 3 years ago

@moshthepitt re questions

There's a lot of details around editing deployed views, etc. to figure out but it's a very targeted fix

pld commented 3 years ago

The 3 options here feel like 3 separate approaches, and (I think?) only materlialize.io addresses the specific problem of view recomputation, although they all target the same underlying problem (reduce time to compute analysis results on big data)

This leads to the broader question of how we want to solve the problem, and I see at least 4 approaches:

  1. materialize/etc (what’s comparable)? solve by updating mat views incrementally
  2. bigquery/redshift/cassandra/vertica/etc solve by improving the query engine to reduce time to run (or recompute) queries and stay in (or near) SQL
  3. Druid/Pinot/etc do the same as (2) but not in SQL
  4. Airflow/Oozie/Azkaban optimize dependant queries to minimize recomputation of incremental results

Dumping everything into redshift and seeing how long our expensive mat views run as queries could be worth looking at, but it looks like there’s a bunch of optimization (designing and tuning tables) that happens there too, although it does support geo, https://docs.aws.amazon.com/redshift/latest/dg/geospatial-overview.html, so we'd have to think if we're merely shifting the problem, probably can't answer that until we spend a day testing.

CC @mberg