verdict-project / verdict

Interactive-Speed Analytics: 200x Faster, 200x Fewer Cluster Resources, Approximate Query Processing
http://verdictdb.org
Apache License 2.0

STRUCT support in Verdict for scrambles #354

Open voycey opened 5 years ago

voycey commented 5 years ago

Hi guys,

One of our tables has recently started receiving data in the form of a struct (array / row).

For example:

{city=Jackson, state=WY, zip=83001, county=Teton, msa=null, country=US} 

{city=Cheyenne, state=WY, zip=82001, county=Laramie, msa=null, country=US}

{city=Gillette, state=WY, zip=82718, county=Campbell, msa=null, country=US}

I was wondering how Verdict builds its scrambles based on this kind of data? Is this a data structure you actively support? Would each of the internal items be capable of producing fast aggregations?

For example:

SELECT count(distinct(Location.city)) from table

Our scramble performance has dropped significantly, but we aren't sure whether this is related to the struct column.

pyongjoo commented 5 years ago

VerdictDB should just work. One possible reason for the slowdown is that columnar formats may not be very efficient for such data types.

If you can load sample data into the cluster, we may be able to test them.

pyongjoo commented 5 years ago

@dongyoungy Can you ask someone to investigate this by comparing different compression formats for our scramble tables? Maybe we can try different formats (e.g., ORC or parquet) with different compression schemes.

voycey commented 5 years ago

I'm unsure of the internals, but yes, I agree that structs in a columnar format are probably not ideal. They seem to be the preferred approach in BigQuery (where this data originated). We are considering flattening them out as a last resort, but we would prefer to get some information on exactly how Verdict handles this before we do anything drastic :)
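
For reference, the flattening mentioned above could look roughly like the sketch below. It is plain Python over in-memory rows, not VerdictDB or BigQuery code. The `id` column, the `flatten_row` helper, and the `Location_city` naming scheme are assumptions made purely for illustration; the three records are the ones from the first comment.

```python
# Hypothetical rows shaped like the Location struct shown earlier in the thread.
# The "id" column is an assumption; only the struct fields come from the thread.
rows = [
    {"id": 1, "Location": {"city": "Jackson", "state": "WY", "zip": "83001",
                           "county": "Teton", "msa": None, "country": "US"}},
    {"id": 2, "Location": {"city": "Cheyenne", "state": "WY", "zip": "82001",
                           "county": "Laramie", "msa": None, "country": "US"}},
    {"id": 3, "Location": {"city": "Gillette", "state": "WY", "zip": "82718",
                           "county": "Campbell", "msa": None, "country": "US"}},
]


def flatten_row(row, sep="_"):
    """Promote every field of a nested dict (struct) to a top-level column."""
    flat = {}
    for key, value in row.items():
        if isinstance(value, dict):
            # Recurse so that structs nested inside structs are flattened too.
            for sub_key, sub_value in flatten_row(value, sep).items():
                flat[f"{key}{sep}{sub_key}"] = sub_value
        else:
            flat[key] = value
    return flat


flat_rows = [flatten_row(r) for r in rows]

# Rough equivalent of SELECT count(distinct(Location.city)) on the flattened
# rows; None values are skipped, as COUNT(DISTINCT ...) skips NULLs.
distinct_cities = {
    r["Location_city"] for r in flat_rows if r["Location_city"] is not None
}
print(len(distinct_cities))
```

After flattening, each struct field is an ordinary scalar column (`Location_city`, `Location_state`, and so on), which is the layout a columnar scramble table would then store.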

pyongjoo commented 5 years ago

@dongyoungy Can you ask @Beastjoe to investigate this issue? I see two related problems:

  1. Performance when the table contains array or struct
  2. Possible performance degradation when samples keep being appended

voycey commented 5 years ago

Just an FYI, we are refactoring our tables away from this due to performance issues with these data structures. In BigQuery however these are preferred structures (and fairly efficient) - so might be something you want to look at for that side of things :)
