verdict-project / verdict

Interactive-Speed Analytics: 200x Faster, 200x Fewer Cluster Resources, Approximate Query Processing
http://verdictdb.org
Apache License 2.0

Presto scramble should keep original partitions #303

Closed commercial-hippie closed 5 years ago

commercial-hippie commented 5 years ago

As the title suggests, I think the scrambles should keep the original partitions and just prepend the verdictdbblock partition.

Currently it doesn't work that way, but looking at the code, I think that's how it was intended to work?
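To make the proposal concrete, here is a rough sketch of the scramble DDL I would expect (all table and column names here are hypothetical, and the actual DDL VerdictDB generates may differ):

```sql
-- Hypothetical source table partitioned by order_date.
-- The scramble keeps order_date and prepends verdictdbblock,
-- so queries can still prune on the original partition column.
CREATE TABLE orders_scramble (
  order_id BIGINT,
  amount DOUBLE,
  verdictdbtier INT
)
PARTITIONED BY (verdictdbblock INT, order_date DATE);
```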

commercial-hippie commented 5 years ago

I just ran some more tests on this, and while I think it is a good idea, it looks like it might cause more errors for tables that already have multiple partitions.

https://groups.google.com/forum/#!topic/presto-users/5gFbvUoOF5I

dongyoungy commented 5 years ago

Yes, there are issues with simply increasing the number of partitions, especially with Hive. In fact, that is why VerdictDB internally uses 100 as the default maximum number of partitions in scrambles.

If we want to add data to an existing scramble, it is more than likely that we should work with the existing partitions.

pyongjoo commented 5 years ago

@dongyoungy Can you check whether the partition count limit is (1) per table or (2) per insertion/create statement? If it is the latter, we can still easily support daily data ingestion (under the assumption that the insertion date is the only column on which a table is partitioned).

dongyoungy commented 5 years ago

I found that it is the latter (i.e., a single INSERT ... SELECT ... statement cannot insert into more than hive.exec.max.dynamic.partitions partitions). There seems to be no limit on the number of per-table partitions, although having too many partitions may cause performance degradation and other problems (people report this, but I cannot say for sure, as I have not experienced such cases myself yet).
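For reference, these are the Hive session properties involved (the property names are the standard Hive ones; the values below are only illustrative):

```sql
-- Enable dynamic partitioning and raise the per-statement limits.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
SET hive.exec.max.dynamic.partitions = 2000;         -- limit per statement
SET hive.exec.max.dynamic.partitions.pernode = 500;  -- limit per task node
```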

Thus, as long as we specify which partition we create and insert data into, I think we can actually do daily data ingestion under your assumption.
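A sketch of what such a daily ingestion statement could look like (table and column names are hypothetical; the WHERE clause restricts the insert to a single day, so the statement only creates as many dynamic partitions as there are verdictdbblock values, well under the per-statement limit):

```sql
INSERT INTO TABLE orders_scramble PARTITION (verdictdbblock, order_date)
SELECT order_id, amount, verdictdbtier, verdictdbblock, order_date
FROM staging_orders
WHERE order_date = DATE '2019-06-01';
```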

pyongjoo commented 5 years ago

I believe this issue is related to https://github.com/mozafari/verdictdb/issues/302. So, let's see what @commercial-hippie says about his requirements. If he indeed chooses to pursue daily data ingestion, we can handle two issues (https://github.com/mozafari/verdictdb/issues/302 and this) together.

commercial-hippie commented 5 years ago

@pyongjoo I think keeping the partitions would be useful in both cases: when we add new data as well as when we backfill older data.

I do have a modified version of Verdict that I used to test this, but I ran into the Hive error mentioned in my second comment above.

We currently have quite a few partitions, which have sped up non-Verdict queries a lot. I might look into adding only one of the most-used partitions to the Verdict scrambles and increasing "hive.max-partitions-per-writers", instead of keeping all the partitions.
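For anyone else hitting this: as far as I can tell, that writer limit is a Presto Hive connector property set in the catalog properties file rather than via a session SET (the value below is illustrative; I believe the default is 100):

```properties
# etc/catalog/hive.properties on the Presto nodes
hive.max-partitions-per-writers=500
```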

I'll do some testing using my modified version of Verdict and report back here on the results.