commercial-hippie closed this issue 5 years ago
I just ran some more tests on this, and while I think it is a good idea, it looks like it might cause more errors for tables that already have multiple partitions.
https://groups.google.com/forum/#!topic/presto-users/5gFbvUoOF5I
Yes, there are issues with simply increasing the number of partitions, especially with Hive. In fact, that is why VerdictDB internally uses 100 as the default value for the maximum number of partitions in scrambles.
If we want to add data to an existing scramble, we will most likely have to work with the existing partitions.
@dongyoungy Can you check whether the partition count limit is (1) per table, or (2) per insertion or create statement? If the latter is the case, we can still easily support daily data ingestion (under the assumption that the insertion date is the only column on which a table is partitioned).
I found that it is the latter (e.g., a single INSERT ... SELECT ... statement cannot insert into more than hive.exec.max.dynamic.partitions partitions). There seems to be no limit on the number of per-table partitions, even though having too many partitions may cause performance degradation and other problems (people talk about this, but I cannot say for sure, as I have not experienced such cases myself yet).
Thus, as long as we specify which partition we create and insert data into, I think we can actually do daily data integration following your assumption.
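To make the per-statement point concrete, here is a minimal sketch of generating one INSERT per day, each naming its target partition explicitly so that no single statement relies on dynamic partitioning across many dates. The function name, the table and column names, and the `insertion_date` partition column are all hypothetical; this is not VerdictDB code, and it assumes the insertion date is the only partition column.

```python
from datetime import date, timedelta


def daily_insert_statements(scramble, source, columns, start, end,
                            date_column="insertion_date"):
    """Build one Hive INSERT per day, each with a static partition spec.

    All names here are illustrative assumptions, not VerdictDB internals.
    `columns` lists the non-partition columns to copy.
    """
    select_list = ", ".join(columns)
    statements = []
    day = start
    while day <= end:
        value = day.isoformat()
        # The partition value is fixed in the statement itself, so this
        # statement writes to exactly one partition.
        statements.append(
            f"INSERT INTO TABLE {scramble} "
            f"PARTITION ({date_column}='{value}') "
            f"SELECT {select_list} FROM {source} "
            f"WHERE {date_column}='{value}'"
        )
        day += timedelta(days=1)
    return statements
```

Calling it with a three-day range yields three separate statements, so the number of partitions touched per statement stays at one regardless of how many days are backfilled.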
I believe this issue is related to https://github.com/mozafari/verdictdb/issues/302. So, let's see what @commercial-hippie says about his requirements. If he indeed chooses to pursue daily data ingestion, we can handle two issues (https://github.com/mozafari/verdictdb/issues/302 and this) together.
@pyongjoo I think the partitions would be used in both cases: when we add new data as well as when we backfill older data.
I do have a modified version of Verdict that I used to test this, but I ran into the Hive error mentioned in comment 2.
We do have quite a few partitions currently, which have helped speed up non-Verdict queries a lot. I might look into adding only one of the most-used partitions to the Verdict scrambles and increasing "hive.max-partitions-per-writers", instead of having all the partitions.
I'll do some testing using my modified version of Verdict and report back here on the results.
As the title suggests, I think the scrambles should keep the original partitions and just prepend the verdictdbblock partition. Currently it doesn't work that way, but looking at the code, I think that's how it was intended to work?
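As a rough illustration of the proposed behavior, the partition column list for a scramble could be derived as below. This is only a sketch of the idea; the helper name is hypothetical and it does not reflect how VerdictDB actually builds scrambles.

```python
def scramble_partition_columns(original_partitions,
                               block_column="verdictdbblock"):
    """Return the scramble's partition columns: the block column first,
    followed by the original table's partition columns in their order.

    Hypothetical helper for illustration only.
    """
    # Skip the block column if it is somehow already present, so it
    # appears exactly once, at the front.
    kept = [c for c in original_partitions if c != block_column]
    return [block_column] + kept
```

For a table partitioned by `year` and `month`, this gives `verdictdbblock`, `year`, `month`; an unpartitioned table gets only `verdictdbblock`.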