Open andrei-ionescu opened 3 years ago
BTW thanks for the great work! and sorry for the delay in reviewing.. 🐢🦥 I'll try to do the review asap.
Could you add logical/spark plan change in the PR description? I'd like to see the plan of join query + filter index for one child. e.g.
val filter = df.filter(..).select("nested.field") // index applied
val filter2 = df.filter(...) // index is not applied for this filter
val join = filter.join(filter2, ...)
@sezruby Thanks for reviewing this PR. I added some more code to support nested fields in joins and all hybrid scans.
I'll create start integrating your feedback.
@imback82, @sezruby Can you have a look why most of the tests are are failing with the error bellow?
java.lang.IllegalArgumentException: Unsupported class file major version 55
[info] at org.apache.xbean.asm6.ClassReader.<init>(ClassReader.java:166)
[info] at org.apache.xbean.asm6.ClassReader.<init>(ClassReader.java:148)
[info] at org.apache.xbean.asm6.ClassReader.<init>(ClassReader.java:136)
[info] at org.apache.xbean.asm6.ClassReader.<init>(ClassReader.java:237)
[info] at org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:49)
[info] at org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:517)
[info] at org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:500)
[info] at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
[info] at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:134)
[info] at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:134)
It seems that the failures started appearing 11 days ago.
Thanks for reporting this. Build nodes must have changed. @sezruby do you have time to take a look? Otherwise, I will get to this toward the end of this week. We can rely on Scala 2.12 pipeline meanwhile.
@imback82 @sezruby
First, thanks for fixing the build issue for Scala 2.11.
I integrated the feedback and rebased upon master. Please have a second round of PR review. Thanks!
@andrei-ionescu I know this change all comes together, but could you have split this PR into small PRs by any chance? Since it's bit hard to get these in at a time with some proper reviews.
You can keep this PR for reference and I suggest following 3 PRs:
BTW thanks for the plan update :) seems it works well as expected 👍
@sezruby In regards to review effort, I won't help too much because creation and refresh parts are very small and the other two are equally in code change.
I did tackled the feature like this:
I don't understand why you prefer separate PRs instead of separate commits - you can review each commit separately. Do you want to merge them separately? If so, wouldn't Hyperspace be in inconsistent state feature wise?
@andrei-ionescu I have the same experience when I was working on Hybrid Scan last year - #123 (one big PR, but rejected 🙅) and I split the PR into several PRs (#150). And please refer https://github.com/google/eng-practices/blob/master/review/developer/small-cls.md.
I don't understand why you prefer separate PRs instead of separate commits - you can review each commit separately.
Since these commits become 1 commit after merging to master branch.
Do you want to merge them separately? If so, wouldn't Hyperspace be in inconsistent state feature wise?
It's fine as long as they don't break the build & test.
Create & refresh can be small but it requires some utility functions & test setup.
We could also split filter & join into 2 PRs as it includes many lines of tests.
@sezruby The only advantage for separate PRs is if you specifically intend to merge them separately.
For better reading and understanding of the code, we can still keep a single PR but with more commits.
For me building separate PRs means a lot of extra work as I need to keep al those branches up to date through constant rebases and link them one on top of the other. There are multiple changes in the same file and any rebase will result in as many more times of conflict resolving as the branches I create for these PRs.
I would like not to go through this ordeal.
@imback82, @rapoth WDYT?
BTW: What does "CL" stand for? I've seen it in the developer docs but never explained clearly.
+1 for smaller PRs if possible. For this PR, I think it would make sense to split as @sezruby suggested above since they seem easily splittable?
The rationale behind the smaller PRs is described in the doc shared above. Especially, when we have a PR that's +2500 lines long, it's really hard for reviewers to review quickly and correctly. (and reviewing by commits doesn't seem possible when each commit is also big)
For me building separate PRs means a lot of extra work as I need to keep al those branches up to date through constant rebases and link them one on top of the other. There are multiple changes in the same file and any rebase will result in as many more times of conflict resolving as the branches I create for these PRs.
Since you have this "reference" PR, you can just create a PR one by one instead of creating all of them at once? In this way, you don't need to keep many branches up to date.
I would like not to go through this ordeal.
Thanks for working on this great feature. I think smaller PRs will help both sides to iterate quicker. But if you think it's not feasible, you can still keep this PR as one, but please bear with us if it takes a very long time to get it reviewed/merged. Thanks!
(I believe CL stands for change list)?
@imback82 Thanks for your opinion.
If you can guarantee that it will take less time and a better team focus to get everything merged as small PRs than this big PR, then I can see the its advantage. Otherwise it's just extra work on my side.
BTW, It is nowhere specified in any docs that in Hyperspace, one should be using Google's Small CLS engineering protocols.
If you can guarantee that it will take less time and a better team focus to get everything merged as small PRs than this big PR, then I can see the its advantage. Otherwise it's just extra work on my side.
Yes, I think it will definitely accelerate the review process. @sezruby, do you also agree after doing reviews on this PR?
BTW, It is nowhere specified in any docs that in Hyperspace, one should be using Google's Small CLS engineering protocols.
We have been referring to the doc internally if there is a conflict in the discussion. We can make it a formal process if needed; note that we are still improving the process in this repo as needed/required.
Yea as most of functionality is validated in this PR, we can do small PRs only for merging them partially. It's too long scrooooll and got distracted in the middle of reviewing :D ..
@andrei-ionescu with this PR, you won't be able to get the review from @imback82 this year lol
We have been referring to the doc internally if there is a conflict in the discussion. We can make it a formal process if needed; note that we are still improving the process in this repo as needed/required.
I'm just saying that I would prefer to know it before hand and it would have been very useful to me to have it in the dev process docs. I would have started with it in mind and would ease up may work load. Keeping in mind that bigger features are going to come and that you desire the Hyperspace community to grow, It would be very helpful to have it stated in your docs for further big chunks of code changes.
@imback82, @sezruby: To set this clear so I wouldn't get through this process of "re-creating PRs again and again", look at the the following set of PRs that I want to create for this feature:
Q1: Can we agree on the PRs lists above? Q2: Do I have your agreement that you'll focus on reviewing each of them?
@andrei-ionescu Q1: Looks good Q2: Yep, but 1 ~ 2 PRs (at a time) would be good; just don't create all PRs at first.
Keeping in mind that bigger features are going to come and that you desire the Hyperspace community to grow, It would be very helpful to have it stated in your docs for further big chunks of code changes.
Sure, I will update the contribution guide.
Q1: Can we agree on the PRs lists above?
The list looks reasonable to me.
Q2: Do I have your agreement that you'll focus on reviewing each of them?
Sorry about reviewing the PRs late. We were not expecting a big feature from the external contributors, so I acknowledge that we didn't allocate resources correctly. We will try to fix this moving forward. And to answer your concern, yes, once the PRs are in a manageable size, we will review them as soon as we can. I expect you create these PRs one by one, and not all at once?
@imback82 I will create them as my time allows. If I'll be able to create all of them at once then I'll do it like that. What I can say is that I'll start creating them in the agreed order. I'll ping you guys on each of them as they land.
What is the context for this pull request?
What changes were proposed in this pull request?
This PR adds support for indexing over nested fields (ie: structs).
The first commit is adding support for building the index over a nested (struct) field and support for modifying the search query to properly use that index. It has suport for hybrid scans for both append and delete files in the hybrid scan context.
The second commit will address the join use case.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Existing unit and integration tests for not breaking the existing functionalities. Unit and integration test added for the new functionalities.
Search queries
The following search queries use a dataset with the following schema:
Files Appended
The optimized plan
The Spark plan
Files Deleted
The optimized plan
The Spark plan
Join queries
The following join queries will have a dataset a bit different from the one at the beginning. The following are extracted from the
HybridScanForNestedFieldsTest
tests.Join append only
Original plan
Altered optimized plan
Altered Spark plan
Delete files
Original plan
Altered optimized plan
Altered Spark plan
Append + Delete
Original plan
Altered optimized plan
Altere Spark plan