microsoft / hyperspace

An open source indexing subsystem that brings index-based query acceleration to Apache Spark™ and big data workloads.
https://aka.ms/hyperspace
Apache License 2.0
424 stars 115 forks source link

Add dev config for nested column support #466

Closed sezruby closed 3 years ago

sezruby commented 3 years ago

What is the context for this pull request?

What changes were proposed in this pull request?

Disable index creation with nested columns until it's fully supported.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit test

sezruby commented 3 years ago

@andrei-ionescu I created a temp config for disabling nested column support (for v0.5 release). We could remove the config when it's fully supported. Could you have a look at this?

andrei-ionescu commented 3 years ago

@sezruby Please provide a clear explanation of the reason to add this flag. Is there a technical reason to add it and have this "nested column based index creation feature" disabled by default?

In the project are other features that are partially supported and I don't understand why this is different and needs to have this new flag.

By the way, adding this flag cannot stop one to set it and have the nested column based index creation enabled.

One more thing, if such a nested column based index is created, when they'll try to use it, it will not be applied. The results will be correct and impact is only that performance wise Hyperspace does not provide enough performance in this case. The user may just open some tickets in regards to it.

On the other hand, I offered my time to work on adding the full nested field support into Hyperspace for v0.5 but due to different priorities and lack of time from your team, the nested field support had to be postponed and I agreed with it, but with this PR I'm under the impression that this feature is in your way and needs to be removed (more work from my side to put it back). Maybe I'm getting it wrong and please correct me if so.

@rapoth, 1) Do you have some more info on this decision? 2) When should we resume the nested field support in Hyperspace? Do you have a time frame?

sezruby commented 3 years ago

@andrei-ionescu Sorry for the lack of explanation. We need to release v0.5 this month to deliver some bug fixes. Without the config, customers may create indexes with nested columns and ask why it's not applied. To avoid unnecessary time consuming analysis, (it's just for quick failure), I introduced this config to deliver 0.5 without those kind of issues.

It's ok to deliver incomplete PRs without the temp config while developing, but for release, we need to disable (or exclude) incomplete features to avoid unnecessary issue handling.

It's a temporal / hidden config, so I don't expect anyone will use the config.

andrei-ionescu commented 3 years ago

@sezruby The feature of creating an index over a nested field is also hidden, the users may not use it at all. And this is the way Hyperspace did work before I added it.

Before adding this functionality I remember that there hasn't been any error specific to the fact that nested fields are not supported - it just did not work.

From my perspective with or without it Hyperspace works the same as before.

So, let me see if I got right what you're saying:

  1. There is no technical issue forcing you do this.
  2. You're afraid that on your products the customers will possibly use this feature and wonder why does not work when querying.

My questions are:

1) How was it in version 0.4 and why keeping the feature as it is in 0.5 is worse? 2) Why do you want the customers to be locked in the fact that Hyperspace does not support indexes on nested fields? Is this nested field support totally removed from Hyperspace roadmap? 3) Can "nested field support" get into v0.5.0? 4) If "not" is the answer to question 3, then is it v0.6.0 and what's the timeframe for it?

@rapoth, @imback82: I may need you input here are there are some product related questions that I would like to get some answers. Thank you.

sezruby commented 3 years ago

1=> In version 0.4

scala> hs.createIndex(dff, IndexConfig("nestedIndex", Seq("nestedC"), Seq("hsh")))
// succeeded

scala> hs.createIndex(dff, IndexConfig("nestedIndex", Seq("nestedC.id"), Seq("hsh")))
com.microsoft.hyperspace.HyperspaceException: Index config is not applicable to dataframe schema.
  at com.microsoft.hyperspace.actions.CreateAction.validate(CreateAction.scala:53)
  at com.microsoft.hyperspace.actions.Action$class.run(Action.scala:88)
  at com.microsoft.hyperspace.actions.CreateAction.run(CreateAction.scala:30)
  at com.microsoft.hyperspace.index.IndexCollectionManager.create(IndexCollectionManager.scala:43)
  at com.microsoft.hyperspace.index.CachingIndexCollectionManager.create(CachingIndexCollectionManager.scala:77)
  at com.microsoft.hyperspace.Hyperspace.createIndex(Hyperspace.scala:42)
  ... 54 elided

// the index with "nestedC" can be applied
scala> val filter = dff.filter($"nestedC".isNotNull).select("hsh", "nestedC")
filter: org.apache.spark.sql.DataFrame = [hsh: string, nestedC: struct<id: string, leaf: struct<id: string, cnt: int>>]

scala> filter.queryExecution.optimizedPlan
res4: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
Project [hsh#13, nestedC#17]
+- Filter isnotnull(nestedC#17)
   +- Relation[hsh#13,nestedC#17] Hyperspace(Type: CI, Name: nestedIndex, LogVersion: 1)

in current master:

scala> hs.createIndex(dff, IndexConfig("nestedIndex2", Seq("nestedC"), Seq("hsh")))
scala.MatchError: List() (of class scala.collection.immutable.Nil$)
  at com.microsoft.hyperspace.util.ResolverUtils$.com$microsoft$hyperspace$util$ResolverUtils$$getColumnNameFromSchema(ResolverUtils.scala:215)
  at com.microsoft.hyperspace.util.ResolverUtils$.com$microsoft$hyperspace$util$ResolverUtils$$getColumnNameFromSchema(ResolverUtils.scala:220)
  at com.microsoft.hyperspace.util.ResolverUtils$$anonfun$1$$anonfun$apply$2.apply(ResolverUtils.scala:174)
  at com.microsoft.hyperspace.util.ResolverUtils$$anonfun$1$$anonfun$apply$2.apply(ResolverUtils.scala:171)
  at scala.Option.map(Option.scala:146)
  at com.microsoft.hyperspace.util.ResolverUtils$$anonfun$1.apply(ResolverUtils.scala:171)
  at com.microsoft.hyperspace.util.ResolverUtils$$anonfun$1.apply(ResolverUtils.scala:168)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.immutable.List.foreach(List.scala:392)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.immutable.List.map(List.scala:296)
  at com.microsoft.hyperspace.util.ResolverUtils$.resolve(ResolverUtils.scala:168)
  at com.microsoft.hyperspace.actions.CreateAction.isValidIndexSchema(CreateAction.scala:77)
  at com.microsoft.hyperspace.actions.CreateAction.validate(CreateAction.scala:60)
  at com.microsoft.hyperspace.actions.Action$class.run(Action.scala:89)
  at com.microsoft.hyperspace.actions.CreateAction.run(CreateAction.scala:29)
  at com.microsoft.hyperspace.index.IndexCollectionManager.create(IndexCollectionManager.scala:47)
  at com.microsoft.hyperspace.index.CachingIndexCollectionManager.create(CachingIndexCollectionManager.scala:78)
  at com.microsoft.hyperspace.Hyperspace$$anonfun$createIndex$1.apply$mcV$sp(Hyperspace.scala:45)
  at com.microsoft.hyperspace.Hyperspace.withHyperspaceRuleDisabled(Hyperspace.scala:188)
  at com.microsoft.hyperspace.Hyperspace.createIndex(Hyperspace.scala:44)
  ... 54 elided

scala> hs.createIndex(dff, IndexConfig("nestedIndex2", Seq("nestedC.id"), Seq("hsh")))
=> succeeded

with this fix:

cala> hs.createIndex(dff, IndexConfig("nested3", Seq("nestedC"), Seq("hsh")))
scala.MatchError: List() (of class scala.collection.immutable.Nil$)
  at com.microsoft.hyperspace.util.ResolverUtils$.com$microsoft$hyperspace$util$ResolverUtils$$getColumnNameFromSchema(ResolverUtils.scala:215)
  at com.microsoft.hyperspace.util.ResolverUtils$.com$microsoft$hyperspace$util$ResolverUtils$$getColumnNameFromSchema(ResolverUtils.scala:220)
  at com.microsoft.hyperspace.util.ResolverUtils$$anonfun$1$$anonfun$apply$2.apply(ResolverUtils.scala:174)
  at com.microsoft.hyperspace.util.ResolverUtils$$anonfun$1$$anonfun$apply$2.apply(ResolverUtils.scala:171)
  at scala.Option.map(Option.scala:146)
  at com.microsoft.hyperspace.util.ResolverUtils$$anonfun$1.apply(ResolverUtils.scala:171)
  at com.microsoft.hyperspace.util.ResolverUtils$$anonfun$1.apply(ResolverUtils.scala:168)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.immutable.List.foreach(List.scala:392)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.immutable.List.map(List.scala:296)
  at com.microsoft.hyperspace.util.ResolverUtils$.resolve(ResolverUtils.scala:168)
  at com.microsoft.hyperspace.actions.CreateAction.validate(CreateAction.scala:63)
  at com.microsoft.hyperspace.actions.Action$class.run(Action.scala:89)
  at com.microsoft.hyperspace.actions.CreateAction.run(CreateAction.scala:29)
  at com.microsoft.hyperspace.index.IndexCollectionManager.create(IndexCollectionManager.scala:47)
  at com.microsoft.hyperspace.index.CachingIndexCollectionManager.create(CachingIndexCollectionManager.scala:81)
  at com.microsoft.hyperspace.Hyperspace$$anonfun$createIndex$1.apply$mcV$sp(Hyperspace.scala:45)
  at com.microsoft.hyperspace.Hyperspace.withHyperspaceRuleDisabled(Hyperspace.scala:177)
  at com.microsoft.hyperspace.Hyperspace.createIndex(Hyperspace.scala:44)
  ... 55 elided

scala> hs.createIndex(dff, IndexConfig("nested3", Seq("nestedC.id"), Seq("hsh")))
com.microsoft.hyperspace.HyperspaceException: Hyperspace does not support nested columns.
  at com.microsoft.hyperspace.actions.CreateAction.validate(CreateAction.scala:71)
  at com.microsoft.hyperspace.actions.Action$class.run(Action.scala:89)
  at com.microsoft.hyperspace.actions.CreateAction.run(CreateAction.scala:29)
  at com.microsoft.hyperspace.index.IndexCollectionManager.create(IndexCollectionManager.scala:47)
  at com.microsoft.hyperspace.index.CachingIndexCollectionManager.create(CachingIndexCollectionManager.scala:81)
  at com.microsoft.hyperspace.Hyperspace$$anonfun$createIndex$1.apply$mcV$sp(Hyperspace.scala:45)
  at com.microsoft.hyperspace.Hyperspace.withHyperspaceRuleDisabled(Hyperspace.scala:177)
  at com.microsoft.hyperspace.Hyperspace.createIndex(Hyperspace.scala:44)
  ... 54 elided

From this result, I'll

IMO quick failure and a temp config for incomplete / unstable features are a kind of best practice for development. There's no reason we should keep the bad practice.

2=> I would have liked to put this PR, because

  1. successfully created (but useless) indexes might take much amount of data in the storage.
  2. successfully created indexes may give the impression that Hyperspace can support it.

This PR doesn't mean that Hyperspace won't support nested field at all in the future, and there's no such decision between our team. This is a quick fix for release, just because I strongly thought that we should disable incomplete features before release to avoid any confusion.

It was my mistake that I didn't inform you first before this PR and give enough information about it. This is the first time I work with outside of company and also open source community, there's still a bit of trial and error.

3=> Though our refactoring is almost done, still there're some remaining works. (@clee704 please correct me if I'm wrong) As we should release the version by this month, I'm not sure we could make it. v0.6 would be safe.

4=> There's no specific time frame but Data Skipping index + another type of index will be the included. It would be the end of July or August, from my expectation.

@rapoth and @imback82 are OOO until next week / next month, so there will be some delay on response.

clee704 commented 3 years ago

Having a feature toggle for a non-trivial under-tested new feature is a good thing. Maybe I should make one for DataSkippingIndex, too 😄

The PR message can be reworded like "Add a feature toggle for nested columns". I think we can drop "temp" in the config key, and make the feature fully pluggable at will. Or maybe "dev" instead of "temp".

andrei-ionescu commented 3 years ago

@sezruby I need a clear time frame when to get back working on the nested field feature. I need to plan my time as I also have other things on my plate and I would like as much as possible to stick to that time frame.

clee704 commented 3 years ago

As @clee704 noted, the same approach should be to any big feature regardless if is developed by your internal team or by someone from the open source community.

I'm not sure what do you mean by this. There is no half-implemented feature on the master branch except this. I'm going to add a feature toggle for DataSkippingIndex which is not merged yet, but you are against using a feature toggle. Can you explain why?

andrei-ionescu commented 3 years ago

@clee704

There is no half-implemented feature on the master branch except this

You guys have been merging different PRs with refactor in master that address a part of a future feature. For me that is a partial implemented feature. I don't want to go into this discussion.

I'm not sure what do you mean by this.

What I wanted to say is that I'm glad that you come with this idea of having a unified approach when developing big features.

but you are against using a feature toggle. Can you explain why?

It is not that I'm against a flag. It is just the fact that there is a simpler approach to disable it.

For the "nested feature" the simplest thing to do is to directly throw if a nested field is detected and disable the tests. When I'll resume developing the feature I'll remove the throw lines and enable the tests. Doing it with a flag all require do some more extra cleanup.

clee704 commented 3 years ago

You guys have been merging different PRs with refactor in master that address a part of a future feature. For me that is a partial implemented feature. I don't want to go into this discussion.

Refactoring is not a feature. A feature provides new functionality to users. Refactoring does not.

clee704 commented 3 years ago

Currently we're not doing it but disabling the tests can drop the code coverage and it would be an issue if we are going to add coverage check to CI. Having a non-tested, unused code sitting around in the codebase doesn't feel right.

But if you insist then I'm fine and I'll leave it up to @sezruby.

andrei-ionescu commented 3 years ago

@clee704

I already said that I prefer not to get into this discussion but now that you mentioned it I'll give some feedback.

You know the phrase: "don't fix it if it works!"

In the majority of the cases refactoring happens due to new features that needs to be added. So, with this being said, any refactor is a partial feature, at least from my perspective.

sezruby commented 3 years ago

I'd like to keep the dev config approach as I don't want to break the tests without knowing it. @clee704 Could you approve the change again? it's dismissed due to the merge conflict.

@andrei-ionescu please note that @clee704 and I are NOT the kind of person who doesn't fix the code just because it's working. @imback82 has even higher standards.

BTW WHY ARE YOU ARGUING WITH US? I don't understand what you're doing here. We made enough justification, but you're still complaining about it and blaming us. If you think this's the first time using disabled config in this project, I insist that I've done the rule refactoring with a temporal "disabled" package & tests for it, and enabled after it's fully delivered. Anyway this is not a helpful discussion at all.

clee704 commented 3 years ago

I forgot to add: the nested column support is only relevant to CoveringIndex. Other index types might support it without any change to the framework. For example, DataSkippingIndex can support nested columns out of the box because it doesn't directly store the column values in the index. Therefore, a more correct fix would be blocking nested columns in CoveringIndex. But this is a minor issue as currently we only have CoveringIndex now. This is just something to remember when we continue to work on the nested column support later.

andrei-ionescu commented 3 years ago

@sezruby I'm not arguing with you. I just propose different solutions than what you already did choose to do and I try to compare the differences between your proposal and mine and build a case for my proposal. The fact that it is in contradiction with what you are planning is in fact part of the development itself.

@andrei-ionescu please note that @clee704 and I are NOT the kind of person who doesn't fix the code just because it's working. @imback82 has even higher standards.

I never did think of you guys having low standards. On the contrary, my impression is that you have pretty high standards. I'm used to work in all kinds of projects (start-up, research, PoCs, production level, etc) and I'm comfortable in any of them. From my experience there should be a balance in regards the standards too, as these may slow the development in the early stages.

There are things that I see that you can improve in regards to doing open development. The biggest thing that in my opinion is hindering the open development is that there is a "double standard development" -- what's on your direct interests is treated differently than other things. This is understandable up to a point but after that it drives developers off the project. You need to find the balance.

There are other things as well in regards to open development but I'll stop here. If you want to discuss about these open dev things we can have a separate context.

andrei-ionescu commented 3 years ago

@sezruby I need a clear time frame when to get back working on the nested field feature. I need to plan my time as I also have other things on my plate and I would like as much as possible to stick to that time frame.

I'm getting back to my ask which didn't get an answer yet. Can you provide a time frame for me so that I can plan my time accordingly?

sezruby commented 3 years ago

I'm getting back to my ask which didn't get an answer yet. Can you provide a time frame for me so that I can plan my time accordingly?

I'm not sure but you can resume after merging #461 - early or mid July. As it also has some refactoring change and another refactoring change might be required for different index type support.