Open apoorvedave1 opened 4 years ago
Hi @apoorvedave1 I am new to hyperspace and keen to contribute.
I was trying to figure out the root cause. I guess, the issue can be traced from this line of code
val usedIndexes = indexes.filter(indexes("indexLocation").isin(getPaths(plan): _*))
indexLocation is updated after an incremental index refresh.
After the update, the indexLocation path was stripped of the version at the end. It was coming out to be the indexRootPath.
Ideally, it should have been IndexRootPath/v__=1
It requires a better understanding of core knowledge. I will try to spend some more time and figure it out. Let me know if you have any pointers about it.
Thanks @rohitashJain , that's very helpful triage.
I think we can make this more robust by stripping the v__=# section completely from the hyperspace.indexes API.
As you pointed out, the behavior of hyperspace.indexes is different before and after refresh. Before refresh we see v__=0 in the "indexLocation" column; after refresh we don't see v__=1 at all.
This looks like a bug. To make this consistent, one option is to fix the private def indexDirPath method.
Problem: indexDirPath returns different outputs before and after an incremental index refresh. When all index files belong to the same directory, it adds v__=# to the returned value. If the index files belong to different directories, it strips away the v__ part.
Possible Solution: To make the behavior appropriate (and consistent across refreshes), we should always strip away the v__= part from the output.
Possible Solution: One way could be to use the spark.conf.get("spark.hyperspace.system.path") config: take the index system path and derive the index root path directly from it. For example:
Case 1: Index Files:
/systempath/myIndex/v__=0/f1
/systempath/myIndex/v__=1/f2
Case 2: Index Files:
/systempath/myIndex/v__=0/f1
Output (same irrespective of index files being in one folder or multiple folders). Return root index folder (strip away version):
/systempath/myIndex/
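The two cases above can be sketched in plain Scala. This is only an illustration of the "always strip the version" idea, not the actual indexDirPath implementation: the object name IndexPaths and the method indexRoot are hypothetical, and the sketch works on path strings rather than on IndexLogEntry. It also returns the root without a trailing slash.

```scala
// Hypothetical helper (not part of Hyperspace): derive the index root
// directory from any index file or directory path by cutting the path
// at its first "v__=<number>" segment.
object IndexPaths {
  // Matches a version directory segment such as "/v__=0",
  // followed by either another "/" or the end of the string.
  private val VersionSegment = """/v__=\d+(/|$)""".r

  def indexRoot(path: String): String =
    VersionSegment.findFirstMatchIn(path) match {
      // Keep everything before the version segment.
      case Some(m) => path.substring(0, m.start)
      // No version segment: treat the path as the root already.
      case None => path.stripSuffix("/")
    }
}
```

With this, files from v__=0 and v__=1 of the same index map to the same root, so the result does not change across refreshes.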
cc @imback82
Then we can move on to the hyperspace.explain API fix:
I guess, the issue can be traced from this line of code
val usedIndexes = indexes.filter(indexes("indexLocation").isin(getPaths(plan): _*))
We can then update getPaths(plan) to return index root paths without the v__ part (and remove duplicates).
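A rough sketch of that getPaths change, assuming the paths scanned by the plan have already been collected into a Seq[String]. The names UsedIndexRoots, stripVersion and rootPaths are hypothetical, and the real getPaths(plan) works on a Spark plan rather than on plain strings:

```scala
// Hypothetical stand-in for the post-processing step of getPaths(plan):
// map every scanned path to its index root and drop duplicates.
object UsedIndexRoots {
  // Matches a version directory segment such as "/v__=1".
  private val VersionSegment = """/v__=\d+(/|$)""".r

  // Cut the path at its first version segment, if it has one.
  private def stripVersion(path: String): String =
    VersionSegment
      .findFirstMatchIn(path)
      .map(m => path.substring(0, m.start))
      .getOrElse(path)

  // Several files of one index (even across versions) collapse to one root.
  def rootPaths(scannedPaths: Seq[String]): Seq[String] =
    scannedPaths.map(stripVersion).distinct
}
```

If the "indexLocation" column also holds version-free root paths, the existing isin filter would then match the same index both before and after a refresh.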
Thanks @rohitashJain, please let me know if I was unclear or if I can help in some way.
@rohitashJain https://github.com/microsoft/hyperspace/issues/251 this one could be fixed first. What do you think?
@apoorvedave1 Looks good to me. As you suggested, we would need to fix two pieces.
But can we fix only private def indexDirPath(entry: IndexLogEntry), so that it always returns the latestVersionPath?
Actually, it's about optics. In the hyperspace.indexes dataframe, the "indexLocation" column gives us the location of the index. The question is whether it should be the indexRootDirectory or the actual versioned location of the index. If we go with indexRootDirectory, should we also add another field to the IndexSummary case class to carry the version information? I think the version information is sometimes helpful in debugging, and it would be good if we could retain it.
I am interested in contributing to this bug. Let me know your thoughts, I can make changes accordingly.
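To make the "root path plus version field" option concrete, here is a small sketch. IndexLocationInfo and IndexLocations are hypothetical names used only for illustration; this is not the existing IndexSummary class, and the real change might look quite different:

```scala
// Hypothetical shape: keep the version-free root for matching,
// and the version numbers separately for debugging.
final case class IndexLocationInfo(rootPath: String, versions: Seq[Int])

object IndexLocations {
  // Captures the root before "/v__=<n>" and the version number <n>.
  private val Versioned = """^(.*)/v__=(\d+)(?:/.*)?$""".r

  // Build the summary from index file paths; returns None when
  // no path contains a version directory.
  def fromFiles(files: Seq[String]): Option[IndexLocationInfo] = {
    val parsed = files.collect { case Versioned(root, v) => (root, v.toInt) }
    parsed.headOption.map { case (root, _) =>
      val versions = parsed.filter(_._1 == root).map(_._2).distinct.sorted
      IndexLocationInfo(root, versions)
    }
  }
}
```

This way "indexLocation" could stay stable across refreshes while the versions in use remain visible.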
Describe the issue
Index is used in the query and explain output shows modification correctly. But "Index Used" section is blank.
To Reproduce
Output
Expected behavior
Environment