Open pirz opened 3 years ago
@imback82 @apoorvedave1 @sezruby @rapoth Can you take a look at this proposal and leave your comments (if any)? Thnx
createIndex(df, mode = "include", columns = Seq("C1", "C2")): Pick {"C1", "C2"} as the included columns.
The current createIndex API takes in IndexConfig. Can you update the examples with indexed columns as well?
For source data files, we need to use Parquet's mergeSchema option, as the data could have columns not listed as indexed or included columns.
What if source data files are non-Parquet?
The current createIndex API takes in IndexConfig. Can you update the examples with indexed columns as well?
The examples are updated by adding IndexConfig.
@imback82 I updated the proposal to handle schema in source data files. Specifically I added a new section: "Index Creation and Refresh changes" and modified "Index Metadata Changes". Plz take a look and let me know about your comments. Thnx!
Generally LGTM! How about includeColumns / excludeColumns rather than mode + columns?
I wonder if incremental refresh will work well with the previous data. Could you create a prototype PR?
def refreshIndex(
    indexName: String,
    mode: String,
    includedColumns: IncludedColumns, // optional, used if a change is needed in the current included columns
    schema: StructType // optional, used if includedColumns.include is non-empty
)
Could includedColumns + schema be a single includedColumnSchema (or just includedColumns)? Since schema also includes the column names.
And I think it's okay not to support optimizeIndex for changed schema at first, to avoid complexity; we could do it later if it's needed.
Problem Description
This design proposal is for adding feature request #229.
Currently, Hyperspace supports creating indexes only on data with a fixed schema. This makes it impossible to create an index on data with an evolving schema.
The "no change" restriction on "indexed" columns is unavoidable, as index records are bucketized and sorted on them. However, "included" columns are purely payload data and do not affect the physical layout of index files, so the "no change" restriction on them can be lifted.
There are two cases to consider for schema changes:
Goal
The user should be able to create/refresh an index on data with evolving schema with ease.
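As a concrete sketch of that goal, the flow below uses stubbed stand-ins: all names and signatures are assumptions modeled on the discussion in this thread (IncludedColumns, the proposed refreshIndex arguments), not the real Hyperspace API.

```scala
// Stand-in stubs illustrating the intended user experience with an
// evolving schema; NOT the real Hyperspace implementation.
case class IncludedColumns(
    include: Seq[String] = Seq.empty, // columns to add as included columns
    exclude: Seq[String] = Seq.empty) // columns to drop from included columns

// Minimal in-memory "index": indexed columns plus the current included set.
final case class StubIndex(indexed: Seq[String], included: Seq[String])

def createIndex(dataSchema: Seq[String], indexed: Seq[String], cols: IncludedColumns): StubIndex = {
  val included =
    if (cols.include.nonEmpty) cols.include
    else dataSchema.filterNot(indexed.contains).filterNot(cols.exclude.contains)
  StubIndex(indexed, included)
}

def refreshIndex(index: StubIndex, cols: IncludedColumns): StubIndex =
  index.copy(included =
    (index.included ++ cols.include).distinct.filterNot(cols.exclude.contains))

// Day 1: source data has columns C0..C2; index C0 and include C1.
val v1 = createIndex(Seq("C0", "C1", "C2"), Seq("C0"), IncludedColumns(include = Seq("C1")))

// Later: the source gained a new column C3; on refresh, pull C3 into the
// index and drop C1, without recreating the index from scratch.
val v2 = refreshIndex(v1, IncludedColumns(include = Seq("C3"), exclude = Seq("C1")))

println(v1.included) // List(C1)
println(v2.included) // List(C3)
```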
General Assumptions
The data type of a column does not change across source data files. (For example, if a column's data type is String in one set of records and Int in other records, then trying to load all these records into a DataFrame using the "mergeSchema" option makes ParquetReader fail with the error: Failed to merge incompatible data types string and int.)
If a column is missing from some records, its value is read as null for those records.
Solution Overview
Requirements
The createIndex and refreshIndex APIs should let the user define "included" column(s) easily, by including or excluding columns from the data schema.
API changes
We make changes to the createIndex and refreshIndex APIs to let the user provide information about included columns.
Changes in IndexConfig
We modify IndexConfig so that, instead of receiving a single list of included columns, it receives an instance of IncludedColumns, which contains two lists of columns: "include" and "exclude". These specify which columns to include or exclude, relative to a known reference schema, when defining the index's included columns.
Changes in createIndex API
The createIndex API remains unchanged; however, the list of included columns is now computed according to the above change to IndexConfig. During index creation, we use the schema of the user's given DataFrame df as the reference schema, and compute the included columns by applying the include/exclude lists from IndexConfig.includedColumns to it.
As an example, assume the user wants to create an index on a DataFrame df whose schema has 5 columns, {"C0", "C1", "C2", "C3", "C4"}, and "C0" is picked as the indexed column. Below is how IndexConfig.includedColumns can be used to define different sets of included columns:
- Pick {"C1", "C2"} as the included columns.
- Exclude {"C1", "C2"} from the data schema and pick the remaining columns, {"C3", "C4"}, as the included columns.
- Pick all of {"C1", "C2", "C3", "C4"} as the included columns.
We use the above IncludedColumns instance (i.e. cols) for index creation. Note that if lineage is enabled, we make a minor change to the way the index schema is extended:
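The three example configurations above could be sketched as follows. This is an illustrative stand-in, not the real Hyperspace code: resolveIncluded is a hypothetical helper showing how the include/exclude lists might be applied against the reference schema (here, df's schema).

```scala
// Proposed shape of the included-columns config (names follow this proposal).
case class IncludedColumns(
    include: Seq[String] = Seq.empty,
    exclude: Seq[String] = Seq.empty)

// Hypothetical resolution of the final included columns: an explicit
// "include" list wins; otherwise start from the reference schema minus
// the indexed columns, then drop the "exclude" list.
def resolveIncluded(
    referenceSchema: Seq[String],
    indexedColumns: Seq[String],
    cols: IncludedColumns): Seq[String] = {
  val base =
    if (cols.include.nonEmpty) cols.include
    else referenceSchema.filterNot(indexedColumns.contains)
  base.filterNot(cols.exclude.contains)
}

val schema = Seq("C0", "C1", "C2", "C3", "C4")
val indexed = Seq("C0")

// 1. Pick {"C1", "C2"} as the included columns.
val a = resolveIncluded(schema, indexed, IncludedColumns(include = Seq("C1", "C2")))
// 2. Exclude {"C1", "C2"}; the remaining {"C3", "C4"} become the included columns.
val b = resolveIncluded(schema, indexed, IncludedColumns(exclude = Seq("C1", "C2")))
// 3. Neither list set: all non-indexed columns become the included columns.
val c = resolveIncluded(schema, indexed, IncludedColumns())

println(a) // List(C1, C2)
println(b) // List(C3, C4)
println(c) // List(C1, C2, C3, C4)
```

The exact precedence between "include" and "exclude" is a design choice left open by the proposal; the sketch assumes an explicit "include" list overrides the schema-derived default.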
Changes in refreshIndex API
During index refresh, Hyperspace has to create a DataFrame, df, on the source data files that need to be indexed. We use the schema(schema: StructType) API in DataFrameReader to define the schema for df. Note that we do not need the full source data schema here, but only the columns relevant to the index, i.e. the columns to be indexed or included during refresh.
During refresh, the user can use an instance of IncludedColumns (defined above) to modify the existing included columns.
If there is no change in the included columns, or if the user only removes some existing included column(s), we can simply use the index schema from the latest index version as the DataFrame schema when creating df.
If the user adds new columns to the included columns, we need to add those columns to the df schema. For Parquet and JSON, this is straightforward as they are self-describing formats. However, for CSV, the schema should be correctly extended with the name and data type of each new included column.
We add two "optional" arguments, includedColumns and schema, to the refreshIndex API to address the above:
- includedColumns defines changes in the included columns: any new column in includedColumns.include is treated as a new included column and will be added to index records created during refresh; any column in includedColumns.exclude will be removed from the list of included columns, and index records created during refresh will not have it.
- schema is used when there are new columns in includedColumns.include; it defines the data type for each of those columns. We compute the schema for df by merging this schema with the latest index schema, minus the excluded columns defined in IncludedColumns.exclude (if any).
Index Metadata Changes
Index metadata is updated so that:
- CoveringIndex.Properties.schemaString captures the "merged" schema of all valid index files.
- CoveringIndex.Properties.Columns.included captures the "latest" set of included columns.
Note that there could be some column(s) in schemaString which are not among the included columns. This can happen if the user adds a column as an included column in some early version of the index and then drops it when refreshing the index in incremental mode.
Moreover, we no longer need to store the source data schema under source.plan.properties.relations.head.dataSchemaJson. Currently, this schema is only used to define the df schema when refreshing. However, with the changes to the refreshIndex API explained above, this schema can now be computed from the index's schemaString and included columns, plus the new schema argument added to refreshIndex.
Index leverage at query optimization/execution
Given that an index schema can now be changed by adding/removing included columns, index Parquet files are no longer guaranteed to have the same schema. One option for loading them into a DataFrame is the mergeSchema option; however, as this option is costly, we avoid it and instead load index files by providing the merged schema from the index metadata. This can be done by setting schema(<merged schema from metadata>) when loading index content into a DataFrame. Existing rules also set this merged schema as the relation schema when replacing a data source with an index.
Impact on Index Optimization
As the set of included columns can be changed when refreshing an index, index records that belong to the same bucket could have different schemas. This affects OptimizeAction, as it merges separate smaller Parquet index files, whose content belongs to the same bucket, into a single large Parquet index file. However, a given Parquet file can only have a single schema. Therefore, we need to change the OptimizeAction code to check for and avoid merging index files with heterogeneous schemas.
There are two approaches for fixing OptimizeAction:
- Merge files with heterogeneous schemas under the merged schema, putting null values in records for missing columns.
- Only merge files that share the same schema, so that no null value is generated due to merging heterogeneous schemas.
Example Scenario
Here is an example scenario showing how index metadata and index files would be changed as an index goes through refresh and optimization while both source data and included columns are changed.
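Purely as an illustration of the metadata rules above (hypothetical column names and types; not the actual scenario tables from the proposal), the evolution of schemaString and included across a refresh might look like:

```scala
// Illustrative sketch of how CoveringIndex.Properties.schemaString (the
// merged schema of all valid index files) and ...Columns.included (the
// latest included set) evolve across refreshes. Plain (name -> type) maps
// stand in for Spark's StructType; all names and types are hypothetical.
type Schema = Map[String, String]

// v1: index on C0 with included column C1.
val v1Schema: Schema = Map("C0" -> "int", "C1" -> "string")
val v1Included = Seq("C1")

// v2 (incremental refresh): add C2 as an included column and drop C1.
// New index files written during this refresh have schema {C0, C2}.
val v2FilesSchema: Schema = Map("C0" -> "int", "C2" -> "double")
val v2Included = Seq("C2")

// Metadata after v2: schemaString is the merge over ALL valid index files
// (v1's files still contain C1), while "included" is only the latest set.
val mergedSchemaString: Schema = v1Schema ++ v2FilesSchema

println(mergedSchemaString.keys.toList.sorted) // List(C0, C1, C2)
println(v2Included)                            // List(C2)
// C1 appears in schemaString but not among the included columns: exactly
// the situation called out under "Index Metadata Changes".
```

This also shows why index files of the same bucket can disagree on schema (v1 files carry C1, v2 files carry C2), which is what OptimizeAction must account for.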
Implementation
IndexConfig and modify createIndex