locationtech / geowave

GeoWave provides geospatial and temporal indexing on top of Accumulo, HBase, BigTable, Cassandra, Kudu, Redis, RocksDB, and DynamoDB.
Apache License 2.0

Import/Export capability #353

Closed chrisbennight closed 8 years ago

chrisbennight commented 9 years ago

This capability might be part of GeoWave itself, or it might be rolled into a separate project.

The base need is the ability to export a GeoWave dataset (or a subset of a dataset) to a single file, and to import that same file back into GeoWave.

The use case here is two-fold:

  1. Provide a mechanism to back up GeoWave data sets
    • This means that by default (i.e. no additional arguments) an export -> import cycle should result in a data set that's functionally and semantically equivalent to the original
  2. Provide a mechanism to snapshot data to simplify persistence format changes (between versions)
    • This pushes a desire to make the serialization format somewhat independent of the persistence format where possible (it's a balance between duplicating code and duplicating dependencies). With full re-use we lose the ability to make the serialization format independent of persistence changes (i.e. we could just export RFiles).

Export Functionality

  • Takes a GeoWave namespace as an input
  • Exports (to HDFS) a serialized format of the dataset (single file) which contains (contains here means can be derived from):
    • Feature type definition
    • Original index definition
    • Original namespace name
    • Data values (including visibility)
  • Stretch functionality (not required in the initial version; might be moved to a separate enhancement ticket)
    • Ability to provide a CQL filter which subsets the data being exported

Import Functionality

  • Takes a serialized data file (single file) and imports it back into a GeoWave instance
  • By default pulls the namespace name, feature type definition, and index configuration from the serialized file
  • Allows the user to optionally override the namespace and index configuration
  • Stretch functionality
    • Allows the user to override the feature type
      • Need to further specify functionality - how are feature type mappings expressed in this case?
      • Merge capability (key off feature id?)

Serialization Details

  • File size is relevant
  • Avro is what we have leveraged in other places for this; we should use it here unless there's a strong reason not to (see the schema sketch after this list)
  • Consider creating a feature collection concept so the feature names don't have to be duplicated in every feature instance
  • Some structure is required - we want to roll index definitions, feature definitions, etc. into a single file with all the features (we don't want to deal with multiple files)
  • One of the metadata fields should be a hash of the data collection
  • Stretch goal: optional parity/ECC support

Data adapters: SimpleFeature support is what's immediately required, but some thought/design should go into handling other data adapters. Might we have a pluggable serialization capability that keys off the data adapter class? (If we re-use the data adapter directly, that might prevent us from using this capability to mitigate persistence changes.)
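To make the single-file layout concrete, here is a minimal sketch built with Avro's SchemaBuilder. Every record and field name here (ExportedFeatureCollection, collectionHash, FeatureValue, and so on) is a hypothetical placeholder, not settled design:

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

public class ExportSchemaSketch {
  // Hypothetical single-file layout: metadata appears once up front, and the
  // feature values are a repeated field, so attribute names are not duplicated
  // in every feature instance (the "feature collection" concept above).
  public static Schema buildSchema() {
    return SchemaBuilder.record("ExportedFeatureCollection")
        .namespace("geowave.export") // placeholder namespace
        .fields()
        .requiredString("namespaceName")         // original GeoWave namespace
        .requiredString("featureTypeDefinition") // encoded feature type
        .requiredString("indexDefinition")       // original index configuration
        .requiredString("collectionHash")        // hash of the data collection
        .name("features").type().array().items()
            .record("FeatureValue").fields()
                .requiredString("visibility")    // visibility kept per value
                .requiredBytes("value")          // serialized feature data
            .endRecord()
        .noDefault()
        .endRecord();
  }
}
```

Keeping the metadata in top-level fields also gives the importer a place to read the namespace and index definition before it touches any feature data.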
rwgdrummer commented 9 years ago

See #380. The Avro schema-SimpleFeatureType construction is required by both tickets.

The only differences are in the serialization details: (1) the feature collection concept, and (2) storing the feature definitions, Avro schema, and index in a separate metadata file.
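A naive version of that Avro schema-SimpleFeatureType construction (assumed here for illustration, not the #380 design) could walk the attribute descriptors and emit one Avro field per attribute; the string-only mapping below is a deliberate simplification:

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.geotools.data.DataUtilities;
import org.opengis.feature.simple.SimpleFeatureType;
import org.opengis.feature.type.AttributeDescriptor;

public class SftToAvroSketch {
  // Naive construction: one nullable string field per attribute. A real
  // implementation would switch on ad.getType().getBinding() to pick matching
  // Avro types and encode geometries differently (e.g. as WKB bytes).
  public static Schema toAvro(SimpleFeatureType sft) {
    SchemaBuilder.FieldAssembler<Schema> fields =
        SchemaBuilder.record(sft.getTypeName()).fields();
    for (AttributeDescriptor ad : sft.getAttributeDescriptors()) {
      fields = fields.optionalString(ad.getLocalName());
    }
    return fields.endRecord();
  }

  public static void main(String[] args) throws Exception {
    SimpleFeatureType sft =
        DataUtilities.createType("roads", "geom:LineString,name:String");
    System.out.println(toAvro(sft).toString(true));
  }
}
```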

Keep in mind that this ticket is a GeoWave 'export' -> files in HDFS -> 'import' back into GeoWave.

Importing into GeoWave should be thought of in general terms. This means the DataStore is not assumed to be Accumulo; the DataStore of choice could be configured using SPI or some other property. One way to deal with this is to adapt the GeoWaveOutputFormat (which ingests into GeoWave); see the sketch below. I think this adaptation can be a separate ticket.
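As a sketch of the "configured via SPI" idea: the DataStoreProvider interface and the geowave.datastore property below are assumptions for illustration, not existing GeoWave API; only the ServiceLoader and Hadoop Job calls are standard.

```java
import java.util.ServiceLoader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ImportJobSketch {
  // Hypothetical SPI: each backing store ships an implementation discovered
  // at runtime, so the import tool never hard-codes Accumulo.
  public interface DataStoreProvider {
    String name();
    void configure(Configuration conf);
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // hypothetical property naming the destination store
    String requested = conf.get("geowave.datastore", "accumulo");
    for (DataStoreProvider provider : ServiceLoader.load(DataStoreProvider.class)) {
      if (provider.name().equals(requested)) {
        provider.configure(conf); // store-specific connection settings
      }
    }
    Job job = Job.getInstance(conf, "geowave-import");
    // The job would then write through GeoWaveOutputFormat; its package and
    // static configuration helpers vary across GeoWave versions, so that
    // wiring is omitted here:
    // job.setOutputFormatClass(GeoWaveOutputFormat.class);
  }
}
```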

The key thing to keep in mind is that the file contents should be GeoWave version agnostic. Class names and binary-encoded images are not supported - hence Avro represents a good choice. The schema describing the data at rest in the file system may differ from the schema describing the data to be imported (attributes added/removed/renamed, etc.). To support this idea, here is a story.

A user wants to add a new feature attribute to feature data stored in GeoWave. In addition, the user wants to use the latest GeoWave version. However, the feature data adapter has changed, adding a new attribute to its serialized image. The user uses the export tool to export their existing data along with the metadata definitions using the older version of GeoWave. The user defines the schema for the new feature type and provides a transformation function that takes features of the prior version and fills in the new attribute of the new version of the data. The GeoWave team has conveniently provided a transformation function that transforms the feature data adapter metadata to the new version of the adapter, supplying a value for the attribute added to the feature data adapter. Armed with the transformations, the user can exercise the import tool, which is run using the newer version of GeoWave.
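The old-version export / new-version import in this story lines up with Avro's built-in schema resolution. A minimal sketch, assuming a hypothetical export.avro data file and a feature-v2.avsc reader schema:

```java
import java.io.File;
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class SchemaEvolutionSketch {
  public static void main(String[] args) throws IOException {
    // The reader schema describes the *new* feature layout; the writer schema
    // travels inside the exported file itself.
    Schema readerSchema = new Schema.Parser().parse(new File("feature-v2.avsc"));
    GenericDatumReader<GenericRecord> datumReader =
        new GenericDatumReader<>(readerSchema);
    try (DataFileReader<GenericRecord> fileReader =
        new DataFileReader<>(new File("export.avro"), datumReader)) {
      for (GenericRecord record : fileReader) {
        // Avro resolves writer -> reader: added fields are filled from
        // defaults, removed fields are skipped, renames go through aliases.
        // Attributes whose values must be computed still need the
        // user-supplied transformation function from the story above.
      }
    }
  }
}
```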

Some initial design artifacts can be: (1) The structure of the metadata file. Is this Avro as well? If so, is it best to create Avro representations for data adapters and indices?
(2) The components of the import/ingest portion of the process - for example, specification of the destination feature and the transformation function to be applied. A transformation takes an object (e.g. a SimpleFeature) from the stored format and transforms it into another object (added/removed attributes, etc.); see the sketch below. Recall there are two types of transformation: data and metadata (adapters, indices, etc.).
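As a sketch of the data-transformation component - the FeatureTransform interface, the "source" attribute, and its default value are all assumptions, with GeoTools' SimpleFeatureBuilder doing the copying:

```java
import org.geotools.feature.simple.SimpleFeatureBuilder;
import org.opengis.feature.simple.SimpleFeature;
import org.opengis.feature.simple.SimpleFeatureType;
import org.opengis.feature.type.AttributeDescriptor;

public class FeatureUpgradeSketch {
  // Hypothetical transformation contract: old stored feature in, feature
  // conforming to the destination type out.
  public interface FeatureTransform {
    SimpleFeature apply(SimpleFeature old);
  }

  // Copies attributes shared with the destination type, then overwrites the
  // newly added attribute ("source" is a made-up example) with a default.
  public static FeatureTransform upgradeTo(final SimpleFeatureType newType) {
    return new FeatureTransform() {
      @Override
      public SimpleFeature apply(SimpleFeature old) {
        SimpleFeatureBuilder builder = new SimpleFeatureBuilder(newType);
        for (AttributeDescriptor ad : newType.getAttributeDescriptors()) {
          // getAttribute returns null for attributes the old version lacked
          builder.set(ad.getLocalName(), old.getAttribute(ad.getLocalName()));
        }
        builder.set("source", "v1-import"); // fill in the added attribute
        return builder.buildFeature(old.getID());
      }
    };
  }
}
```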

rfecher commented 8 years ago

This is a parent issue for #684, #686, and #687