singer-io / getting-started

This repository is a getting started guide to Singer.
https://singer.io
1.26k stars 148 forks source link

Schema Editing #14

Closed mdelaurentis closed 7 years ago

mdelaurentis commented 7 years ago

We would like to modify the SPEC in order to provide the following additional features:

  1. Schema Discovery - Allow a Tap to indicate what the Schema will be without actually streaming data
  2. Stream and Field Selection - Allow the user to select a subset of the available streams and fields
  3. Stream Renaming - Allow the user to provide names for stream in order to resolve name collisions

We'll refer to these two features together as Schema Editing.

Motivation

  1. Stream Renaming

    Currently the RECORD and SCHEMA messages use a "stream" field to identify a stream. A data source that contains a hierarchy of namespaces, such as Postgres, it's not clear how a Tap should derive the name of the stream. For example, suppose we have a Postgres database called "prod", with a schema called "public", with a table called "users". If we were to call the stream "users", that might conflict with a table named "users" in some other schema. In order to disambiguate, we may want to allow the user to provide a mapping from the data source's notion of a stream or table to the Tap's stream name.

Note that Schema Discovery is required by both the Stream and Field Selection and Stream Renaming features.

Proposed Solution

Extend the specification as follows.

Add Discover Mode

Taps should allow an optional --discover command line flag. If --discover is provided, the Tap should print out a SCHEMA message for every stream that is available to it

It is expected that the user will edit the discovered schemas through some interface in order to delete schemas for streams they don't want, or delete specific fields they don't want. Then they can pass the resulting pruned schemas back in via a --schemas option...

Add Schema Selection

A Tap should allow an optional --schemas SCHEMAS argument that points to a file containing the list of schemas describing the desired output. It is expected that the schemas provided will be a pruned version of the schemas produced by a previous run of the same Tap in discover mode. The Tap should attempt to produce output that conforms to the schemas provided with the --schemas option. If no --schemas option is provided, the Tap should fetch all fields of all streams available.

Add "source" to SCHEMA message

Extend the SCHEMA message to add a "source" field, the structure of which is determined entirely by the Tap. The "source" field identifies the source of the stream. For a Tap that pulls from a database source, this could be something like

"source": {
  "schema": "public",
  "table": "users"
}

For a Tap that pulls from an API, it could be

"source": {
  "endpoint": "users"
}

If the user wants to rename a Stream, they can provide a --schemas argument that provides a new value for the "stream" field for the same source.

Example

Suppose we have a Postgres Tap, with a configuration that points to a database that has the following schema / table / field structure:

Suppose the Postgres tap normally names the stream "_

", so the stream names would be "public_users" and "public_orders".

So if we ran the Tap in discover mode:

$ tap_postgres --config config.json --discover > schemas.json

we would get the following output

{
  "type": "SCHEMA",
  "stream": "public_users",
  "source": {"schema": "public", "table": "users"},
  "key_properties": ["id"],
  "schema": {
    "type": "object",
    "properties": {
      "id": {"type": "integer"},
      "first_name": {"type": "string"},
      "last_name": {"type": "string"},
    }
  }
}
{
  "type": "SCHEMA",
  "stream": "public_orders",
  "source": {"schema": "public", "table": "orders"},
  "key_properties": ["id"],
  "schema": {
    "type": "object",
    "properties": {
      "id": {"type": "integer"},
      "user_id": {"type": "integer"},
      "amount": {"type": "number"},
      "credit_card_number": {"type": "string"}
    }
  }
}

Now let's assume the user wants to make the following changes to the schema:

  1. Remove the public_ prefix from the stream names
  2. Get rid of the users table
  3. Get rid of the credit card field

The user could make those changes by deleting the schema message for the users table, deleting the schema property for the "credit_card_number" field, and changing the stream name for the orders table:

{
  "type": "SCHEMA",
  "stream": "orders",
  "soruce": {"schema": "public",
                     "table": "orders"},
  "key_properties": ["id"],
  "schema": {
    "type": "object",
    "properties": {
      "id": {"type": "integer"},
      "user_id": {"type": "integer"},
      "amount": {"type": "number"},
    }
  }
}

So now the user would run the Tap again, specifying the edited schema file as input:

$ tap_postgres --config config.json --schemas schemas_edited.json

Concerns

  1. Can we come up with a better name for "discover mode"?

  2. How do we keep this from overly complicating taps that don't need schema selection?

    The schema editing adds a lot of complexity. For Taps that can provide very large sets of streams and fields, this is necessary. But what about a Tap with a small static schema, that doesn't need to support schema selection? In particular:

    1. Can a Tap choose not to support schema selection? If so,
    2. What should a Tap that doesn't support schema selection do if I call it with the --discover flag? Print out the schema and exit 0? Exit non-0?
    3. What should a Tap that doesn't support schema selection do if I call it with a --schemas SCHEMAS option? Ignore it, or fail?
    4. What if a Tap that does support schema selection is invoked with a --schemas SCHEMAS option where the schemas provided do not match the schema that's available to it?

    Given the complexity introduced by these changes, I'm inclined to say that we should make Stream and Field Selection and Stream Renaming optional parts of the spec, and say that a Tap that does not want to support these features should fail hard if they are invoked with a --discover or --schemas option.

karstendick commented 7 years ago

@mdelaurentis What's the relationship between the "source" and "location" fields?

karstendick commented 7 years ago

i. Yes. There will be taps with few streams and thus little need for schema selection. Also, this keeps the requirements simple for new taps.

ii. Unexpected command-line arguments should cause a tap to fail. Fail early and often, or else problems can linger undetected.

iii. Same answer: the tap should fail.

iv. Similarly, the tap should fail as soon as it detects the schema invalidity (hopefully before even doing any other work). Otherwise, what would its behavior be? Replicate just the intersection of desired and available data? How would it report the partial failure so that the user can act on it?

An immediate failure in this case would provide the quickest feedback to the user so that they can take the necessary action, such as re-running discovery mode and re-selecting their schema. Choosing to ignore this error would let a problem, such as a mistake in the schema selection, to remain undetected for too long. This would increase the support burden, as such problems will be more difficult to troubleshoot.

mdelaurentis commented 7 years ago

What's the relationship between the "source" and "location" fields?

They're the same thing. I was initially calling it "location", then I changed the name to "source" but apparently missed a few spots. I just fixed it. Thanks!

@karstendick I agree with your conclusions here https://github.com/singer-io/getting-started/issues/14#issuecomment-286856444

cmerrick commented 7 years ago

Add "source" to SCHEMA message

I like this concept. Calling it "source" might be confusing since we commonly talk of databases and APIs as data sources. Sounds like you considered "location" and nixed it - what about "path"?

The Tap should attempt to produce output that conforms to the schemas provided with the --schemas option.

This statement is hairy when you consider the input schemas could have data types that have been modified from the actual data sources's types. I don't think we want to use this feature for type coercion, and as long as that's the case it's worth saying so explicitly.

I'm inclined to say that we should make Stream and Field Selection and Stream Renaming optional parts of the spec

Agree

mdelaurentis commented 7 years ago

@cmerrick "path" make sense. We talked about this in person, but I'm not in favor of making this a "best practice" rather than even an optional part of the spec, since it doesn't affect the protocol between a Tap and Target.

mdelaurentis commented 7 years ago

We've done a little experimenting with this internally. Some feedback from Stitch developers:

  • We should not have "type": "SCHEMA" properties on the schemas emitted in discover mode, because they're not actually part of the stream that will be fed into a target.
  • Use "path" to refer to the schema and table.
  • We should probably change the format of the document that's fed back in to select streams and fields. Requiring a full JSON schema raises too many questions. What happens if the types on the schema specified on the command line disagree with the types in the actual data? The document we feed back in should just select streams and fields.
mdelaurentis commented 7 years ago

This is part of the best practices guide now. Closing.