singer-io / getting-started

This repository is a getting started guide to Singer.
https://singer.io
1.26k stars 148 forks source link

How do we want to specify primary key fields? #6

Closed mdelaurentis closed 7 years ago

mdelaurentis commented 7 years ago

I can only think of two real options for specifying which fields should be used as the primary key:

  1. Put a "key": true property on each of the key fields in the SCHEMA message. This is what we have right now, in that the target looks for that property and will set the key fields based on that.
  2. Put a top-level "key_names" property in the message. For example:

    
    {"type": "SCHEMA",
     "stream": "users",
     "schema": { "..." },
     "key_names": ["customer_id", "email"]}

The advantage of the first option is that we avoid cluttering up the top level of the message with additional properties, which is I think desirable. The advantage of the second option is that it would make the choice of primary key columns more explicit and obvious. If we went with option 1, it would be easy for a tap author to simply forget to mark some fields as keys. We have already forgotten to do that with the last four taps we've done. With option 2, we could require a "key_names" field, even if it points to an empty list.

I would vote for option 2, because I would prefer to be as explicit as possible about what the key fields are.

karstendick commented 7 years ago

I think option 2 is the best approach, too. It's better to be explicit than not, and this approach will minimize the chance of errors and therefore our future support burden.

briansloane commented 7 years ago

Option 2 seems reasonable. Can key_names be optional in that scenario? If missing then we assume the empty list?

karstendick commented 7 years ago

It'd be better to be explicit. If you want append-only behavior, then you have to specify key_names: []

This will minimize the chance of errors and support tickets.

cmerrick commented 7 years ago

Option 2 also prevents keys on non-top level fields, which is desirable.

There's a third option: specify a "key_names": [...] property within the top level object of the schema. I still lean towards Option 2, though.

mdelaurentis commented 7 years ago

Great. Option 2. Top-level "key_names" property.

cmerrick commented 7 years ago

I find "key_fields" to be a more descriptive name for this. "key_properties" is also more descriptive, and would match JSON schema lingo

On Feb 8, 2017, at 8:42 AM, Mike DeLaurentis notifications@github.com wrote:

Great. Option 2. Top-level "key_names" property.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

mdelaurentis commented 7 years ago

Alright, "key_properties" works.