singer-io / tap-mongodb


What targets/databases has this been used with? - Assume schema properties from documents #5

Closed MadLittleMods closed 5 years ago

MadLittleMods commented 5 years ago

tldr;


tap-mongodb generates a schema with no properties, which isn't playing nicely with target-snowflake. It errors, and even after fixing the error, none of the fields are transferred; I just get a bunch of rows with only a time_extracted field:

      "schema": {
        "type": "object"
      }

Just curious, what targets has tap-mongodb been used with so I can test and compare with target-snowflake?

dmosorast commented 5 years ago

@MadLittleMods I believe it was run with target-csv in the past, but given the loose structure of data in MongoDB, I can certainly understand that it could run into edge cases in some targets' expectations around schemas/records.

I can think of a few possibilities to consider. That should be a valid JSON schema as written, but I'm sure there are targets/libraries/functions that expect a "properties" key to exist in the schema when the type is object. Adding that could make it more reliable.

The other thing that comes to mind is that you may need to write "selected": true metadata to have the tap replicate certain fields. I'm not familiar with how this tap operates with regard to field selection, but that's a common Singer paradigm, so it could be worth a shot. I wouldn't be surprised if the tap is strictly whitelisting fields to replicate by some means, due to the variable nature of object collections' schemas.
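
For reference, field selection in Singer is typically written into catalog.json as metadata entries along these lines (stream and field names here are placeholders); whether this tap honors them is exactly the open question:

    {
      "streams": [{
        "tap_stream_id": "my_collection",
        "schema": {"type": "object"},
        "metadata": [
          {"breadcrumb": [], "metadata": {"selected": true}},
          {"breadcrumb": ["properties", "my_field"], "metadata": {"selected": true}}
        ]
      }]
    }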

MadLittleMods commented 5 years ago

@dmosorast

Could we update tap-mongodb to generate SCHEMA entries from the dynamic Mongo documents themselves? As documents get streamed, if the schema (inferred from the document) changes, we can emit a new SCHEMA message. target-snowflake supports schema updates along the way (see schema_updates.stream).

Or should targets just support "schema": { "type": "object" } and be able to handle the dynamic data and add schema fields as necessary?
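
A minimal sketch of the first option (schema inference plus re-emitting SCHEMA on change), assuming the singer-python helpers; infer_type here is a made-up helper that only covers a few types:

    import singer

    # Made-up helper: naively map a document to a JSON schema. A real version
    # would need to handle ObjectId, dates, arrays, Decimal128, etc.
    def infer_type(value):
        if isinstance(value, bool):
            return {"type": "boolean"}
        if isinstance(value, int):
            return {"type": "integer"}
        if isinstance(value, float):
            return {"type": "number"}
        if isinstance(value, dict):
            return {"type": "object",
                    "properties": {k: infer_type(v) for k, v in value.items()}}
        return {"type": "string"}  # fall back to string for everything else

    last_schema = None

    def write_with_schema(stream, record):
        global last_schema
        schema = infer_type(record)
        if schema != last_schema:  # re-emit SCHEMA only when the shape changes
            singer.write_schema(stream, schema, key_properties=["_id"])
            last_schema = schema
        singer.write_record(stream, record)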

btw, I work at GitLab, specifically on the Gitter team trying to get data from Mongo to Snowflake for analysis in Looker. GitLab also made target-snowflake and Meltano

We are currently discussing usage in https://gitlab.com/meltano/meltano/issues/113

dmosorast commented 5 years ago

@MadLittleMods Thanks for your patience on this; I've had a lot of discussions to aggregate thoughts on how this fits into the Singer world. It's a bit of a unique situation given MongoDB's stance on data typing and schema. Here is a brain dump of what I've gathered:

Who's Responsible for Data Types?

In general, the final word on data typing should be the responsibility of the target, given the schema provided by the tap, since the target knows the most about the destination columns. In the absence of a complete schema, where a record contains extra fields without type information, the target should either drop those fields or make a best effort to infer the data type.

Schema w/NoSQL

Mongo data typing is challenging due to the way it's used. There are cases I have seen where an object contains keys that are generated dynamically (like a hash), so mapping those to a table creates an ever-growing set of columns, unless the object is loaded as a JSON field. Unfortunately, JSON Schema doesn't have a great way to specify "this should remain as JSON", so this case is still an open question.
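
For example, a hypothetical document like this, keyed by user ID, would map to one new column per user unless it is loaded as a single JSON field:

    {
      "votes_by_user_id": {
        "5bb57845dc300f6a30f3d09f": 3,
        "507f191e810c19729de860ea": 1
      }
    }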

Your suggestion of emitting a schema as it changes should work with most targets, as they should be updating their internal schema upon receipt of a message. However, I have performance concerns about the time it would take to perform a full schema generation + diff for each object retrieved. This might not be an issue in practice, but seemed worthwhile to mention.

Potential Path

It seems like the use case covers a schemaless mode (where data typing is inferred from the column data) and a strict schema mode (where fields are whitelisted and typed, so you only get the pieces of the object you want).

So, in the general case of strict schema mode, I can imagine the schema being specified either in a specialized object within each collection, or in a dedicated schema collection containing objects that map collections to their schema information, which the tap can use for extraction queries and data typing. The tap could then take a config option pointing it to either a collection to find schema information, or an object ID per collection.
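
Concretely, that config might look something like this (key names are hypothetical, not existing tap-mongodb options):

    {
        "schema_collection": "singer_schemas",
        "schema_object_ids": {
            "gitter_client_access_events": "<object id of its schema document>"
        },
        ...
    }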


Does that make sense? If you want to solve your immediate use case without addressing the general concerns, feel free to do the development work in a fork, and we can continue discussing it for incorporation.

dmosorast commented 5 years ago

Perhaps that was a bit long-winded for a simple answer, but the TLDR as it pertains to your question is that the tap should provide as much information as it can about the schema of the data, and the target should use that, along with its knowledge of the destination platform, to make the best decision at the time of data typing.

The rest is just the ever-present question of how. 😄

MadLittleMods commented 5 years ago

@dmosorast Just to be clear, I assume the proposed schema-mode: strict|schemaless property is for tap-mongodb? It could be defined in config.json and also added to each stream's metadata in catalog.json, so you could override it for certain streams if you wanted.
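
e.g. as stream-level metadata (key name hypothetical):

    {
        "breadcrumb": [],
        "metadata": {"schema-mode": "strict"}
    }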

I think my current use case can be solved by just checking the latest document in the collection.

I know in my case, the schema of gitter_client_access_events has changed a bit over time (we now have user-agent information stored: agent:type, etc.), so if we wanted to gather everything, we would need to sample more than one document.


For reference, with target-snowflake, nested JSON objects from Mongo are de-nested into flat Snowflake fields: _id, t, d__user_id, d__client_id, d__tag

db.gitter_client_access_events.find().sort({_id:-1}).limit(1).pretty()
{
        "_id" : ObjectId("5bb57845dc300f6a30f3d09f"),
        "t" : ISODate("2018-10-04T02:17:41.137Z"),
        "d" : {
                "userId" : "xxx",
                "clientId" : "xxx",
                "clientName" : "Web Client",
                "tag" : "web-app",
        }
}
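
A rough sketch of that de-nesting, for illustration only (not target-snowflake's actual implementation):

    import re

    # Convert camelCase field names to snake_case: userId -> user_id
    def snake_case(name):
        return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()

    # Recursively flatten nested objects, joining path segments with "__".
    def flatten(record, parent=""):
        out = {}
        for key, value in record.items():
            column = f"{parent}__{snake_case(key)}" if parent else snake_case(key)
            if isinstance(value, dict):
                out.update(flatten(value, column))
            else:
                out[column] = value
        return out

    # flatten({"_id": "5bb5...", "d": {"userId": "xxx", "tag": "web-app"}})
    # -> {"_id": "5bb5...", "d__user_id": "xxx", "d__tag": "web-app"}
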
micaelbergeron commented 5 years ago

@MadLittleMods I think the canonical form for tap-mongodb to send its records is exactly what you have there. The schema associated with that should be fixed as:

{
    "type": "object",
    "properties": {
        "_id": {"type": "string"},
        "d": {"type": "object"},
        "t": {"type": "string", "format": "date-time"}
    }
}

Then let the target unnest the fields; the output should have at least three columns (_id, d, t) plus the unnested versions, which will not be validated. I don't think there is an easy way out of this.

dmosorast commented 5 years ago

@MadLittleMods Correct, the configuration option I mentioned would be for the tap itself. I would expect it to need to be defined at least at the stream level, and maybe default to whichever of the two modes makes the most sense when a collection is not present in the config.

For example, the config could be something like:

{
    "schema_modes": {
        "users": "strict",
        "events": "schemaless",
        ...
    },
    ...
}

Checking the last document in the collection and inferring schema from that sounds like a good first pass without being too opinionated, and the de-nesting that target-snowflake does makes sense. I think this sounds like a viable path forward and am interested to see how it turns out!
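
That first pass might look roughly like this with pymongo (the database name and connection details are assumptions):

    from datetime import datetime

    from bson import ObjectId
    from pymongo import MongoClient

    # Map a few Python/BSON types to JSON-schema snippets; anything else
    # falls back to string below.
    TYPE_MAP = {
        bool: {"type": "boolean"},
        int: {"type": "integer"},
        float: {"type": "number"},
        dict: {"type": "object"},
        datetime: {"type": "string", "format": "date-time"},
        ObjectId: {"type": "string"},
    }

    client = MongoClient("mongodb://localhost:27017")  # assumed connection
    collection = client["gitter"]["gitter_client_access_events"]

    # ObjectIds embed a creation timestamp, so sorting on _id descending
    # returns the most recently inserted document.
    latest = collection.find_one(sort=[("_id", -1)])

    schema = {
        "type": "object",
        "properties": {
            key: TYPE_MAP.get(type(value), {"type": "string"})
            for key, value in latest.items()
        },
    }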

dmosorast commented 5 years ago

Closing this for now. Feel free to open another issue or reopen if needed!

timvisher commented 5 years ago

@MadLittleMods @micaelbergeron @dmosorast This Issue became fairly wide-ranging in its discussion. Could one of you take a stab at writing up a TL;DR and updating the title and description with it for new comers to follow along? See https://github.com/singer-io/tap-quickbase/issues/20 for example.