meltano / sdk

Write 70% less code by using the SDK to build custom extractors and loaders that adhere to the Singer standard: https://sdk.meltano.com
Apache License 2.0

docs: better recommendation for dynamic streams #1662

Open pnadolny13 opened 1 year ago

pnadolny13 commented 1 year ago

I don't feel like we have a definitive answer to the question of how to properly use dynamic schema discovery, and the questions are becoming more and more frequent. There have been a bunch of different implementations floating around, but it's hard to tell which approach is recommended. I feel like I'm pretty far in the weeds with the SDK and it wasn't clear to me, so I can see this being an easy tripping point for new tap developers.

Recently this has been coming up a lot in Slack.

The SDK docs include a section (https://github.com/meltano/sdk/blob/main/docs/code_samples.md#dynamically-discovering-schema-for-a-stream) that shows how to override the schema. This is what I followed for tap-dynamodb, but it was pointed out to me that doing it this way clobbers the input catalog.
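To make the clobbering problem concrete, here is a minimal sketch. It deliberately does not use the real SDK classes: `BaseStream`, `DynamicStream`, and `discover_schema` are hypothetical stand-ins, and the only point is that an unconditional `schema` property override ignores whatever schema was passed in from an input catalog.

```python
from typing import Optional


def discover_schema(stream_name: str) -> dict:
    """Hypothetical discovery call; a real tap would query the source system."""
    return {
        "type": "object",
        "properties": {"id": {"type": "string"}, "extra": {"type": "string"}},
    }


class BaseStream:
    """Toy stand-in for a stream base class (NOT the real SDK class)."""

    def __init__(self, name: str, schema: Optional[dict] = None) -> None:
        self.name = name
        self._schema = schema  # e.g. a schema taken from an input catalog

    @property
    def schema(self) -> dict:
        return self._schema or {}


class DynamicStream(BaseStream):
    """Overrides `schema` unconditionally, so discovery always wins."""

    @property
    def schema(self) -> dict:
        # The schema handed to __init__ is never consulted here.
        return discover_schema(self.name)


# A schema as it might arrive from a user-supplied input catalog.
catalog_schema = {"type": "object", "properties": {"id": {"type": "string"}}}
stream = DynamicStream("users", schema=catalog_schema)

# The catalog schema was passed in, but the override returns the discovered one.
clobbered = stream.schema != catalog_schema
```

In this toy model, `clobbered` is true: the stream reports the discovered schema even though a different one was supplied at construction time.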

Considerations:

Implementations I've seen:

  1. tap-netsuite - uses the schema property override method and manually handles input catalogs by accessing private attributes.
  2. tap-dynamodb v1 - uses the schema property override, which gets called first by the super init. I needed to set private primary keys manually, and input catalogs were not respected.
  3. tap-dynamodb v2 - similar to the above, using the schema property override, but now I check for an input catalog during init and pass it to the base stream class if it exists. The challenge was that, without doing this manually in the init, the base stream class was receiving a null schema and wasn't checking for an input catalog.
  4. tap-mongodb z3z1ma - overrides the catalog dict in the tap class. This also accesses private input catalog attributes to decide whether it should dynamically generate the catalog or not. I find overriding the full catalog and having to manually handle the metadata, schema, stream name, keys, etc. cumbersome. The SDK usually helps abstract a lot of this, so I'd prefer to defer it to the SDK so that it gets handled correctly. It's also possible that this is the proper implementation, but this MongoDB tap is an advanced use case, and a more default implementation could be cleaner and safer for someone who's not as familiar with the internals of Singer.
  5. tap-mongodb mensenski - similar to z3z1ma but seems more grokable 😅 to me, maybe just due to the code organization. It has to manually handle detecting whether it should use the input catalog, and it generates catalog entries in a separate connector class.
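The common thread in these implementations ("use the input catalog if one was given, otherwise discover") can be sketched roughly as follows. This is not SDK code and not a recommended approach: `CatalogAwareStream`, `schema_from_catalog`, and `discover_schema` are hypothetical names, and the only thing assumed from Singer itself is the catalog layout, a `streams` list whose entries carry `tap_stream_id` and `schema`.

```python
from typing import Optional


def discover_schema(stream_name: str) -> dict:
    """Hypothetical discovery call; a real tap would query the source system."""
    return {
        "type": "object",
        "properties": {"id": {"type": "string"}, "extra": {"type": "string"}},
    }


def schema_from_catalog(input_catalog: Optional[dict], stream_name: str) -> Optional[dict]:
    """Return the schema for `stream_name` from a Singer catalog dict, if present."""
    if not input_catalog:
        return None
    for entry in input_catalog.get("streams", []):
        if entry.get("tap_stream_id") == stream_name:
            return entry.get("schema")
    return None


class CatalogAwareStream:
    """Toy stream (NOT an SDK class) that prefers the input catalog over discovery."""

    def __init__(self, name: str, input_catalog: Optional[dict] = None) -> None:
        self.name = name
        # Only fall back to dynamic discovery when the catalog has no entry.
        self.schema = schema_from_catalog(input_catalog, name) or discover_schema(name)


input_catalog = {
    "streams": [
        {
            "tap_stream_id": "users",
            "schema": {"type": "object", "properties": {"id": {"type": "string"}}},
        }
    ]
}

from_catalog = CatalogAwareStream("users", input_catalog)  # catalog schema is respected
discovered = CatalogAwareStream("users")  # no catalog, so discovery runs
```

Each tap listed above currently has to hand-roll some version of this check, which is the part I'd like the SDK to own.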

Questions:

cc @edgarrmondragon, we talked a bit about this. I'm happy to help update the docs if you or @kgpayne can give me guidance on the recommended approach.

stale[bot] commented 2 months ago

This has been marked as stale because it is unassigned, and has not had recent activity. It will be closed after 21 days if no further activity occurs. If this should never go stale, please add the evergreen label, or request that it be added.

pnadolny13 commented 2 months ago

Still relevant