meltano / sdk

Write 70% less code by using the SDK to build custom extractors and loaders that adhere to the Singer standard: https://sdk.meltano.com

Consider exposing config like `dry_run_record_limit` #1366

Open aaronsteers opened 1 year ago

aaronsteers commented 1 year ago

We could theoretically expose a record limit in config for SDK-based taps. This would basically be a more advanced implementation of the `--test` CLI flag.

However, it is important to clearly disclose the following caveats to any users who opt into this behavior:

  1. This mode will very likely break referential integrity expectations.
    • The 'first 100' records from the orders table, for instance, might not match all - or even any - of the customer IDs from the first 100 records of the customers table.
  2. This mode will break incremental sync expectations for many taps.
    • There's no guarantee that this will align with paging requirements of the upstream source, and for many taps, you would not be able to simply "run it again" to get the next 100 records per stream.
  3. The total number of records sent for child streams is indeterminate, and could be unexpectedly more or less than the max.
    • If the first n parents don't have any children, then no records for the child stream would be synced. Example: I'm syncing with max count of 10. My parent stream is Issues and my child stream is Issue comments. If my first 10 issues do not have comments, then exactly zero Issue comments records will be synced, even though the max for that stream type was never met.
    • Conversely, if the 10 parent streams each have exactly 5 child records (for instance), then the total records synced for the child stream will be 50, since each pass through the child streams resets the counter (see the sketch after this list). The total synced could therefore be much more than the max.
    • Both of these behaviors are tolerable if communicated clearly to users. Most users would not mind 50 records instead of 10 for child streams; and conversely, if child items are sparse, the user could increase the limit to something large enough to ensure at least one child item is sent.
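
To make the second case under caveat 3 concrete, here is a minimal sketch of a cap that resets on each pass through a child stream. The function names and counter logic are hypothetical, not current SDK behavior:

```python
import itertools
import typing as t

def child_records_with_reset(
    parents: t.Iterable[dict],
    get_children: t.Callable[[dict], t.Iterable[dict]],
    max_per_pass: int = 10,
) -> list[dict]:
    """Cap that resets once per parent context: 10 parents with 5
    children each still yields 50 child records, because no single
    pass ever reaches the cap of 10."""
    out: list[dict] = []
    for parent in parents:
        # The counter resets here, once per pass through the child stream.
        out.extend(itertools.islice(get_children(parent), max_per_pass))
    return out
```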

For these reasons, I suggest we pay special attention to naming. I propose a `dry_run_` prefix above, but there are other ways this could be communicated as well.

A similar approach would be something like `dry_run: true` to enable a developer-specified default (or an SDK default of something like 1000), possibly also allowing an integer instead of the boolean, as in `dry_run: 10`, with docs clearly explaining that `true` uses the built-in default and any integer n overrides the limit at the user's request.
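
As a sketch of how that overloaded setting could be resolved into a limit (the `dry_run` key and the default value are assumptions from this proposal, not an existing SDK option):

```python
DEFAULT_DRY_RUN_LIMIT = 1000  # assumed built-in default, per the proposal

def resolve_record_limit(config: dict) -> int | None:
    """Map the hypothetical `dry_run` setting to a per-stream record cap.

    dry_run: true -> the built-in default limit
    dry_run: <n>  -> the user-specified limit n
    absent/false  -> no limit (normal sync)
    """
    dry_run = config.get("dry_run", False)
    if dry_run is True:
        return DEFAULT_DRY_RUN_LIMIT
    # Guard against bool, which is a subclass of int in Python.
    if isinstance(dry_run, int) and not isinstance(dry_run, bool):
        return dry_run
    return None
```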

Alternative using the CLI

An alternative using just the CLI would be to accept this integer as an optional input to `--test`. So, `--test` today tries to sync zero or one records per stream, whereas if `--test=N` were specified, we'd try to sync (at least) N records per stream.
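
A minimal sketch of how such a cap could be enforced over any stream's record iterator; the helper and the `record_limit` parameter are hypothetical, not part of the current `Stream` API:

```python
import itertools
import typing as t

def cap_records(
    records: t.Iterable[dict],
    record_limit: int | None,
) -> t.Iterator[dict]:
    """Yield at most `record_limit` records; pass everything through when None."""
    if record_limit is None:
        return iter(records)
    return itertools.islice(records, record_limit)

# Hypothetical wiring, with --test=N mapped to record_limit=N:
# for record in cap_records(stream.get_records(context), record_limit):
#     ...  # emit the record downstream
```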

cc @kgpayne, @edgarrmondragon

Related:

pnadolny13 commented 1 year ago

@aaronsteers you make good points here, but is this different from https://github.com/meltano/sdk/issues/1333? It sounds like a proposal for implementing that feature. I'll put my thoughts here, but it might fragment our discussion if this is spread across two issues.

Around naming: when I hear `dry_run`, to me it means that nothing actually runs, like a compile step, whereas this dry run would actually sync real records, just fewer of them. I like the idea of including `test` in the naming, signaling that it shouldn't be used in production to sync small batches.

> If the first n parents don't have any children, then no records for the child stream would be synced. Example: I'm syncing with max count of 10. My parent stream is Issues and my child stream is Issue comments. If my first 10 issues do not have comments, then exactly zero Issue comments records will be synced, even though the max for that stream type was never met.

It seems like there could be two different implementations; we could either choose one or implement both and let the user choose what's best for their use case: a max-style cap like the one proposed above, or a min-style threshold as described below.

My initial thought of how this would work is closer to a `test_stream_min_records` setting, where the tap keeps running until it reaches the minimum required record count. The Slack tap, for example, might find 100 messages before it finds 100 thread messages, leading it to continue extracting 1000 messages before it gets the minimum 100 thread messages and can stop. I wondered whether it would be better not to emit those extra 900 records, or instead to make it a `min_record_count`. In this case there's still the issue that potentially all records need to be extracted from the source (regardless of whether they are emitted) before the minimum threshold is reached, or it's never reached at all. Stream-level configs might make this easier, so we could say "for this child stream the minimum is 5 records and everything else is 100," but ultimately it's the user's job to use this properly.
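
A sketch of those semantics using the Slack example (all names hypothetical; `test_stream_min_records` is not an existing setting): the parent stream keeps extracting past its own minimum until the child minimum is met, and a flag decides whether the overshoot parents are emitted or only read.

```python
import typing as t

def sync_until_mins(
    messages: t.Iterable[dict],
    thread_replies: t.Callable[[dict], t.Iterable[dict]],
    parent_min: int = 100,
    child_min: int = 100,
    emit_overshoot: bool = True,
) -> tuple[list[dict], list[dict]]:
    """Keep pulling parent messages until both minimums are met or the
    source is exhausted; `emit_overshoot` decides whether parents past
    their own minimum (the 'extra 900' case) are emitted or only read."""
    parents: list[dict] = []
    children: list[dict] = []
    seen = 0
    for msg in messages:
        seen += 1
        if emit_overshoot or seen <= parent_min:
            parents.append(msg)
        children.extend(thread_replies(msg))
        if seen >= parent_min and len(children) >= child_min:
            break  # minimums met; stop early instead of a full sync
    return parents, children
```

If the minimums are never met, the loop simply exhausts the source, which matches the status-quo full sync described below.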

My initial idea for this in https://github.com/meltano/sdk/issues/1333 was for testing, where I want it to stop after X records because my start date isn't sufficient to limit the size of my dataset. Using `test_stream_min_records` while keeping my `start_date` would probably solve it. Worst case, I have to sync the day of data (the status quo, not a big deal); best case, I'm done after a minute of syncing because I have enough data. I know this might not fit everyone's use cases, though.

What do you think? I'm not married to these ideas, but wanted to share my thoughts in case it's helpful context.

stale[bot] commented 1 year ago

This has been marked as stale because it is unassigned, and has not had recent activity. It will be closed after 21 days if no further activity occurs. If this should never go stale, please add the evergreen label, or request that it be added.

pnadolny13 commented 1 year ago

Still relevant