Open aaronsteers opened 1 year ago
@aaronsteers you make good points here but is this different from https://github.com/meltano/sdk/issues/1333? It sounds like a proposal for implementing that feature. I'll put my thoughts here but it might fragment our discussion if this is in two issues.
Around naming when I hear dry_run
to me it means that nothing is actually running like a compile step whereas this dry run would actually sync real records but just not as many of them. I like the idea of including test
in the naming, showing that it shouldnt be used in production to sync small batches.
if the first n parents don't have any children, then no records for the child stream would be synced. Example: I'm syncing with max count of 10. My parent stream is Issues and my child stream is Issue comments. If my first 10 issues do not have comments, then exactly zero Issue comments records will be synced, even though the max for that stream type was never met.
It seems like there could be 2 different implementations that we can either chose from or implement both and let the user chose whats best for their use case:
test_stream_max_records
- stop syncing after max records are synced. Meaning child streams might end with zero records because the parent reached its maximum and ended before the child got enough records.test_stream_min_records
- stop syncing after a minimum threshold. Meaning a parent stream could have far more than the minimum before the child reaches its stopping point i.e. 1000 slack messages (parent) and 100 thread (child) messages.I guess my initial thought of how this would work is closer to test_stream_min_records
where it would keep running until it reaches the minimum required records. So the slack tap might find 100 messages before it finds 100 thread messages, leading it to continue extracting 1000 messages before it gets the minimum 100 thread messages and is able to stop. I wondered if it would be better to not emit those extra 900 records or instead make it a min_record_count
. In this case theres still the issue that potentially all records need to be extracted from the source (regardless of whether they are emitted) before the minimum threshold is reached or its never reached, having stream level configs might make this easier if we could say "for this child stream the minimum is 5 records and everything else is 100" but ultimately thats the user's job to use this properly.
My initial idea for this in https://github.com/meltano/sdk/issues/1333 was for testing where I want it to stop after X records because my start date isnt sufficient to limit the size of my dataset. Using test_stream_min_records
while keeping my start_date would probably solve it. Worst case scenario I have to sync the day of data (status quo, not a big deal) and best case I end after 1 min of syncing because I have enough data. I know this might not fit everyone's use cases though.
What do you think? I'm not married to these ideas but wanted to share my thoughts in case its helpful context.
This has been marked as stale because it is unassigned, and has not had recent activity. It will be closed after 21 days if no further activity occurs. If this should never go stale, please add the evergreen
label, or request that it be added.
Still relevant
We could theoretically expose a record limit in config for SDK-based taps. This would basically be a more advanced implementation of
--test
CLI flag.However, it is important to properly disclaim the following caveats for any users that opt into this behavior:
orders
table, for instance, might not match all - or even any - of the customer IDs from the first 100 records of the customers table.n
parents don't have any children, then no records for the child stream would be synced. Example: I'm syncing with max count of10
. My parent stream isIssues
and my child stream isIssue comments
. If my first 10 issues do not have comments, then exactly zeroIssue comments
records will be synced, even though the max for that stream type was never met.For these reasons, I suggest we pay special attention to naming. I propose above a
dry_run_
prefix but there are other ways this could be communicated as well.A similar approach would be something like
dry_run: true
to enable a developer-specified default (or the SDK default of something like 1000), possibly also allowing an integer instead of the boolean valuedry_run: 10
- with docs clearly explaining thattrue
uses the built-in defaults and any othern
will override the limit at the users' request.Alternative using the CLI
An alternative using just the CLI would be to accept this integer as an optional input to
--test
. So,--test
today tries to sync zero or one records per stream, where-as if--test=N
were specified, we'd try to sync (at least)N
records per stream.cc @kgpayne, @edgarrmondragon
Related: