meltano / meltano

Meltano: the declarative code-first data integration engine that powers your wildest data and ML-powered product ideas. Say goodbye to writing, maintaining, and scaling your own API integrations.
https://meltano.com/
MIT License

Refresh catalog on every invoke (fresh_catalog: true) #2848

Closed MeltyBot closed 1 month ago

MeltyBot commented 2 years ago

Broad discussion here:


Migrated from GitLab: https://gitlab.com/meltano/meltano/-/issues/2907

Originally created by @vischous on 2021-08-27 12:26:32


I want to have Meltano build a new catalog every run when running meltano invoke tap-oracle

A key at the SingerPlugin level probably makes sense, maybe called fresh_catalog and defaulting to false?

More specific than #2850, as #2627 didn't solve what I'm after. What I really want is a way to manage and watch my catalog change over time (#2677 / #2805), but this issue will be an incremental improvement over where I'm at today.

Today I delete the catalog and cache key from .meltano/run/tap-name/*
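
For anyone who wants to script that manual workaround in the meantime, here is a minimal sketch. The two file names are an assumption about what the run directory contains (the description above only says "the catalog and cache key"), and `clear_cached_catalog` is a hypothetical helper, not part of Meltano, so check your own `.meltano/run/<plugin>/` directory before relying on it.

```python
from pathlib import Path


def clear_cached_catalog(project_root: Path, plugin_name: str) -> list[Path]:
    """Delete the cached catalog and cache key for one plugin; return what was removed."""
    run_dir = project_root / ".meltano" / "run" / plugin_name
    # Assumed file names; verify them against your own run directory.
    candidates = ["tap.properties.json", "tap.properties.cache_key"]
    removed = []
    for name in candidates:
        path = run_dir / name
        if path.is_file():
            path.unlink()
            removed.append(path)
    return removed
```

Calling `clear_cached_catalog(Path("."), "tap-name")` from the project root leaves every other file in the run directory alone, which is a bit safer than deleting `.meltano/run/tap-name/*` wholesale.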

cwegener commented 2 years ago

Picking up this conversation after @edgarrmondragon pointed me to #2856 in a Slack conversation.

Whilst providing an explicit catalog cache refresh via the CLI (#2856) and via a configuration setting (this issue) is a good medium-term option, the important short-term enhancement is to address what @tayloramurphy pointed out: the existence of a cached catalog file is currently not mentioned in the documentation.

I've had a quick look through the documentation, and the most sensible place to add a note about the cached catalog (and potentially a hint about using a tap reinstall as a workaround for the missing cache clear feature) seems to be the CLI reference documentation. e.g. in the select section.

Running select is how I had my first introduction to the catalog cache (using transferwise tap-postgres). And if I had chosen to start reading the documentation in order to find out why my select output does not match my list of source entities, I probably would have ended up reading the select CLI reference documentation.

visch commented 2 years ago

Here are some other ideas:

  1. Log clearly whether a cached catalog is being used or not (plus the great docs advice from @cwegener).
  2. What if we cached with a TTL (similar to DNS)? The default TTL on invoke would be something like 24 hours, and it'd be configurable. The TTL would be ignored if the select changes (just like today), but if the select doesn't change we don't bother updating the catalog for a day. This was one of the more confusing parts to me, as I'd run a project the same way for a while and then realize, after a few days or weeks of work, that I was using an old catalog.
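
A rough sketch of how the TTL idea could work, assuming the catalog's file modification time stands in for "when discovery last ran". Everything here is hypothetical (`should_refresh_catalog`, its parameters, and the stored/current key comparison are illustrative names, not Meltano's implementation).

```python
import time
from pathlib import Path
from typing import Optional

DEFAULT_TTL_SECONDS = 24 * 60 * 60  # the "something like 24 hours" default suggested above


def should_refresh_catalog(
    catalog_path: Path,
    stored_key: Optional[str],
    current_key: str,
    ttl_seconds: float = DEFAULT_TTL_SECONDS,
    now: Optional[float] = None,
) -> bool:
    """Return True when discovery should run again instead of reusing the cache."""
    if not catalog_path.is_file():
        return True  # nothing cached yet
    if stored_key != current_key:
        return True  # select rules changed: refresh regardless of the TTL
    now = time.time() if now is None else now
    age = now - catalog_path.stat().st_mtime
    return age > ttl_seconds  # unchanged select: refresh only once the TTL has expired
```

The `now` parameter exists only to make the function easy to test; in normal use it falls back to the current time.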

tayloramurphy commented 2 years ago

@cwegener I made https://github.com/meltano/meltano/issues/6292 to track updating the documentation!

tayloramurphy commented 2 years ago

This could be interesting in the context of:

cc @aaronsteers

techtangents commented 1 year ago

This would be useful to my team during development, as we're changing the imported schema a lot.

visch commented 1 year ago

I finally figured out why this is such a frustrating use case for me, which I hadn't been able to articulate before.

When you set something like

  - name: tap-name
    inherit_from: tap-postgres
    select:
    - thissupertable.*

When new columns are added to the thissupertable table, the catalog never gets updated, because Meltano says "the select statement hasn't changed" and therefore everything is good to go.

Whenever I use * in select, I implicitly expect that Meltano is going to check and update my catalog every run. When I don't use *, I don't expect it.
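
To make the mismatch concrete, here is a toy model of the behavior described above. It assumes the cache key depends only on the select rules; the hashing scheme and both helper names are invented for illustration and are not taken from Meltano's source.

```python
import hashlib
import json
from fnmatch import fnmatch


def select_cache_key(select_rules: list[str]) -> str:
    """Hash only the select rules, which is why source schema changes go unnoticed."""
    payload = json.dumps(sorted(select_rules)).encode("utf-8")
    return hashlib.sha1(payload).hexdigest()


def selected_columns(select_rules: list[str], columns: list[str]) -> list[str]:
    """Expand glob-style rules against 'stream.column' names."""
    return [c for c in columns if any(fnmatch(c, rule) for rule in select_rules)]


rules = ["thissupertable.*"]
columns_before = ["thissupertable.id", "thissupertable.name"]
columns_after = columns_before + ["thissupertable.created_at"]  # a new column appears upstream

key_before = select_cache_key(rules)
key_after = select_cache_key(rules)  # the rules are untouched, so the key is identical

# The wildcard would match the new column, but an unchanged key means no rediscovery,
# so the cached catalog still reflects columns_before.
```

In this model the wildcard promises "everything in the table", while the cache key can only notice edits to the rule text itself.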

tayloramurphy commented 1 year ago

Good discussion on this in https://meltano.slack.com/archives/CKHP6G5V4/p1663205560565299

The more I think about this, the more I believe some sort of mechanism to alert on catalog changes would be very beneficial. I'm in favor of a near-term fix that lets users specify something like refresh_catalog: true, but longer term I want to be more thoughtful about the workflows and what we do with the catalog. There's huge value in this metadata for teams and a lot we can do with it (also thinking for managed).

aaronsteers commented 1 year ago

@tayloramurphy and @visch - from the discussion...

There seems to be a path forward where catalog_caching can be declared as an extra and disabled with meltano config tap-something set _catalog_caching disabled. One nice thing about not starting with a simple true/false is that we could later expand this to meltano config tap-something set _catalog_caching '60 min' to allow short-lived catalogs.
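
As a sketch of how one setting could carry both the toggle and a future duration, the hypothetical parser below accepts the values mentioned above. The function name, the return convention, and the accepted spellings are all assumptions made for illustration; nothing here reflects a shipped Meltano setting.

```python
import re
from datetime import timedelta
from typing import Optional


def parse_catalog_caching(value: str) -> Optional[timedelta]:
    """Return None for 'cache indefinitely', timedelta(0) for 'never cache', else a TTL."""
    text = value.strip().lower()
    if text in {"enabled", "true"}:
        return None
    if text in {"disabled", "false"}:
        return timedelta(0)
    # Only the '<N> min' form from the example above is handled in this sketch.
    match = re.fullmatch(r"(\d+)\s*min", text)
    if match:
        return timedelta(minutes=int(match.group(1)))
    raise ValueError(f"unrecognized catalog caching value: {value!r}")
```

Starting from string values rather than a strict boolean means the '60 min' form can be added later without renaming or retyping the setting.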

As an initial boolean toggle though, I think we probably would want to use the true/false or enabled/disabled value to drive the following behaviors:

What do you think?

tayloramurphy commented 1 year ago

@aaronsteers that seems reasonable as a short-term fix. Long term I like the idea of meltano catalog - I think we can drive huge value around helping with the catalog and alerting on diffs prior to run execution. This is basically the "data contract" of the Singer world...

edgarrmondragon commented 1 month ago

Arguably done by #8580. Feel free to comment if something's missing.

visch commented 1 month ago

> Arguably done by #8580. Feel free to comment if something's missing.

The PR looks like it solves the problem to me. We'll implement this and come back if it doesn't work! Thank you, this cleans up a number of things for us.

edgarrmondragon commented 1 month ago

Yeah, just so folks don't have to dig through the PR/docs, the options in Meltano 3.5.0a1+ are:

  1. Set the use_cached_catalog: false extra setting
  2. Use the --refresh-catalog option of meltano [run|el|elt|select]