kestra-io / kestra

:zap: Open-source workflow automation platform. Orchestrate any language using YAML, hundreds of integrations. Alternative to Airflow, n8n, RunDeck, Camunda, Jenkins...
https://kestra.io
Apache License 2.0

Add a lightweight Namespace Management and KV Store #3609

Closed anna-geller closed 2 months ago

anna-geller commented 5 months ago

Feature description

The key value store will be implemented on top of internal storage for the following reasons:

  1. Privacy: we want Kestra to never store users' private data. All values will be stored in the user’s private cloud storage bucket, and Kestra's database will contain only metadata about them, such as the key, file URI, any attached metadata about the object, TTL, creation date, last updated timestamp, etc.
  2. Ease of implementation/migration: users can easily switch from open-source to cloud/EE because the implementation and data storage will be the same regardless of whether Kestra runs on top of Kafka or JDBC.

Keys and values

Keys are arbitrary strings.

Values are stored as ION files in internal storage.

Thanks to ION, we can support strong types (as with inputs): datetime, int, float, string, FILE type etc.

Namespace binding

Key value pairs are tied to a namespace.

Users should be able to create and read KV pairs across namespaces as long as those namespaces are allowed (see https://github.com/kestra-io/kestra-ee/issues/1099).


Namespace management in OSS

To enable namespace binding in the OSS edition, we’ll introduce a lightweight version of Namespace Management (currently available only in EE) to the open-source edition, including:

  1. the Overview tab,
  2. the Dependencies tab,
  3. the Editor tab (https://github.com/kestra-io/kestra-ee/issues/1199)
  4. the KV Store tab: https://www.figma.com/file/ew0uXk0NRXJ2NBBJTNe2n1/UI?type=design&node-id=1465-18728&mode=design&t=TmrlE3Cx7ewyBHYl-0

[Image: Namespaces_OSS-restrictions]

All other tabs will be greyed out with a hint that they are available on EE if the user needs them.

KV Store UI

The UI for KV will look a lot like the Secrets UI, with a “New KV pair” button at the top to create a new KV pair.

https://www.figma.com/file/ew0uXk0NRXJ2NBBJTNe2n1/UI?type=design&node-id=1456%3A18323&mode=design&t=Wm5ld1tT8VcTcGL9-1

You can Create, Read, Update or Delete KV pairs using:

KV Store core plugin

Set (or modify) a KV pair

id: set_kv
type: io.kestra.plugin.kv.Set
key: myvariable
value: "{{ outputs.query.uri }}"
namespace: dev # the current namespace of the flow can be used by default
overwrite: true # whether to overwrite or fail if a value for that key already exists; default true
ttl: P30D # optional TTL
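
The overwrite and ttl properties above could behave roughly as in the following sketch. This is a minimal in-memory illustration in Python, not Kestra's implementation: the names set_kv, Entry, and KeyExistsError are hypothetical, and the TTL is passed as a timedelta rather than parsed from an ISO 8601 duration such as P30D.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Any, Optional


@dataclass
class Entry:
    value: Any
    expires_at: Optional[datetime]  # None means the pair never expires


class KeyExistsError(Exception):
    """Raised when overwrite is False and the key is already set (hypothetical)."""


def set_kv(store: dict, namespace: str, key: str, value: Any,
           overwrite: bool = True, ttl: Optional[timedelta] = None,
           now: Optional[datetime] = None) -> None:
    """Create or modify a KV pair in an in-memory store keyed by (namespace, key)."""
    now = now or datetime.now(timezone.utc)
    if not overwrite and (namespace, key) in store:
        # Mirrors "overwrite: false": fail instead of replacing the value.
        raise KeyExistsError(f"{namespace}/{key} already exists")
    # The expiry is only recorded here; it is checked later, on read.
    expires_at = now + ttl if ttl is not None else None
    store[(namespace, key)] = Entry(value, expires_at)
```

With overwrite left at its default of true, a second call for the same key simply replaces the value and resets the expiry.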

Get a KV pair

The easiest way to retrieve a value by key is to use the Pebble function following this syntax:

{{ kv('VARIABLE_NAME', namespace_name, errorOnMissing_boolean) }}
# for example, to retrieve "myfile" from the previously created "myvariable" key:
{{ kv('myvariable').myfile }} # assuming you retrieve it in a flow in the same namespace as the one in which the key was created

If you prefer, you can also retrieve the value using a task:

id: get_value_by_key
type: io.kestra.core.tasks.kv.Get
key: myvariable
namespace: dev # the current namespace of the flow can be used by default
errorOnMissing: false # bool

And if you want to check if some values already exist for a given key, you can search keys by prefix:

id: get_keys_by_prefix
type: io.kestra.core.tasks.kv.GetKeys
prefix: "myvar"
namespace: dev # the current namespace of the flow can be used by default
errorOnMissing: false # bool

The output will be a list of keys. If no keys are found, an empty list will be returned.
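
The prefix search can be sketched as a simple filter over the keys of one namespace. This is an illustrative Python sketch only: get_keys and the in-memory store keyed by (namespace, key) are assumptions made for the example, not part of the proposal.

```python
def get_keys(store: dict, namespace: str, prefix: str) -> list:
    """Return the sorted keys in `namespace` that start with `prefix`.

    An empty list is returned when nothing matches, so callers never
    need to handle a missing-key error for a prefix search.
    """
    return sorted(
        key for (ns, key) in store
        if ns == namespace and key.startswith(prefix)
    )
```

For example, with keys myvariable and myvar2 stored in dev, get_keys(store, "dev", "myvar") returns both keys, while a prefix that matches nothing returns an empty list.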

Delete a KV pair

id: delete_kv_pair
type: io.kestra.core.tasks.kv.Delete
key: myvariable
namespace: dev # the current namespace of the flow can be used by default
errorOnMissing: false 

On EE, we need a dedicated permission (it might be called KVSTORE) to allow fine-grained access to create, read, update, or delete KV pairs on specific namespaces.


Extra notes

  1. Given that all values are stored in internal storage, no payload limit is required.
  2. The TTL will be lazily evaluated: expiration is checked only when the user tries to retrieve the value. If the value is past its TTL at that point, the key will be deleted, and we'll return null plus a friendly message clarifying that the key has expired.
  3. The Purge task cannot be used to purge old keys as Purge is tied to executions. We'll need to add a new task, e.g., PurgeKV, to support purging expired keys (or all keys past a certain creation date if needed).
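
The lazy TTL evaluation and the separate purge step described in notes 2 and 3 can be sketched as follows. This is a minimal Python illustration under stated assumptions: the store is a plain dict mapping a key to a (value, expires_at) tuple, and get_kv and purge_expired are hypothetical names, not the final task or API names.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, Optional, Tuple

# key -> (value, expires_at); expires_at is None when no TTL was set
Store = Dict[str, Tuple[Any, Optional[datetime]]]


def get_kv(store: Store, key: str, now: datetime) -> Tuple[Any, Optional[str]]:
    """Return (value, message). Expiry is checked only here, at read time."""
    entry = store.get(key)
    if entry is None:
        return None, f"key '{key}' not found"
    value, expires_at = entry
    if expires_at is not None and now >= expires_at:
        # Lazy evaluation: the expired key is deleted on first access.
        del store[key]
        return None, f"key '{key}' expired at {expires_at.isoformat()}"
    return value, None


def purge_expired(store: Store, now: datetime) -> int:
    """Eagerly delete all expired keys, as a PurgeKV-style task might."""
    expired = [k for k, (_, exp) in store.items() if exp is not None and now >= exp]
    for k in expired:
        del store[k]
    return len(expired)
```

One consequence of the lazy approach is that an expired key that is never read stays in storage indefinitely, which is why a dedicated purge step is still needed.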

Extra context

Why not just use State Store?

State Store is challenging to use. Things that are hard to do with it include:

  1. Being able to see what values are persisted across flows and namespaces
  2. Being able to inspect those values from the UI (see the value for a given key)
  3. Being able to see when the key was initially created and when it was last updated
  4. Being able to set the type for that saved value
  5. (TBD later scope) Potentially also being able to react to changes in the state store as a simple decoupling mechanism

Use cases

Use cases this will enable:

  1. Keep the last timestamp scraped from an API
  2. Keep the last message or file processed to easily determine whether some new processing should take place or not
  3. (TBD later scope) KV change generates an audit log. This would make it possible, for example, to take action whenever the value has changed

brian-mulier-p commented 4 months ago

Better to go for the long-term one since allowedNamespace is now added :+1: Dumb question, but I'm wondering about the point of keeping namespace variables (aside from compatibility reasons)?

EDIT: since KV mandates JSON values, that seems to be the reason, but for new users I guess variables don't add much and won't be used.

anna-geller commented 4 months ago

Yes, the short-term version is no longer relevant.

You're right that we still need to figure out the format for the values. I'm still not sure whether the entire value should always be mandated to be JSON or whether we should also support, e.g., simple strings. TBD.

brian-mulier-p commented 3 months ago

the kestra's database only contains metadata about it, such as the key, file URI, any attached metadata about this object, TTL, creation date, last updated timestamp, etc.

Can we consider leveraging custom metadata from the storage backends directly on stored objects instead of splitting the implementation across both DB and storage? The only issue is with LocalStorage: as it's purely file-system based, we can't have such a mechanism :/

Another subject: when you talk about namespace binding, do you also mean it in terms of permissions? Should I tie read, create, delete, and update to the NAMESPACE permission?

loicmathieu commented 3 months ago

Can we consider leveraging custom metadata from the storage backends directly on stored objects instead of splitting the implementation across both DB and storage? The only issue is with LocalStorage: as it's purely file-system based, we can't have such a mechanism :/

Yes, it could be convenient for other usages. For LocalStorage, you could, for example, have a file.extension.metadata file with the information in it.

brian-mulier-p commented 3 months ago

The only pain point is that when listing KVs we will need to read every metadata file, but we can accept that it won't be as performant.
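
The sidecar-file idea for LocalStorage, and the listing cost that comes with it, can be sketched roughly as follows. This is a hedged Python illustration, not Kestra code: the function names, the JSON encoding of the metadata, and the myvariable.ion file name used in the usage note are all assumptions made for the example.

```python
import json
from pathlib import Path

SUFFIX = ".metadata"  # sidecar naming follows the file.extension.metadata idea


def metadata_path(value_path: Path) -> Path:
    """Return the sidecar path that sits next to the value file."""
    return value_path.with_name(value_path.name + SUFFIX)


def write_with_metadata(value_path: Path, data: bytes, metadata: dict) -> None:
    """Write the value file and its metadata sidecar side by side."""
    value_path.parent.mkdir(parents=True, exist_ok=True)
    value_path.write_bytes(data)
    metadata_path(value_path).write_text(json.dumps(metadata))


def read_metadata(value_path: Path) -> dict:
    """Read the metadata of a single stored value."""
    return json.loads(metadata_path(value_path).read_text())


def list_metadata(directory: Path) -> dict:
    """List metadata for every value; this opens one sidecar file per entry."""
    return {
        p.name[: -len(SUFFIX)]: json.loads(p.read_text())
        for p in sorted(directory.glob("*" + SUFFIX))
    }
```

Writing a value to myvariable.ion creates myvariable.ion.metadata next to it. Listing then costs one file read per stored pair, which is the performance trade-off mentioned above.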

MilosPaunovic commented 3 months ago

@anna-geller For this task to be successfully implemented in OSS, we need a way for our users to open a single namespace in the UI, and that is the namespace listing we overhauled for the EE version.

But the tricky part is that we also need to handle the namespace endpoints on the backend. Do we want to use this issue or open another one for the backend?

anna-geller commented 3 months ago

Milos, always feel free to create new issues if it makes things easier for you.