kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0

Add remote (cloud) options for --conf-source #3982

Open OlegPodlipalin opened 3 months ago

OlegPodlipalin commented 3 months ago

Description

Currently, we deploy Kedro projects as part of our SaaS solutions within Docker containers, managed by a Prefect server on AWS ECS. These containers are configured to handle all possible configurations for various tenants. However, as we transition to an on-prem solution, we face a challenge where exposing all tenant configurations within a single container poses security and relevance issues since the client will have access to configurations that are not pertinent to them.

We're looking for a way to modularize configuration management without exposing unnecessary data. While Kedro supports cloud storage in the data catalog via the "protocol" option with the fsspec library, it does not support a similar mechanism for the configuration source. Enhancing this functionality would significantly streamline the deployment process for our on-prem solutions.

Context

This change is crucial for us as it would allow for more secure and relevant configurations per tenant, without the risk of exposing data not intended for them. Implementing this feature would also align configuration management with how data sources are currently handled, making our systems more robust and flexible. This feature could benefit other users who require more granular control over configuration management in environments where host file systems are not a viable option.

Possible Implementation

A potential implementation could involve extending the existing functionality of Kedro's OmegaConfLoader to recognize and handle various storage protocols using the _parse_filepath function from kedro/io/core.py lines 690-728:

import re
from urllib.parse import urlsplit

# HTTP_PROTOCOLS and CLOUD_PROTOCOLS are module-level constants in kedro/io/core.py


def _parse_filepath(filepath: str) -> dict[str, str]:
    """Split filepath on protocol and path. Based on `fsspec.utils.infer_storage_options`.

    Args:
        filepath: Either local absolute file path or URL (s3://bucket/file.csv)

    Returns:
        Parsed filepath.
    """
    if (
        re.match(r"^[a-zA-Z]:[\\/]", filepath)
        or re.match(r"^[a-zA-Z0-9]+://", filepath) is None
    ):
        return {"protocol": "file", "path": filepath}

    parsed_path = urlsplit(filepath)
    protocol = parsed_path.scheme or "file"

    if protocol in HTTP_PROTOCOLS:
        return {"protocol": protocol, "path": filepath}

    path = parsed_path.path
    if protocol == "file":
        windows_path = re.match(r"^/([a-zA-Z])[:|]([\\/].*)$", path)
        if windows_path:
            path = ":".join(windows_path.groups())

    options = {"protocol": protocol, "path": path}

    if parsed_path.netloc and protocol in CLOUD_PROTOCOLS:
        host_with_port = parsed_path.netloc.rsplit("@", 1)[-1]
        host = host_with_port.rsplit(":", 1)[0]
        options["path"] = host + options["path"]
        # Azure Data Lake Storage Gen2 URIs can store the container name in the
        # 'username' field of a URL (@ syntax), so we need to add it to the path
        if protocol == "abfss" and parsed_path.username:
            options["path"] = parsed_path.username + "@" + options["path"]

    return options
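For illustration, here is a minimal stdlib-only re-implementation of this parsing, showing how typical --conf-source values would be split. The CLOUD_PROTOCOLS and HTTP_PROTOCOLS values below are assumed for the sketch (the real tuples live in kedro/io/core.py), and the Windows and abfss edge cases are omitted:

```python
import re
from urllib.parse import urlsplit

# Assumed values for illustration; the real tuples are defined in kedro/io/core.py
CLOUD_PROTOCOLS = ("s3", "gcs", "gs", "abfs", "abfss", "adl")
HTTP_PROTOCOLS = ("http", "https")


def parse_filepath(filepath: str) -> dict[str, str]:
    """Minimal re-implementation of _parse_filepath (Windows/abfss cases omitted)."""
    if (
        re.match(r"^[a-zA-Z]:[\\/]", filepath)  # Windows drive path, e.g. C:\conf
        or re.match(r"^[a-zA-Z0-9]+://", filepath) is None  # no protocol prefix
    ):
        return {"protocol": "file", "path": filepath}

    parsed = urlsplit(filepath)
    protocol = parsed.scheme or "file"
    if protocol in HTTP_PROTOCOLS:
        # HTTP(S) paths are kept whole, including the scheme
        return {"protocol": protocol, "path": filepath}

    options = {"protocol": protocol, "path": parsed.path}
    if parsed.netloc and protocol in CLOUD_PROTOCOLS:
        # Re-attach the bucket/container (netloc) to the path
        host = parsed.netloc.rsplit("@", 1)[-1].rsplit(":", 1)[0]
        options["path"] = host + options["path"]
    return options


print(parse_filepath("conf"))                # {'protocol': 'file', 'path': 'conf'}
print(parse_filepath("s3://my-bucket/conf")) # {'protocol': 's3', 'path': 'my-bucket/conf'}
```

So a plain directory stays on the "file" protocol, while an s3:// URI yields the protocol plus a bucket-prefixed path that fsspec can consume.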

and integrating this logic into the OmegaConfLoader (kedro/config/omegaconf_config.py, lines 127-138), under the else statement of the existing protocol detection, for example:

import mimetypes  # imported at module level in omegaconf_config.py
import fsspec

file_mimetype, _ = mimetypes.guess_type(conf_source)
if file_mimetype == "application/x-tar":
    self._protocol = "tar"
elif file_mimetype in (
    "application/zip",
    "application/x-zip-compressed",
    "application/zip-compressed",
):
    self._protocol = "zip"
else:
    self._protocol = "file"
self._fs = fsspec.filesystem(protocol=self._protocol, fo=conf_source)

Enabling fsspec protocols for configuration sources would allow configurations to be fetched dynamically from the specified storage, whether local or cloud-based. Credentials could be provided via environment variables set when the container starts.
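Putting the two pieces together, the protocol choice could be sketched as below. This is a stdlib-only sketch: choose_conf_protocol is a hypothetical helper, not Kedro API, and the CLOUD_PROTOCOLS values are assumed for illustration:

```python
import mimetypes
import re
from urllib.parse import urlsplit

# Assumed values for illustration; the real constant lives in kedro/io/core.py
CLOUD_PROTOCOLS = ("s3", "gcs", "gs", "abfs", "abfss", "adl")


def choose_conf_protocol(conf_source: str) -> tuple[str, str]:
    """Pick an fsspec protocol and path for a --conf-source value.

    Hypothetical helper: archive sources keep the existing tar/zip
    handling, and cloud URIs fall out of the current else branch.
    """
    file_mimetype, _ = mimetypes.guess_type(conf_source)
    if file_mimetype == "application/x-tar":
        return "tar", conf_source
    if file_mimetype in (
        "application/zip",
        "application/x-zip-compressed",
        "application/zip-compressed",
    ):
        return "zip", conf_source
    # New behaviour: recognise cloud protocols instead of defaulting to "file"
    match = re.match(r"^([a-zA-Z0-9]+)://", conf_source)
    if match and match.group(1) in CLOUD_PROTOCOLS:
        parsed = urlsplit(conf_source)
        return parsed.scheme, parsed.netloc + parsed.path
    return "file", conf_source


print(choose_conf_protocol("conf"))             # ('file', 'conf')
print(choose_conf_protocol("conf.tar.gz"))      # ('tar', 'conf.tar.gz')
print(choose_conf_protocol("s3://bucket/conf")) # ('s3', 'bucket/conf')
```

The returned protocol could then be handed to fsspec.filesystem(...), with tar/zip sources still passing fo=conf_source as the existing code does.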

astrojuanlu commented 3 months ago

Thanks @OlegPodlipalin for opening this issue.

@deepyaman, @bpmeek and others were discussing something like this a few days ago (https://linen-slack.kedro.org/t/19613568/hey-everyone-a-while-ago-i-took-part-in-a-kedro-user-group-a#ceca4bdb-8037-4ec3-80e3-7eee0476bc6b), and in fact @bpmeek open-sourced https://github.com/bpmeek/Kedro-Universal-Catalog prompted by some past discussions.

Something similar was already proposed in https://github.com/kedro-org/kedro/issues/1239 and probably in other places.

We will add this to our backlog and discuss it shortly.

astrojuanlu commented 3 months ago

@datajoely I quickly scanned https://github.com/kedro-org/kedro/issues/3094 and I didn't see any mentions to config, do you have any extra insight? Also @DimedS @ankatiyar do you think this would help with, say, Airflow deployments?

datajoely commented 3 months ago

Yes, IIRC we mention how the introduction of --conf-source is already a good step in the right direction, it was much worse before!

I really want this to happen. We actually support this kind of pattern in the micropackaging piece we're about to remove... so we should just steal the already-written implementation.


DimedS commented 3 months ago

> @datajoely I quickly scanned #3094 and I didn't see any mentions to config, do you have any extra insight? Also @DimedS @ankatiyar do you think this would help with, say, Airflow deployments?

Yes, it will be quite useful for Airflow deployment for the same reason. It won't be necessary to pack the configuration into the same container if you can retrieve it from remote storage. However, I'm trying to understand how this will increase security. Could you please comment on this, @OlegPodlipalin? If someone gains access to the container, wouldn't they be able to download all configurations stored in cloud services?

OlegPodlipalin commented 3 months ago

> I'm trying to understand how this will increase security. Could you please comment on this, @OlegPodlipalin?

@DimedS It will increase security while adding an on-prem solution to the existing SaaS. In our case, we plan to mirror our ECR with all its images to a client's infrastructure; the intention is not to build another set of images. The configuration for a new on-prem client will be created separately anyway, and it will be the only set available to them, whereas all the remaining configurations (for all the SaaS tenants) will be stored in our S3.

ElenaKhaustova commented 2 months ago

@OlegPodlipalin, thank you for creating this issue! We've discussed it and we think it is a good idea, so we would like to move forward with it. Do you want to create a PR?

OlegPodlipalin commented 2 months ago

@ElenaKhaustova I'm ready to get started on this. I'll fork the repo and work on the solution. For accessing the cloud with the conf folder, I'm thinking of using environment variables for credentials.
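As a sketch of that idea: storage_options_from_env below is a hypothetical helper (not Kedro API), assuming the standard AWS variable names and the key/secret/token option names that s3fs accepts; the result could be passed as keyword arguments to fsspec.filesystem("s3", ...):

```python
import os


def storage_options_from_env(env=os.environ) -> dict:
    """Collect S3 credentials from environment variables (hypothetical helper).

    The option names (key/secret/token) are what s3fs accepts, so the result
    could be spread into fsspec.filesystem("s3", **storage_options_from_env()).
    """
    mapping = {
        "AWS_ACCESS_KEY_ID": "key",
        "AWS_SECRET_ACCESS_KEY": "secret",
        "AWS_SESSION_TOKEN": "token",
    }
    # Only pass options that are actually set in the environment
    return {option: env[var] for var, option in mapping.items() if var in env}


# In practice the variables are set by the container runtime; a dict stands in here
demo_env = {"AWS_ACCESS_KEY_ID": "AKIA...", "AWS_SECRET_ACCESS_KEY": "secret123"}
print(storage_options_from_env(demo_env))  # {'key': 'AKIA...', 'secret': 'secret123'}
```

Note that if no explicit options are passed, s3fs falls back to boto's usual credential chain, which already reads these environment variables, so the helper mainly matters for non-default credential wiring.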