kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0
9.85k stars 893 forks source link

Using 'kedro catalog create' on existing YAML files removes old comments #2923

Open DimedS opened 1 year ago

DimedS commented 1 year ago

Description

When we use the kedro catalog create command on existing YAML file, we're losing comments from the old file.

Context

As of now, when we use the kedro catalog create command, it overwrites the existing file. However, we save old data(but lose comments) using the following approach:

if catalog_path.is_file():
        catalog_config = yaml.safe_load(catalog_path.read_text()) or {}
    else:
        catalog_config = {}
    for ds_name in missing_ds:
        catalog_config[ds_name] = {"type": "MemoryDataset"}
    with catalog_path.open(mode="w") as catalog_file:
        yaml.safe_dump(catalog_config, catalog_file, default_flow_style=False)

Possible Implementations

We could use the ruamel.yaml library instead of PyYAML. It preserves comments with the same code structure. The downside is that it's not as popular as PyYAML. How does this option sound to us?

We could use yaml.safe_dump in append mode, simply adding new catalog data after the end of the old file. However, this may lead to issues if the previous file was corrupted as we'd just be extending that corrupted file.

astrojuanlu commented 1 year ago

This is one interesting pattern I saw with Copier:

  /tmp/test-copier ·······································································  13s 13:54:0
❯ echo "Hello" > README.md                                                                      (kedro310) 
  /tmp/test-copier ············································································· 13:54:1
❯ ls                                                                                            (kedro310) 
README.md
  /tmp/test-copier ············································································· 13:54:1
❯ copier copy gh:astrojuanlu/copier-pylib .                                                     (kedro310) 
No git tags found in template; using HEAD as ref
🎤 Your account or organization
   astrojuanlu
🎤 The name of the project (as in `pip install {project_name}`
   test_project
🎤 The name of the package (as in `import {package_name}`)
   test_project
🎤 A short description of the extension
   Test project
🎤 Your name
   Juan Luis
🎤 Your email
   juan_luis_cano@mckinsey.com

Copying from template version 0.0.0.post44.dev0+0458d66
 identical  .
    create  .flake8
    create  pyproject.toml
    create  CODE_OF_CONDUCT.md
    create  .pre-commit-config.yaml
    create  tests
    create  tests/test_import.py
    create  LICENSE
    create  docs
    create  docs/Makefile
    create  docs/source
    create  docs/source/conf.py
    create  docs/source/index.md
    create  docs/source/_static
    create  docs/source/_static/.gitkeep
    create  .gitignore
    create  .github
    create  .github/workflows
    create  .github/workflows/test.yml
    create  .github/workflows/rtd-preview.yml
    create  .github/workflows/publish.yml
    create  CONTRIBUTING.md
    create  tox.ini
    create  .readthedocs.yaml
  conflict  README.md
 Overwrite README.md? [Y/n] n
    create  src
    create  src/test_project
    create  src/test_project/__init__.py
    create  src/test_project/py.typed
  /tmp/test-copier ·······································································  29s 13:54:5
❯ ls                                                                                            (kedro310) 
CODE_OF_CONDUCT.md  LICENSE             docs/               src/                tox.ini
CONTRIBUTING.md     README.md           pyproject.toml      tests/
  /tmp/test-copier ············································································· 13:55:1
❯ cat README.md                                                                                 (kedro310) 
Hello

I think overwriting existing files should be done with care, there might be something the user puts in there for some reason.

noklam commented 5 months ago

We should check other Kedro's CLI to see if there are any other commands overrides existing files. Should there be a confirmation steps or warning generated.

DimedS commented 4 months ago

We should check other Kedro's CLI to see if there are any other commands overrides existing files. Should there be a confirmation steps or warning generated.

I checked the kedro pipeline create command. It's wrapped in a try-catch block and fails when attempting to create a duplicate pipeline. However, for the kedro catalog create command, it's acceptable if the catalog was previously created because the command is designed for that scenario. You should use it to identify any missed datasets that weren't previously described in the catalog, so they will be mentioned as MemoryDatasets for further processing. Therefore, I believe it's okay to update the catalog file, but comments will be missing as mentioned in the current issue.