kedro-org / kedro-viz

Visualise your Kedro data and machine-learning pipelines and track your experiments.
https://demo.kedro.org
Apache License 2.0
662 stars 110 forks source link

Kedro new --tools=viz produce files like `stats.json` in the root directory #1891

Closed noklam closed 2 months ago

noklam commented 7 months ago

Description

image

https://linen-slack.kedro.org/t/16359356/i-ve-just-been-playing-with-kedro-gt-0-19-0-what-is-creating#ac36141d-bfa9-4cd5-84d7-a2a683992687

Context

The root cause is in 0.19 we introduced a new project creation flow. We can put it in a .viz folder and make sure version control ignores it. These files are meant to be internal for Kedro-viz, thus they should not be stored in the project root directory.

I think it should be fine to put it in a hidden folder. And I'd also be in favour of adding stats.json and session_store.db to the .gitignore file. (edited)

Steps to Reproduce

Expected Result

Actual Result

-- If you received an error, place it here.
-- Separate them if you have more than one.

Your Environment

astrojuanlu commented 6 months ago

Spotted this in the wild https://gitlab.com/odyssean-institute/asrs-trade-kedro (found by @stichbury)

astrojuanlu commented 6 months ago

Note that we should change the default of SESSION_STORE_ARGS, for example

https://github.com/kedro-org/kedro-starters/blob/a829a9b29b9f0020994e71d69dcadd100caf8d7b/spaceflights-pandas-viz/%7B%7B%20cookiecutter.repo_name%20%7D%7D/src/%7B%7B%20cookiecutter.python_package%20%7D%7D/settings.py#L22

ravi-kumar-pilla commented 3 months ago

Hi Team,

So the idea is to have .viz folder for all the files related to kedro-viz (stats.json, session_store.db) and other files in future.

For the creation of .viz folder -

  1. At this moment, stats.json is created by a hook on kedro-viz side. We can create the .viz folder during the hook run but the users can disable the hook via settings of the kedro project, so this does not seem a right approach
  2. Should we make .viz folder a default folder that is created when someone runs kedro new --tools=viz ? This seems to be a more streamlined approach to create the folder. The viz hook can check if the folder exists, if not it can also create the folder during the hook run.

Few questions -

  1. Since the stats.json file is in the root directory for existing projects, should we also in some way delete the file for the user ? Or document the new changes ?
  2. Is there a way to create a kedro project without using kedro new as the documentation states there are several ways to create a kedro project (which I am not aware of). In that case, creating a .viz folder becomes a manual approach and we should document the same.

@noklam do you have any suggestions here ?

Thank you

ravi-kumar-pilla commented 3 months ago

Hi @astrojuanlu and @noklam,

I have created a naive migration draft PR - https://github.com/kedro-org/kedro-viz/pull/1915/files

What it does -

  1. This will move the files created for viz inside the .viz folder
  2. Users need to change the SESSION_STORE_ARGS as SESSION_STORE_ARGS = {"path": str(Path(__file__).parents[2] / ".viz")} . This needs to be done for all the starters (manual approach and I am not sure of any alternative at this moment)
  3. We defined a constant on viz side VIZ_METADATA_LOCATION which will be used by viz hook to place stats.json

Questions:

  1. How can we force users to point SESSION_STORE_ARGS to parent/.viz ?
  2. Is there a better approach than the manual approach mentioned in (2) above for the starters ?
  3. stats.json is always re-written for kedro run, so the migration is not an issue. But session_store.db needs migration to get the old ET runs. Since this is just a file, we can ask users to copy it, but should we have a migration script which will run based on the location (I can think of a naive approach i.e., if the new location has .viz -> run the migration. Something else would be to look at the parent and check for .db files as we do in _merge method. This assumes user created parent/.viz ).

Please suggest when you have some time

cc: @rashidakanchwala

Thank you

astrojuanlu commented 3 months ago

Upon skimming @ravi-kumar-pilla's PR and comment, I wonder if this is really a Kedro-Viz problem. Would this be solved by

  1. changing the default SESSION_STORE_ARGS in all our starters, and
  2. adding .viz to the .gitignore in all starters

?

noklam commented 3 months ago

We need changes on both side:

  1. kedro-viz must change because right now the stats.json is hard coded to project_path/stats.json, which does not respect the SESSION_STORE_ARGS.
  2. kedro-starter need to update the "default" session_store_args to .viz folder.

Bonus: In my opinion, it would be better to not have SESSION_STORE_ARGS value in settings.py, we can keep the commented version there for discoveribilty, but kedro-viz should have a default location, if SESSION_STORE_ARGS exists then override.

We are also overloading "SESSION_STORE_ARGS" as "VIZ_METADATA_ARGS" here, the stats.json has nothing to do with session store. For longer term, the question is WHERE is a good place for placing plugin specific config? Whatever we are doing here is only possible with kedro-viz. Let's say if kedro-mlflow also come up with some new project config, it's not possible to update our starters to match it.

ravi-kumar-pilla commented 3 months ago

For longer term, the question is WHERE is a good place for placing plugin specific config? Whatever we are doing here is only possible with kedro-viz. Let's say if kedro-mlflow also come up with some new project config, it's not possible to update our starters to match it.

agree with @noklam on this. It should be plugin's responsibility of having default locations.

  1. Ideally, kedro-starters should not have any config variables specific to plugins.
  2. If there is a need for plugin specific config in kedro, how about having those as inputs while creating the project, something like - kedro new --tools=viz -> this will prompt for viz specific config file location ? -> if not provided, viz should have default settings for all the config needed. So kedro will not store individual variables but only file path for plugin config if provided.
  3. This would help in dealing with future config variables and allowing users to customize their project.

Thank you

astrojuanlu commented 3 months ago

For longer term, the question is WHERE is a good place for placing plugin specific config?

I guess you mean artifacts rather than config, but anyway 👍🏼

Whatever we are doing here is only possible with kedro-viz. Let's say if kedro-mlflow also come up with some new project config, it's not possible to update our starters to match it.

If anything, we can provide a set of good practices for plugins, but it will be difficult to prescribe the general case because we can't anticipate everything the users will want to do.

And yet, our starters are a special case, precisely because we try to provide pre-configured templates for our users. I'm not sure it's possible or reasonable for starters to try to stay away from plugin-specific stuff, while also providing a template that uses said plugin... If we were to provide a spaceflights-mlflow starter we'd be in a similar situation probably.

this will prompt for viz specific config file location

I'd rather not overload users with questions on the wizard, it's on us to provide a sensible default

kedro-viz should have a default location, if SESSION_STORE_ARGS exists then override. We are also overloading "SESSION_STORE_ARGS" as "VIZ_METADATA_ARGS" here, the stats.json has nothing to do with session store.

Good points. Other thoughts?

astrojuanlu commented 3 months ago

@ravi-kumar-pilla @rashidakanchwala Can we make it so that Kedro-Viz assumes a default SESSION_STORE_ARGS of

SESSION_STORE_ARGS = {"path": str(Path(__file__).parents[2] / ".viz")}

?

That would be the first step towards fixing the issue. The second one would be to update the starters to actually remove the SESSION_STORE_ARGS from them, so that Viz uses its default.

We are also overloading "SESSION_STORE_ARGS" as "VIZ_METADATA_ARGS" here, the stats.json has nothing to do with session store.

@noklam Could you open a new issue about this describing it in a bit more detail?

ravi-kumar-pilla commented 3 months ago

Hi @astrojuanlu and @noklam ,

Thanks for the suggestions. Based on the comments, as a first step -

On Kedro-Viz side -

  1. Create 2 constants at package/kedro_viz/constants.py -
    SESSION_STORE_ARGS = {"path": str(Path(__file__).parents[2] / ".viz")}
    VIZ_METADATA_ARGS = {"path": str(Path(__file__).parents[2] / ".viz")}
  2. If the user provides SESSION_STORE_ARGS, Kedro-Viz will use it to store the session_store.db file. If not, it will use the SESSION_STORE_ARGS["path"] .
  3. Start writing/reading stats.json to/from VIZ_METADATA_ARGS["path"] with the new release.
  4. Document about the default SESSION_STORE_ARGS here .
  5. Document about stats.json being created under .viz in kedro docs if they have viz installed and did not disable hooks for plugins.
  6. Starters will not be modified for now.

Let me know your thoughts.

Thank you