kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0
9.53k stars 877 forks source link

Create IDE plugins for major editors #2821

Closed astrojuanlu closed 4 days ago

astrojuanlu commented 1 year ago

We have evidence that users struggle when assigning catalog entries to node functions. For example, #2726:

when dataset names are strings, IDEs can't check that they actually match. I ran into this problem before when some code changes lead to the wrong dataset being used. When dataset names are variable names, this kind of issues will be caught sooner.

This is something that an IDE extension could help with. There is already a kedro-lsp extension (see https://github.com/kedro-org/kedro/issues/712#issuecomment-1172216531), but doesn't appear to be maintained anymore.

Such extension could help with other things, to be defined.

Evidence markers

astrojuanlu commented 11 months ago

PyCharm now supports LSP https://blog.jetbrains.com/platform/2023/07/lsp-for-plugin-developers/ via @datajoely

datajoely commented 11 months ago

@astrojuanlu excited to see this move ever so slightly forward.

The number one feature for me is the link between our "magic string" input/output/parameter references in node definitions and their YAML counterpart in the catalog. IDE users are used to ⌘ Command + Click symbols and jumping to their definition.

If we were to store the file/line number reference in the live catalog object we could do exciting things.

astrojuanlu commented 10 months ago

I had another user complain about hard navigation/autocomplete/typo detection for dataset names. Creating an IDE plugin as stated in this issue is of course one of the possible solutions, maybe others could be explored.

noklam commented 10 months ago

Maybe worth explore the kedro-lsp in 0.19

On Sat, 9 Sept 2023, 13:59 Juan Luis Cano Rodríguez, < @.***> wrote:

I had another user complain about hard navigation/autocomplete/typo detection for dataset names. Creating an IDE plugin as stated in this issue is of course one of the possible solutions, maybe others could be explored.

— Reply to this email directly, view it on GitHub https://github.com/kedro-org/kedro/issues/2821#issuecomment-1712505860, or unsubscribe https://github.com/notifications/unsubscribe-auth/AELAWLZVWMWY3WNUUN2K2J3XZRR4VANCNFSM6AAAAAA2OUGNHI . You are receiving this because you are subscribed to this thread.Message ID: @.***>

astrojuanlu commented 7 months ago

Nice work from the DVC folks on their VSCode extension

Screenshot 2023-12-07 at 12 28 16 Screenshot 2023-12-07 at 12 28 30
astrojuanlu commented 7 months ago

The pre-requisite for building this is to store the YAML line no/cursor position of the catalog entry at load-time.

For reference, ruamel.yaml provides the cursor information:

In [11]: from ruamel.yaml import YAML

In [12]: yaml = YAML()

In [13]: data = yaml.load("""
    ...:      # testing line and column based on SO
    ...:      # http://stackoverflow.com/questions/13319067/
    ...:      - key1: item 1
    ...:        key2: item 2
    ...:      - key3: another item 1
    ...:        key4: another item 2
    ...:         """)

In [14]: data
Out[14]: [{'key1': 'item 1', 'key2': 'item 2'}, {'key3': 'another item 1', 'key4': 'another item 2'}]

In [15]: type(data)
Out[15]: ruamel.yaml.comments.CommentedSeq

In [16]: data[0].lc
Out[16]: LineCol(3, 7)

In [17]: type(data[0])
Out[17]: ruamel.yaml.comments.CommentedMap

However:

I think some of this metadate we'd need is hidden by the OmeagaConf.load method

Indeed, OmegaConf.load I/O is coupled with PyYAML unless an already loaded object is directly provided:

https://github.com/omry/omegaconf/blob/fd730509ef10a074f97b1738c630720157ceeeab/omegaconf/omegaconf.py#L183-L193

I think this supports the idea of separating the loading from the resolving part as proposed in https://github.com/kedro-org/kedro/issues/2481

astrojuanlu commented 6 months ago

Similar case of trying to parse YAML and give good error messages:

The initial version of rattler-build used serde & serde_yaml to parse the recipe. That worked OK, but was limited because we could not get great error messages out of serde_yaml, since it doesn't report the locations (line & column) of the encountered issues.

https://prefix.dev/blog/rattler_build_a_new_parser

astrojuanlu commented 6 months ago

However, the primary downside for me was the requirement to set up configurations using YAML.I would prefer it to be closed within a Python script because editor completion.

https://www.reddit.com/r/kaggle/comments/18j4c0e/what_pipeline_libraries_do_you_recommend_for/?utm_source=share&utm_medium=web2x&context=3

astrojuanlu commented 5 months ago

A user asking about this https://linen-slack.kedro.org/t/16374428/hey-team-kedro-is-there-any-vscode-extension-through-which-w#4844d74c-e02a-48da-9a7f-db4fea33703a

inigohidalgo commented 5 months ago

Just saw this issue linked from the slack conversation.

This would solve my number one problem with kedro as a user. I raised this in a user interview some months back, and also in various conversations: the disconnect between string objects in python and the objects which those strings represent within kedro. Refactoring is a huge hassle whenever it involves changing the shape of pipelines as there always end up being orphan datasets.

So this issue has a huge upvote from me.

noklam commented 4 months ago

There are many technical complexity for this, i.e. dynamic generated config/catalog

Nonetheless, I believe it's a huge improvement and we don't necessary wait until we have a full solution. I spent quite a bit of time to look at the original kedro-lsp and finally make something that runs in 0.19.x. I'm pretty excited about this. lsp

datajoely commented 4 months ago

The tool tip is awesome here! I guess dataset factories would work the same, I still think it would be nice to have an equivalent of dbt compile where you could jump to the resolved config.

noklam commented 4 months ago

IDE Plugins are very broad, I have seen a few things mentions here and recalled some discussion in the past:

IDE support:

Backward compatibility: I am not sure yet how to make VSCode plugin that backward compatible, or maybe we shouldn't care this so soon because this can be a good reason to drive more 0.19 adoption.

User Research: (??) - there are many possibilities and unknown here

datajoely commented 4 months ago

Honestly I think VS Code is the right call for any initial MVP - you can just point to DVC and Databricks making the same call

astrojuanlu commented 4 months ago

Yeah the title of this issue is too broad. Let's start with an LSP for Kedro, basically bringing kedro-lsp back to life and working with Kedro 0.19 + docs for how to set it up on VSCode. That's already a massive usability boost for folks.

@noklam shall we open a separate issue for it?

noklam commented 4 months ago

3691 - let's move the LSP specific discussion here

noklam commented 3 months ago

https://blog.jetbrains.com/platform/2023/07/lsp-for-plugin-developers/

I did some research today and found that in theory Pycharm support LSPs, but it's only limited to paid users which is disappointing.

datajoely commented 3 months ago

Disappointing but I think it's still a great step especially if we target an enterprise user persona

astrojuanlu commented 2 months ago

VSCode plugin exists already, see #3691

astrojuanlu commented 4 days ago

Considering this done for now, if we ever decide to target other editors we can model those after the existing VS Code extension.