kubeflow-kale / kale

Kubeflow’s superfood for Data Scientists
http://kubeflow-kale.github.io
Apache License 2.0
629 stars, 129 forks

Define pipeline dependencies & configuration within a cell instead of using UI annotations #212

Open alexlatchford opened 4 years ago

alexlatchford commented 4 years ago

Wanted to throw out this idea and see if people thought it was a good idea. I'm not sure how feasible it is, though!

Some pushback we've had with Kale is that, because workflows are defined as annotations, they are harder to test and to compose. I think this is fair criticism, but I'm still advocating for the UX model Kale strives for, as I do not see a better pattern for iterative notebook development that allows for remote execution.

I wondered if a solution to this problem would be to tag a cell as a singleton "Pipeline" cell, which would then compose the dependencies and inject the other configuration (i.e. resource limits, tolerations, etc.) needed to actually define a working pipeline that will run remotely. This would leave the applied scientist to just manage naming cells via the UI (or even via a special magic (%%) or comment syntax).

It's unclear to me what the code inside that "Pipeline" cell would be: potentially KFP syntax, TFX (or the TFX IR), or something else. I wondered what the maintainers thought of this idea 😄
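To make the idea a little more concrete, here is a minimal sketch of what such a cell could contain, written in plain Python rather than KFP or TFX. Everything in it is hypothetical and not part of Kale: the `Step` class, the `execution_order` helper, the `PIPELINE` list and the step names are invented purely for illustration. The point is only that a pipeline declared as data can be checked by an ordinary assertion.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass
class Step:
    """Hypothetical: one named notebook cell plus the config the "Pipeline" cell injects."""
    name: str
    depends_on: list = field(default_factory=list)
    limits: dict = field(default_factory=dict)  # e.g. {"cpu": "2", "memory": "4Gi"}


def execution_order(steps):
    """Return step names in an order that respects every depends_on entry.

    Raises ValueError for a dependency on an undeclared step; graphlib raises
    CycleError (a ValueError subclass) if the dependencies form a cycle.
    """
    names = {s.name for s in steps}
    graph = {}
    for s in steps:
        missing = [d for d in s.depends_on if d not in names]
        if missing:
            raise ValueError(f"{s.name} depends on unknown step(s): {missing}")
        graph[s.name] = set(s.depends_on)  # node -> its predecessors
    return list(TopologicalSorter(graph).static_order())


# Hypothetical contents of the singleton "Pipeline" cell.
PIPELINE = [
    Step("load_data"),
    Step("train", depends_on=["load_data"], limits={"cpu": "2", "memory": "4Gi"}),
    Step("evaluate", depends_on=["train"]),
]

# A check like this could live in a separate "tests" cell.
assert execution_order(PIPELINE) == ["load_data", "train", "evaluate"]
```

How the declared steps would be bound to the actual code cells, and how Kale's state marshalling between steps would fit in, is left open here.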

PS. Thanks for building Kale! When I think about how to enable our scientists to iterate on a notebook, deploy to a remote cluster, and then continue their development, Kale definitely seems like the future of this.

davidspek commented 3 years ago

@alexlatchford It seems this might be what you are after: https://github.com/kubeflow-kale/kale/tree/master-old#tagging-language

alexlatchford commented 3 years ago

> @alexlatchford It seems this might be what you are after: https://github.com/kubeflow-kale/kale/tree/master-old#tagging-language

Thanks @DavidSpek, I'm familiar with how a pipeline is currently specified using Kale, but I'm facing some opposition within my company because that approach is not testable. I understand notebooks are not really designed with testing in mind, but what I proposed instead was some programmatic syntax (ideally Python) that could be defined within a cell and then tested (likely via tests in another cell). Currently the consensus on that format is likely KFP itself: instead of compiling the entire notebook down to a pipeline, that cell would become the pipeline and would reference the other cells somehow. Obviously this is a very nebulous concept at the moment, especially given the complexity Kale handles for state management across the notebook.

davidspek commented 3 years ago

@alexlatchford I'm not sure what type of testing you have in mind, but could you not perform the required testing using the notebook file's raw JSON as input?
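As a rough illustration of that suggestion, the sketch below reads a notebook's raw JSON, collects the step annotations from each cell's `metadata.tags`, and asserts on the resulting dependency map. It assumes tags of the form `block:<name>` and `prev:<name>`, as described in the tagging-language section linked earlier in this thread; the exact tag names may differ between Kale versions, so treat them as an assumption. The inline `NOTEBOOK_JSON`, the helper names and the step names are made up for the example.

```python
import json

# Tiny stand-in for the raw JSON of a .ipynb file (nbformat 4 keeps tags in cell.metadata.tags).
# Assumption: steps are annotated with "block:<name>" and "prev:<name>" tags.
NOTEBOOK_JSON = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "code", "source": "df = load()",
         "metadata": {"tags": ["block:load_data"]}},
        {"cell_type": "code", "source": "model = fit(df)",
         "metadata": {"tags": ["block:train", "prev:load_data"]}},
        {"cell_type": "markdown", "source": "notes", "metadata": {}},
    ],
})


def read_steps(raw_json):
    """Map each tagged step name to the list of step names it depends on."""
    steps = {}
    for cell in json.loads(raw_json)["cells"]:
        tags = cell.get("metadata", {}).get("tags", [])
        names = [t.split(":", 1)[1] for t in tags if t.startswith("block:")]
        if not names:
            continue  # cell carries no step tag, so it is ignored here
        prevs = [t.split(":", 1)[1] for t in tags if t.startswith("prev:")]
        steps.setdefault(names[0], []).extend(prevs)
    return steps


def undefined_dependencies(steps):
    """Return the dependency names that no cell declares as a step."""
    return sorted({p for prevs in steps.values() for p in prevs if p not in steps})


steps = read_steps(NOTEBOOK_JSON)
assert steps == {"load_data": [], "train": ["load_data"]}
assert undefined_dependencies(steps) == []
```

In practice the JSON string would come from reading the `.ipynb` file from disk, and the same assertions could run in a regular test suite outside the notebook.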