kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0

Kedro on PyCafe is sort of possible #4271

Open antonymilne opened 4 hours ago

antonymilne commented 4 hours ago

Greetings friends 😀 and FYI @maxschulz-COL @maartenbreddels

Just wanted to report that it is sort of possible to run Kedro in the browser using PyCafe! ☕ 🚀 There are a couple of workarounds needed, but for the spaceflights project at least it's not too hard. Take a look here for my example project: https://py.cafe/antonymilne/kedro-vizro-dashboards

I did kedro new with kedro==0.19.9 and used the spaceflights example code (the project is in the folder descriptively named blah). The project is run in app.py. At the moment PyCafe requires some kind of app so there's also a simple Vizro app just so that the kedro code can run. Of course in reality this could be an actual interesting dashboard reporting results of the pipeline.
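
For readers who have not run Kedro programmatically before, here is a minimal sketch of what running a project from app.py could look like. It is not the actual code from the linked PyCafe project: the function name run_kedro_project is hypothetical, and the sketch assumes the project lives in the blah folder next to app.py and that kedro is installed.

```python
from pathlib import Path


def run_kedro_project(project_dir: str = "blah") -> dict:
    """Run the default pipeline of the Kedro project found in `project_dir`.

    Hypothetical helper: the folder name "blah" comes from the post, but the
    function itself is only an illustration.
    """
    # Imported inside the function so any setup can happen before Kedro loads.
    from kedro.framework.session import KedroSession
    from kedro.framework.startup import bootstrap_project

    project_path = Path(project_dir).resolve()
    # Reads the project's pyproject.toml and configures settings and pipelines.
    bootstrap_project(project_path)
    with KedroSession.create(project_path=project_path) as session:
        # Runs the default pipeline with the default (sequential) runner.
        return session.run()
```

Calling run_kedro_project() near the top of app.py, before the Vizro dashboard is built, would match the order described above, where the pipeline runs first and the app follows.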

No doubt there will be further difficulties for more realistic projects, and I didn't even try to install all the requirements in blah/requirements.txt; I just took what looked like the bare minimum to get the pipeline to run. Here are the catches/workarounds that I noticed with this simple example:

  1. There is no wheel file available for the version of antlr4-python3-runtime that omegaconf currently requires. This is fixed in omegaconf 2.4.0, but that's only available as a dev release. I see that you're aware of this difficulty already in https://github.com/facebookresearch/hydra/issues/2699 and https://github.com/omry/omegaconf/issues/1158. So this is why I set omegaconf==2.4.0.dev3 in the requirements.txt, which appears to work well.
  2. pre-commit-hooks has ruamel-yaml-clib as a transitive dependency, which doesn't have a wheel file available. I was actually quite surprised to see pre-commit-hooks as a kedro requirement, and I see this was a bit of a controversial addition at the time (https://github.com/kedro-org/kedro/pull/3436). It's easy to work around if you don't actually want to run pre-commit though: just put ruamel-yaml-clib # mock in requirements.txt.
  3. Can't remember exactly what it was, but I had a problem with some of the datasets in catalog.yml, so I only used pandas.CSVDataset or pandas.ExcelDataset.
  4. To avoid No module named '_multiprocessing' I mocked out the parallel runner.
  5. By default the kedro pipeline runs fine and then just starts again and again once it's finished, so it will never get to the Vizro app part. I guess this might be because running the pipeline produces files, which are then detected as a filesystem change by PyCafe, which causes the app to refresh or something like that? This happens even with "Save on Type" set to Off. To solve this I've just commented out all the output datasets in catalog.yml so the pipeline only runs twice (not sure why it's twice rather than once, but it's not infinite now anyway 😅 ...). If you uncomment the output datasets in catalog.yml it will just execute on loop. @maartenbreddels do you know what's happening here?
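
As an illustration of workaround 4, here is a rough sketch of one way to mock out a module that cannot be imported in the browser. This is an assumption about the general technique, not necessarily what the linked project does: the helper name stub_if_missing is hypothetical, and whether the right target is _multiprocessing itself or a Kedro module such as kedro.runner.parallel_runner would need checking against the actual traceback.

```python
import importlib
import sys
from unittest.mock import MagicMock


def stub_if_missing(name: str) -> bool:
    """Install a MagicMock under `name` in sys.modules if it cannot be imported.

    Returns True if a stub was installed, False if the real module imported.
    Hypothetical helper for illustration only.
    """
    try:
        importlib.import_module(name)
    except ImportError:
        # Later `import name` statements will now receive the stub.
        sys.modules[name] = MagicMock(name=name)
        return True
    return False


# Must run before the first `import kedro...` line in app.py.
# On a regular CPython install this is a no-op because the module exists.
stub_if_missing("_multiprocessing")
```

The stub only makes the import succeed; anything that actually tries to use the parallel runner would still fail, so this only makes sense when the pipeline is run with the sequential runner.
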

antonymilne commented 4 hours ago

P.S. this isn't really a feature request, but it didn't fit in any other category, so there you are... The feature request might be "please make it easier to not use parallel runner, omegaconf or pre-commit-hooks", but this is kind of an edge case and it's possible already with some workarounds, so I don't think you need to do anything about it for my purposes anyway. Maybe it's another small data point for @astrojuanlu's considerations on how modular kedro should be though. The biggest problem here, I think, is the omegaconf dependency, but if and when they release 2.4.0 that will be resolved for this particular setup anyway.

Mainly I just put this here as a report of what's currently possible since there was nowhere better to put it.

Edit: oh wait, I see that discussions are open now. Maybe I should have put it there. I'll leave it up to you to decide whether you want to move it!