kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0
9.89k stars 898 forks

Skip tests on macOS because the `multiprocess` default switched to `spawn` in Python 3.8 #3705

Closed noklam closed 6 months ago

noklam commented 6 months ago

Description

Fix #3702

Development notes

After a long investigation, I found that the issue is closely related to #3704. This wasn't a problem in older versions because macOS used "fork" as the default process start method until Python 3.8, when the default switched to "spawn".
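To check which start method a given machine defaults to, a minimal sketch using only the standard library `multiprocessing` module could look like this. The helper name `default_start_method` is made up for illustration and is not part of Kedro.

```python
import multiprocessing


def default_start_method() -> str:
    """Return the platform's default start method.

    get_all_start_methods() lists the supported methods with the default
    first, and (unlike get_start_method()) does not fix the method globally.
    """
    return multiprocessing.get_all_start_methods()[0]


if __name__ == "__main__":
    # On macOS with Python 3.8+ this prints "spawn".
    print(default_start_method())
```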

This is not ideal, as there is no way to run these tests locally: most core team members use macOS. An alternative is to debug this on GitPod as a temporary solution. However, there is no way to make these tests work unless there is a clever way to make mocking work with multiple processes. I spent an hour trying and could not find anything, so I stopped here until we prioritise this.
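The reason mocking breaks can be illustrated with a rough stand-in. With "spawn", each worker is a fresh interpreter that re-imports modules, so a patch applied in the parent process is not there. The sketch below uses `subprocess` rather than `multiprocessing` purely to keep the example self-contained; it is an analogy for a spawned worker, not the code path Kedro uses, and `patch_seen_by_fresh_interpreter` is a hypothetical name.

```python
import platform
import subprocess
import sys
from unittest import mock


def patch_seen_by_fresh_interpreter() -> tuple:
    """Patch platform.system in this process, then ask a new interpreter for it."""
    with mock.patch("platform.system", return_value="patched"):
        # The parent process sees the mocked value.
        in_parent = platform.system()
        # A fresh interpreter (as with "spawn") imports the real module.
        in_child = subprocess.run(
            [sys.executable, "-c", "import platform; print(platform.system())"],
            capture_output=True,
            text=True,
            check=True,
        ).stdout.strip()
    return in_parent, in_child


if __name__ == "__main__":
    parent_value, child_value = patch_seen_by_fresh_interpreter()
    print(parent_value, child_value)
```

With "fork", by contrast, the child starts as a copy of the parent's memory, which is why patches applied before the fork used to carry over.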

The tests pass as expected because they are now simply skipped.
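For reference, a skip of this kind can be sketched as follows. This is not the exact change in this PR: the test class and method names are hypothetical, the condition keys on the default start method rather than on the platform, and it uses the standard library `unittest.skipIf` to stay self-contained (`pytest.mark.skipif` takes the same kind of condition and reason).

```python
import multiprocessing
import unittest

# True where new processes start with "spawn" by default
# (for example macOS on Python 3.8+).
DEFAULT_IS_SPAWN = multiprocessing.get_all_start_methods()[0] == "spawn"


class TestParallelRunWithMocks(unittest.TestCase):  # hypothetical test class
    @unittest.skipIf(
        DEFAULT_IS_SPAWN,
        "mocks applied in the parent are not visible in spawned workers",
    )
    def test_mocked_behaviour(self):
        # Placeholder body; a real test would rely on mocks applied in the parent.
        self.assertTrue(True)
```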

Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.

ElenaKhaustova commented 6 months ago

Tested on my side (MacOS):

noklam commented 6 months ago

@ElenaKhaustova That's expected and not caused by this PR; I think @AhdraMeraliQB has a PR for it.

Tagging @merelcht to review this, since we dealt with the multiprocessing issue recently.

noklam commented 6 months ago

@merelcht Not completely. Pipelines are at least runnable, but it seems to break hooks (which is still quite bad).

I originally opened a draft PR to add some documentation about the use of configure_project, but I think we now need a ticket to look at this properly.

https://github.com/kedro-org/kedro-viz/issues/1801

ElenaKhaustova commented 6 months ago

> @ElenaKhaustova That's expected and not caused by this PR; I think @AhdraMeraliQB has a PR for it.
>
> Tagging @merelcht to review this, since we dealt with the multiprocessing issue recently.

I don't think I had it before. I just checked, and it came in with recent updates to main, not with the current PR.

noklam commented 6 months ago

We had a discussion, and it's probably best to start by fixing https://github.com/kedro-org/kedro-viz/issues/1801. The problem is not Kedro-specific; it is about thread and process safety in Python.

There is an idea to provide some kind of safe hook class (or something like AbstractDataset). It's unclear what it would look like, so we will tackle the Viz one as an example and see how it goes.

On the other hand, kedro run --runner ***** shows up on HEAP, but the number is not significant. I posted a question in Slack to see whether anyone is using it, or whether there is a good reason not to use it.