@Marigold I had a go at reducing import overhead: https://github.com/owid/etl/pull/1234
I haven't yet benchmarked it properly across real-world tasks; that would be the thing to see.
@larsyencken nice! Approved. What do you think about switching to multiprocessing (-35% in theory)? Or even going back from processes to isolated modules without a separate process? That would reduce overhead to basically zero, though we wouldn't be able to set limits on memory (which is something we might not need anymore with our huge server, or we could add the limit to the `etl` command itself).
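For the memory-limit point, one option along the lines of "add the limit to the `etl` command itself" would be capping the whole process at startup. A minimal sketch with the standard `resource` module (the helper name and where it would be called from are my assumptions; `RLIMIT_AS` is Unix-only):

```python
import resource

def set_memory_limit(max_gb: float) -> None:
    # Hypothetical helper: cap the address space of the current process,
    # so dropping per-step subprocesses doesn't mean giving up memory
    # limits entirely. Unix only.
    limit_bytes = int(max_gb * 1024**3)
    resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, limit_bytes))
```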
It'd be interesting to ask data managers how important latency is for them. Avoiding the ~3s overhead is nice for development for me, but perhaps it's less important for them (since they work mostly in notebooks).
To complicate things even more 😄, we could control whether to use processes or modules with an env variable (like `DEBUG`), which would also set `--workers 1` and perhaps turn on log verbosity (which I often miss).
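As a rough illustration of that env-variable idea (all names here are hypothetical wiring, not existing ETL options):

```python
import logging
import os

# Hypothetical DEBUG switch; none of these names exist in the ETL today.
DEBUG = os.environ.get("DEBUG") == "1"

use_isolated_modules = DEBUG    # run steps in-process instead of spawning them
workers = 1 if DEBUG else None  # None -> respect whatever --workers was given
logging.basicConfig(level=logging.DEBUG if DEBUG else logging.INFO)
```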
Honestly, now that we have a large graph, I think we should offer parallel execution of the DAG.
Steps that want to parallelise should have the option to share the common process pool, rather than creating their own. That would bring down build times massively for the nightly build, and make working with larger datasets with many tables much nicer I think.
Within those processes, they'd only tackle one job at a time, and then I think we could use your isolated-modules solution safely.
Thoughts?
Yep, I've also been thinking about parallelisation for some time. I don't think it'd make data managers' work any faster, but it could reduce times for both the nightly build and normal builds (especially when editing a heavy dependency like regions).
It's probably not going to be hard to implement: add a process pool and a queue of steps, and keep submitting steps whose dependencies have finished. It sounds like a fun exercise; let me know if you want to tackle it yourself 😉.
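For reference, a minimal stdlib-only sketch of that scheduler; `run_step` is a stand-in for however the ETL actually executes a step, and the DAG is simplified to a name-to-dependencies mapping:

```python
from concurrent.futures import FIRST_COMPLETED, ProcessPoolExecutor, wait
from graphlib import TopologicalSorter


def run_step(step: str) -> str:
    """Placeholder for real ETL step execution."""
    print(f"running {step}")
    return step


def run_dag_parallel(dag: dict[str, set[str]], workers: int = 4) -> None:
    """Run steps as soon as all of their dependencies have finished.

    `dag` maps each step to the set of steps it depends on, e.g.
    {"garden/foo": {"meadow/foo"}, "meadow/foo": set()}.
    """
    sorter = TopologicalSorter(dag)
    sorter.prepare()
    with ProcessPoolExecutor(max_workers=workers) as pool:
        futures = {}
        while sorter.is_active():
            # Submit every step whose dependencies have all finished.
            for step in sorter.get_ready():
                futures[pool.submit(run_step, step)] = step
            # Wait for at least one running step to complete.
            done, _ = wait(futures, return_when=FIRST_COMPLETED)
            for fut in done:
                step = futures.pop(fut)
                fut.result()       # re-raise any exception from the step
                sorter.done(step)  # unblocks steps that depended on it
```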
Hey! On the parallel ETL, I took a stab before, then put it down. A few notes on where I got to, for when you look at it:
- My dream API would have used a process pool, then passed the same pool to the `run()` method of the step.
- It's possible to salvage that approach by having two types of `run()` methods: if a step has `run_mp()`, pass it the pool; if there is just `run()`, spawn it to run in the pool (see the sketch after this list).
- On the other hand, there's the risk of just making everything too complex, so I abandoned it for now in favour of parallelising between steps (not within steps).
- `graphlib.TopologicalSorter()` is designed to make spawning tasks from the DAG as soon as they're ready easy (see the second example in its docs), but it would take a little refactoring.
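A sketch of that two-flavoured `run()` dispatch, building on the pool idea above; `run_mp()` is the hypothetical name from this thread, and how steps get pickled into the pool is glossed over:

```python
from concurrent.futures import Future, ProcessPoolExecutor
from typing import Optional


def execute_step(step, pool: ProcessPoolExecutor) -> Optional[Future]:
    # Steps that know how to parallelise internally get the shared pool
    # and run in the current process, so they can submit work directly.
    if hasattr(step, "run_mp"):
        step.run_mp(pool)
        return None
    # Plain steps are spawned into the pool as a single job each.
    return pool.submit(step.run)
```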
I've added back the isolated environment, enabled with the `DEBUG=1` flag. Reducing the overhead is really useful for "live reloading" of data pages when working on metadata.
Made better this cycle via parallelisation improvements, as of https://github.com/owid/etl/pull/2038.
We're currently running steps as processes through the `run_python_step` function. Creating such a process and importing all Python libraries takes 2s or more (I've looked into it, and it's really just loading libraries). This can be tested with the tuna profiler; below is a screenshot from running it.
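(For context, tuna can visualise the log produced by `python -X importtime`. A cruder in-Python check of the same thing, with an illustrative rather than exact list of libraries, looks like this:)

```python
import importlib
import time

# Time a few heavy imports one by one; the list is illustrative, not the
# exact set an ETL step pulls in.
for name in ["numpy", "pandas"]:
    start = time.perf_counter()
    importlib.import_module(name)
    print(f"import {name}: {time.perf_counter() - start:.2f}s")
```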
I've also done profiling on `time etl data://grapher/fasttrack/ --force --only`, comparing three approaches:

- `multiprocessing.Process` - 17s
- `subprocess.check_call` - 26s

Running multiprocessing doesn't improve it that much.
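One plausible explanation for that gap (my reading, not measured in the thread): with the default `fork` start method on Linux, a `multiprocessing.Process` child inherits the parent's already-imported modules, while `subprocess.check_call` starts a fresh interpreter that re-imports everything. A small illustration:

```python
import multiprocessing as mp
import subprocess
import sys


def child() -> None:
    import pandas  # noqa: F401  # already in memory when forked from the parent


if __name__ == "__main__":
    import pandas  # noqa: F401  # parent pays the import cost once

    mp.set_start_method("fork")   # Unix only; the default on Linux
    p = mp.Process(target=child)  # child reuses the parent's loaded modules
    p.start()
    p.join()

    # A fresh interpreter has to import pandas from scratch every time.
    subprocess.check_call([sys.executable, "-c", "import pandas"])
```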
Here are related issues