owid / etl

A compute graph for loading and transforming OWID's data
https://docs.owid.io/projects/etl
MIT License

Running steps in process has large overhead #1229

Closed Marigold closed 1 year ago

Marigold commented 1 year ago

We're currently running steps as processes through the run_python_step function. Creating such a process and importing all Python libraries takes 2s or more (I've looked into it, and it really is just loading libraries). This can be tested with

time python -c "from etl import helpers"

Below is a screenshot from the tuna profiler, produced by running

python -X importtime -c "from etl import helpers" 2> import.log
tuna import.log
[screenshot: tuna import-time profile]

I've also profiled time etl data://grapher/fasttrack/ --force --only, comparing three approaches:

  1. Without creating new process - 5s
  2. Create new process with multiprocessing.Process - 17s
  3. Create new process with subprocess.check_call - 26s

Running under multiprocessing doesn't improve things that much.
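The gap between the three approaches can be reproduced outside the ETL with a rough timing sketch (not the project's own benchmark): an in-process call is essentially free, while each spawned process pays interpreter startup and import costs.

```python
# Rough sketch comparing per-step overhead of the three approaches.
# step() is a stand-in for a trivial ETL step.
import subprocess
import sys
import time
from multiprocessing import Process


def step():
    """Stand-in for a trivial ETL step."""
    pass


def time_in_process() -> float:
    t0 = time.perf_counter()
    step()
    return time.perf_counter() - t0


def time_multiprocessing() -> float:
    t0 = time.perf_counter()
    p = Process(target=step)
    p.start()
    p.join()
    return time.perf_counter() - t0


def time_subprocess() -> float:
    # a fresh interpreter, like subprocess.check_call in the ETL
    t0 = time.perf_counter()
    subprocess.check_call([sys.executable, "-c", "pass"])
    return time.perf_counter() - t0


if __name__ == "__main__":
    print(f"in-process:      {time_in_process():.4f}s")
    print(f"subprocess:      {time_subprocess():.4f}s")
```

In the real ETL the gap is much larger than this toy shows, because each spawned process also re-imports the heavy library stack.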

Here are related issues

larsyencken commented 1 year ago

@Marigold I had a go at reducing import overhead: https://github.com/owid/etl/pull/1234

I haven't yet benchmarked it properly across real-world tasks; that would be the thing to check.

Marigold commented 1 year ago

@larsyencken nice! Approved. What do you think about switching to multiprocessing (-35% in theory)? Or even going back from processes to isolated modules without a process at all? That would reduce the overhead to essentially zero, though we wouldn't be able to set memory limits (something we might not need anymore with our huge server, or we could apply the limit to the etl command itself).
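The "isolated modules without a process" idea can be sketched with importlib: execute a step's module in-process, loading it fresh under a unique name each time so runs don't leak state into each other. The names here (run_isolated, the run() entry point) are illustrative, not the ETL's actual API.

```python
# Sketch: run a step module in-process, freshly loaded each time,
# avoiding the interpreter-startup and import cost of a new process.
import importlib.util
import sys
from pathlib import Path


def run_isolated(module_path: Path, func_name: str = "run"):
    """Load the module at module_path under a unique name and call its
    entry point, without spawning a new interpreter."""
    name = f"_isolated_{module_path.stem}_{id(module_path)}"
    spec = importlib.util.spec_from_file_location(name, module_path)
    mod = importlib.util.module_from_spec(spec)
    sys.modules[name] = mod
    try:
        spec.loader.exec_module(mod)
        return getattr(mod, func_name)()
    finally:
        # drop it so the next run re-executes the module from scratch
        sys.modules.pop(name, None)
```

The trade-off mentioned above is visible here: since everything runs in one process, there is no per-step boundary on which to enforce a memory limit.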

It'd be interesting to ask data managers how important latency is to them. Avoiding the ~3s overhead is nice for my development workflow, but it may matter less to them (since they work mostly in notebooks).

To complicate things even more 😄, we could control whether to use a process or isolated modules with an environment variable (like DEBUG, which would also set --workers 1 and perhaps turn on verbose logging, which I often miss).
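A minimal sketch of that proposed switch: the knobs it sets (in-process execution, a single worker, verbose logging) come from the discussion above, but the function and setting names are hypothetical.

```python
# Sketch: DEBUG=1 overrides the run configuration for fast iteration.
import logging
import os


def configure_run(workers: int, use_process: bool = True) -> dict:
    """Return effective run settings, letting DEBUG=1 override them."""
    if os.environ.get("DEBUG") == "1":
        logging.getLogger().setLevel(logging.DEBUG)  # verbose logging
        # run steps in-process with a single worker
        return {"workers": 1, "use_process": False}
    return {"workers": workers, "use_process": use_process}
```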

larsyencken commented 1 year ago

Honestly, now that we have a large graph, I think we should offer parallel execution of the DAG.

Steps that want to parallelise should have the option of sharing a common process pool rather than creating their own. That would bring build times down massively for the nightly build, and make working with larger datasets with many tables much nicer, I think.

Within those processes, they'd only tackle one job at a time, and then I think we could use your isolated-modules solution safely.

Thoughts?

Marigold commented 1 year ago

Yep, I've also been thinking about parallelisation for some time. I don't think it'd make data managers' work any faster, but it could reduce times for both the nightly build and normal builds (especially when editing a heavy dependency like regions).

It's probably not going to be hard to implement: add a process pool and a queue of steps, and continuously submit steps whose dependencies have finished. It sounds like a fun exercise; let me know if you want to tackle it yourself 😉.
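That scheduler can be sketched as follows: repeatedly submit every step whose dependencies have all completed, and collect results as they finish. It's shown with a thread pool so the snippet is self-contained; a ProcessPoolExecutor drops in the same way (the submitted functions must then be picklable). run_step is a placeholder, not the ETL's actual step runner.

```python
# Sketch: execute a dependency DAG of steps with a shared pool.
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_step(name: str) -> str:
    """Placeholder for actually running one ETL step."""
    return name


def run_dag(dag: dict[str, set[str]], max_workers: int = 4) -> list[str]:
    """dag maps each step to the set of steps it depends on.
    Returns the order in which steps completed."""
    done: set[str] = set()
    order: list[str] = []
    pending = {step: set(deps) for step, deps in dag.items()}
    futures: dict = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while pending or futures:
            # submit every step whose dependencies have all finished
            ready = [s for s, deps in pending.items() if deps <= done]
            if not ready and not futures:
                raise ValueError("cycle or missing dependency in DAG")
            for s in ready:
                futures[pool.submit(run_step, s)] = s
                del pending[s]
            finished, _ = wait(futures, return_when=FIRST_COMPLETED)
            for fut in finished:
                s = futures.pop(fut)
                fut.result()  # propagate any step failure
                done.add(s)
                order.append(s)
    return order
```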

larsyencken commented 1 year ago

Hey! On the parallel ETL, I took a stab at it before putting it down. A few notes on where I got to, for when you look at it.

The dream approach

My dream API would have used a process pool, then passed the same pool to the run() method of each step.

It's possible to salvage that approach by having two types of run() methods.

On the other hand, there's the risk of making everything too complex, so I abandoned it for now in favour of parallelising between steps (not within steps).
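The "two types of run() methods" idea could look roughly like this: steps that can use the shared pool implement a pool-aware entry point, everything else keeps the plain run(), and the executor passes the pool only where it's wanted. Class and method names here are illustrative, not the ETL's actual classes.

```python
# Sketch: a shared pool is handed only to steps that opt in.
from concurrent.futures import Executor, ThreadPoolExecutor


class Step:
    def run(self) -> None:
        raise NotImplementedError


class PoolAwareStep(Step):
    def run_with_pool(self, pool: Executor) -> None:
        raise NotImplementedError


def execute(step: Step, pool: Executor) -> None:
    if isinstance(step, PoolAwareStep):
        step.run_with_pool(pool)  # parallelise within the step
    else:
        step.run()  # ordinary step, no pool involved
```

Because pool-aware steps borrow the shared pool rather than creating their own, the total worker count stays bounded even when many steps parallelise internally.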

Parallelisation strategies

https://github.com/owid/etl/pull/1331

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Marigold commented 1 year ago

I've added back the isolated environment, enabled with the DEBUG=1 flag. Reducing the overhead is really useful for "live reloading" of data pages when working on metadata.

larsyencken commented 10 months ago

Improved this cycle, as of https://github.com/owid/etl/pull/2038, via parallelisation improvements.