@Marigold I had a go at reducing import overhead: https://github.com/owid/etl/pull/1234
I haven't yet benchmarked it properly across real-world tasks; that would be the thing to see.
@larsyencken nice! Approved. What do you think about switching to multiprocessing (-35% in theory)? Or even going back from processes to isolated modules without a separate process? That would reduce overhead to basically zero, though we wouldn't be able to set limits on memory (which is something we might not need anymore with our huge server, or we could add the limit to the `etl` command itself).
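For the memory-limit point, one option along the lines of "add the limit to the `etl` command itself" would be capping the whole process at startup. A minimal sketch with the standard `resource` module (the helper name and where it would be called from are my assumptions; `RLIMIT_AS` is Unix-only):

```python
import resource

def set_memory_limit(max_gb: float) -> None:
    # Hypothetical helper: cap the address space of the current process,
    # so dropping per-step subprocesses doesn't mean giving up memory
    # limits entirely. Unix only.
    limit_bytes = int(max_gb * 1024**3)
    resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, limit_bytes))
```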
It'd be interesting to ask data managers how important latency is for them. Avoiding the ~3s overhead is nice for development for me, but perhaps it's less important for them (since they work mostly in notebooks).
To complicate things even more 😄, we could control whether to use processes or modules with an env variable (like `DEBUG`), which would also set `--workers 1` and perhaps turn on log verbosity (which I often miss).
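As a rough illustration of that env-variable idea (all names here are hypothetical wiring, not existing ETL options):

```python
import logging
import os

# Hypothetical DEBUG switch; none of these names exist in the ETL today.
DEBUG = os.environ.get("DEBUG") == "1"

use_isolated_modules = DEBUG    # run steps in-process instead of spawning them
workers = 1 if DEBUG else None  # None -> respect whatever --workers was given
logging.basicConfig(level=logging.DEBUG if DEBUG else logging.INFO)
```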
Honestly, now that we have a large graph, I think we should offer parallel execution of the DAG.
Steps that want to parallelise should have the option to share the common process pool, rather than creating their own. That would bring down build times massively for the nightly build, and make working with larger datasets with many tables much nicer I think.
Within those processes, they'd only tackle one job at a time, and then I think we could use your isolated-modules solution safely.
Thoughts?
Yep, I've also been thinking about parallelisation for some time. I don't think it'd make data managers' work any faster, but it could reduce times for both the nightly build and normal builds (especially when editing a heavy dependency like regions).
It's probably not going to be hard to implement: add a process pool and a queue of steps, and keep submitting steps whose dependencies have finished. It sounds like a fun exercise; let me know if you want to tackle it yourself 😉.
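For reference, a minimal stdlib-only sketch of that scheduler; `run_step` is a stand-in for however the ETL actually executes a step, and the DAG is simplified to a name-to-dependencies mapping:

```python
from concurrent.futures import FIRST_COMPLETED, ProcessPoolExecutor, wait
from graphlib import TopologicalSorter


def run_step(step: str) -> str:
    """Placeholder for real ETL step execution."""
    print(f"running {step}")
    return step


def run_dag_parallel(dag: dict[str, set[str]], workers: int = 4) -> None:
    """Run steps as soon as all of their dependencies have finished.

    `dag` maps each step to the set of steps it depends on, e.g.
    {"garden/foo": {"meadow/foo"}, "meadow/foo": set()}.
    """
    sorter = TopologicalSorter(dag)
    sorter.prepare()
    with ProcessPoolExecutor(max_workers=workers) as pool:
        futures = {}
        while sorter.is_active():
            # Submit every step whose dependencies have all finished.
            for step in sorter.get_ready():
                futures[pool.submit(run_step, step)] = step
            # Wait for at least one running step to complete.
            done, _ = wait(futures, return_when=FIRST_COMPLETED)
            for fut in done:
                step = futures.pop(fut)
                fut.result()       # re-raise any exception from the step
                sorter.done(step)  # unblocks steps that depended on it
```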
Hey! On the parallel ETL, I took a stab before, then put it down. A few notes on where I got to, for when you look at it:
- My dream API would have used a process pool, then passed the same pool to the `run()` method of the step.
- It's possible to salvage that approach by having two types of `run()` methods: if a step has `run_mp()`, pass it the pool; if there is just `run()`, spawn it to run in the pool (see the sketch after this list).
- On the other hand, there's the risk of just making everything too complex, so I abandoned it for now in favour of parallelising between steps (not within steps).
- `graphlib.TopologicalSorter()` is designed to make spawning tasks from the DAG as soon as they're ready easy (see the second example in its docs), but it would take a little refactoring.
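A sketch of that two-flavoured `run()` dispatch, building on the pool idea above; `run_mp()` is the hypothetical name from this thread, and how steps get pickled into the pool is glossed over:

```python
from concurrent.futures import Future, ProcessPoolExecutor
from typing import Optional


def execute_step(step, pool: ProcessPoolExecutor) -> Optional[Future]:
    # Steps that know how to parallelise internally get the shared pool
    # and run in the current process, so they can submit work directly.
    if hasattr(step, "run_mp"):
        step.run_mp(pool)
        return None
    # Plain steps are spawned into the pool as a single job each.
    return pool.submit(step.run)
```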
I've added back the isolated environment, enabled with the `DEBUG=1` flag. Reducing the overhead is really useful for "live reloading" of data pages when working on metadata.
Made better this cycle via parallelisation improvements, as of https://github.com/owid/etl/pull/2038.
We're currently running steps as processes through the `run_python_step` function. Creating such a process and importing all Python libraries takes 2s or more (I've looked into it, and it's really just loading libraries). This can be tested with the tuna profiler; below is a screenshot from running it.
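(For context, tuna can visualise the log produced by `python -X importtime`. A cruder in-Python check of the same thing, with an illustrative rather than exact list of libraries, looks like this:)

```python
import importlib
import time

# Time a few heavy imports one by one; the list is illustrative, not the
# exact set an ETL step pulls in.
for name in ["numpy", "pandas"]:
    start = time.perf_counter()
    importlib.import_module(name)
    print(f"import {name}: {time.perf_counter() - start:.2f}s")
```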
I've also done profiling on `time etl data://grapher/fasttrack/ --force --only`, comparing three approaches:

- `multiprocessing.Process` - 17s
- `subprocess.check_call` - 26s

Running multiprocessing doesn't improve it that much.
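One plausible explanation for that gap (my reading, not measured in the thread): with the default `fork` start method on Linux, a `multiprocessing.Process` child inherits the parent's already-imported modules, while `subprocess.check_call` starts a fresh interpreter that re-imports everything. A small illustration:

```python
import multiprocessing as mp
import subprocess
import sys


def child() -> None:
    import pandas  # noqa: F401  # already in memory when forked from the parent


if __name__ == "__main__":
    import pandas  # noqa: F401  # parent pays the import cost once

    mp.set_start_method("fork")   # Unix only; the default on Linux
    p = mp.Process(target=child)  # child reuses the parent's loaded modules
    p.start()
    p.join()

    # A fresh interpreter has to import pandas from scratch every time.
    subprocess.check_call([sys.executable, "-c", "import pandas"])
```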
Here are related issues