Closed dougbrn closed 9 months ago
I've been poking around with this. It's feasible to replace lambda calls with functions and have that reflected deep in the dask profiler, but it looks like it's harder to consistently get something like this:
Where "mymeanfunction" (being applied via batch) shows up consistently as a top-level task on the progress bar. The challenge is that the task label takes on the name of the last dask function applied. So for a certain workflow, you can have something like this happen, where we do batch output standardization using Ensemble._standardize_batch and the final operation applied is to convert to an EnsembleFrame, yielding a task label of "TapeFrame" here:
We can still make some progress by having functions show up deeper in the profiler, like an _apply_batch function, which can be found in the profiler:
But this is an incremental change that I'm not sure addresses the real crux of this ticket, which I think is wanting something like the first image? @hombit let me know if that's the case. Getting something like the first image consistently would involve having our hands a lot deeper into dask delayed and the lower-level collections, and even then I'm not sure exactly what we can do. I think it's worth going ahead with the lambda replacement regardless, because it reads better code-wise, but again I'm not sure it would really solve the core problem highlighted in this ticket.
Thank you for the investigation, @dougbrn, I hadn't realized how Dask names things. I'm OK with closing this issue if it wouldn't help with profiling.
Let's keep this open until we can have a conversation at a future meeting, just to verify the level of importance vs. effort.
Overloaded lambdas lead to ambiguity in dask dashboard profiling, so we should replace these (starting with batch and sync_tables as the biggest users of them) with specifically-defined functions to be able to tell them apart. Raised by @hombit.
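
For illustration, here is a minimal sketch of why lambdas are hard to tell apart in any name-based profiler view. It uses plain Python only and makes no Dask calls. The helper `label_for` and the functions `batch_mean` and `sync_drop_missing` are hypothetical names invented for this example, not part of the codebase, and `label_for` is only a stand-in for "derive a label from the callable's name". Where such a name actually surfaces in the Dask dashboard depends on which operation is applied last, as discussed in the comments above.

```python
def label_for(func):
    """Hypothetical stand-in for a name-based label; not a Dask API."""
    return getattr(func, "__name__", type(func).__name__)


# Two unrelated operations written as lambdas: Python gives both the same name.
batch_op = lambda df: df.mean()
sync_op = lambda df: df.dropna()


# The same operations as specifically-defined functions (hypothetical names).
def batch_mean(df):
    return df.mean()


def sync_drop_missing(df):
    return df.dropna()


lambda_labels = [label_for(batch_op), label_for(sync_op)]
named_labels = [label_for(batch_mean), label_for(sync_drop_missing)]

print(lambda_labels)  # ['<lambda>', '<lambda>']
print(named_labels)   # ['batch_mean', 'sync_drop_missing']
```

With named functions, the labels are at least distinguishable wherever the function name is carried through; this does not by itself guarantee that the name becomes the top-level task label.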