Open dgrnbrg opened 5 years ago
Hi @dgrnbrg! When you say "in parallel", are you running multiple regressions in a single summarize call, i.e. `df.summarize([regression1, regression2...])`, or are you calling `df.summarize(regressionX)` from multiple Python threads?
Hey @icexelloss! I am running multiple (20-ish) regressions in a single summarize call. I found that it's very fast if I run 4-6 regressions per call, but the performance hits a cliff at some point. This is also on full calls to `summarize`, so I don't think it's a streaming windows thing.
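A minimal sketch of the batching workaround implied above (4-6 regressions per call rather than all 20 at once). The helper names `chunked` and `summarize_in_batches` are hypothetical, and the sketch only assumes that `df.summarize` accepts a list of regressions, as in the snippet quoted in this thread. It does not explain the cliff itself; it only keeps each call in the range reported as fast.

```python
from typing import Any, Callable, Iterator, List, Sequence


def chunked(items: Sequence[Any], size: int) -> Iterator[List[Any]]:
    """Yield consecutive batches of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def summarize_in_batches(
    summarize: Callable[[List[Any]], Any],
    regressions: Sequence[Any],
    batch_size: int = 5,
) -> List[Any]:
    """Call `summarize` once per batch and collect the per-batch results.

    `summarize` is whatever callable wraps the real call, for example
    `df.summarize` (assumed here to accept a list of regressions).
    """
    return [summarize(batch) for batch in chunked(regressions, batch_size)]
```

Usage would look like `results = summarize_in_batches(df.summarize, regressions, batch_size=5)`, where `regressions` is the list of ~20 regression summarizers. How the per-batch results are combined afterwards is left open here, since that depends on what each call returns.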
Hello, I am using many regressions in parallel over a single call to `summarize`. I've noticed that if I run ~20 regressions on a dataset with 5M rows, it takes 45-60 minutes to summarize. If I run a single regression on a similarly sized dataset, however, it takes only a minute or two. What kinds of performance characteristics should I expect, and how can I avoid this kind of performance collapse? Thank you!