twosigma / flint

A Time Series Library for Apache Spark
Apache License 2.0

Regression/summarizer performance collapse? #74

Open dgrnbrg opened 5 years ago

dgrnbrg commented 5 years ago

Hello, I am using many regressions in parallel over a single call to summarize. I've noticed that if I run ~20 regressions on a dataset with 5M rows, it seems to take 45-60 minutes to summarize. If I run a single regression on a similarly-sized dataset, however, it only takes a minute or two to summarize. What kinds of performance characteristics should I expect, and how can I avoid this kind of performance collapse?

Thank you!

icexelloss commented 5 years ago

Hi @dgrnbrg! When you say "in parallel", are you running multiple regressions in a single summarize call, i.e.,

df.summarize([regression1, regression2...])

Or are you calling df.summarize(regressionX) from multiple Python threads?

dgrnbrg commented 5 years ago

Hey @icexelloss! I am running multiple (20-ish) regressions on a summarize call. I found that it's very fast if I run 4-6 regressions per call, but the performance hits a cliff at some point. This is also on full calls to summarize, so I don't think it's a streaming windows thing.
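
For reference, here is a minimal sketch of the workaround this observation suggests: splitting the ~20 regressions into batches of about 5 and calling summarize once per batch. It is not a confirmed fix, only a way to stay under the point where performance drops off. The helper names `chunked` and `summarize_in_batches` are hypothetical and not part of Flint; the only assumption about Flint is that `df.summarize` accepts a list of summarizers, as in the snippet above.

```python
from typing import Callable, List, Sequence, TypeVar

S = TypeVar("S")  # summarizer type (e.g. a regression summarizer)
R = TypeVar("R")  # whatever one summarize call returns


def chunked(items: Sequence[S], size: int) -> List[List[S]]:
    """Split items into consecutive batches of at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


def summarize_in_batches(
    summarize: Callable[[List[S]], R],
    summarizers: Sequence[S],
    batch_size: int = 5,
) -> List[R]:
    """Call `summarize` once per batch and return the results in order.

    `summarize` is any callable that takes a list of summarizers,
    for example a bound `df.summarize` (assumed, not verified here).
    """
    return [summarize(batch) for batch in chunked(summarizers, batch_size)]
```

Usage would look like `results = summarize_in_batches(df.summarize, regressions, batch_size=5)`, where `regressions` is the list of regression summarizers. Each call returns its own result, so the per-batch outputs still need to be combined afterwards.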