eric-czech opened this issue 4 years ago
I saw this issue linked from https://github.com/dask/dask-blog/pull/38. Some small comments:
> We need to know how to use Dask at scale
FYI I'm making a company around this question. Let me know if you want to chat or be beta testers for cloud deployment products.
> Figuring out what is going on with it in https://h2oai.github.io/db-benchmark/ would be a good exercise
They first load the entire dataset in RAM. Pandas doesn't store string data efficiently. As a result Dask is often spilling to disk during those benchmarks, which is why it's slow. We encouraged them to just include the time to read data from disk rather than starting from memory, but the maintainers of the benchmark said that that would be unfair.
Benchmarks are hard to do honestly.
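For context, here is a minimal sketch (not the benchmark's actual code) contrasting the two timing approaches being discussed: starting from data already persisted in cluster memory, as the benchmark does, versus including the read from disk in the measurement. The file path and column names are placeholders.

```python
# Sketch only: compare timing a query against data persisted in cluster
# memory (benchmark-style) with timing that includes the read from disk.
# If the persisted data exceeds worker memory, Dask spills partitions to
# disk and the "in memory" timing degrades.
import time

import dask.dataframe as dd
from dask.distributed import Client, wait

client = Client()  # local cluster; prints a link to the dashboard

ddf = dd.read_csv("groupby_data_*.csv")  # placeholder path

# Benchmark-style: materialize the whole dataset in cluster memory first.
ddf_mem = ddf.persist()
wait(ddf_mem)

start = time.time()
ddf_mem.groupby("id1")["v1"].sum().compute()
print("groupby starting from memory:", time.time() - start)

# Alternative: count the disk read as part of the measured query.
start = time.time()
ddf.groupby("id1")["v1"].sum().compute()
print("groupby including read from disk:", time.time() - start)
```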
Hey Matt,
> Let me know if you want to chat or be beta testers for cloud deployment products.
Will do, but deployment isn't a big concern quite yet.
> They first load the entire dataset in RAM. Pandas doesn't store string data efficiently. As a result Dask is often spilling to disk during those benchmarks, which is why it's slow. We encouraged them to just include the time to read data from disk rather than starting from memory, but the maintainers of the benchmark said that that would be unfair.
Good to know! It will definitely be helpful to see how we could get to that conclusion with task stream monitoring. Performance with .persist() (I assume that's what they're doing based on your description) isn't particularly interesting for us, so I'm not worried about the actual times so much as being a better user. Do you happen to know if there is a Dask performance report for what they did somewhere?
Not to my knowledge. The "loading from disk" behavior would be evident in the dashboard by orange/red memory bars in the memory plot as well as lots of orange bars showing up in the task stream (red is network transfer, orange is disk transfer).
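For reference, a report like that can be captured with the `performance_report` context manager from `dask.distributed`; the sketch below uses a placeholder file path and groupby, not the benchmark's actual workload.

```python
# Rough sketch: wrap a workload in performance_report to save a standalone
# HTML page that includes the task stream, so disk (orange) and network
# (red) activity can be inspected after the fact.
import dask.dataframe as dd
from dask.distributed import Client, performance_report

client = Client()
ddf = dd.read_csv("groupby_data_*.csv").persist()  # placeholder data

with performance_report(filename="dask-report.html"):
    ddf.groupby("id1")["v1"].sum().compute()
```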
This issue tracks several more specific issues related to working towards a usable prototype.
Some things we should tackle for this are: