related-sciences / gwas-analysis

GWAS data analysis experiments

Build PyData prototype for GWAS analysis #20

Open eric-czech opened 4 years ago

eric-czech commented 4 years ago

This issue tracks several more specific issues related to building a usable prototype.

Some things we should tackle for this are:

mrocklin commented 4 years ago

I saw this issue pointed to from https://github.com/dask/dask-blog/pull/38. Some small comments:

We need to know how to use Dask at scale

FYI I'm making a company around this question. Let me know if you want to chat or be beta testers for cloud deployment products.

Figuring out what is going on with it in https://h2oai.github.io/db-benchmark/ would be a good exercise

They first load the entire dataset in RAM. Pandas doesn't store string data efficiently. As a result Dask is often spilling to disk during those benchmarks, which is why it's slow. We encouraged them to just include the time to read data from disk rather than starting from memory, but the maintainers of the benchmark said that that would be unfair.

Benchmarks are hard to do honestly.
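As a rough illustration of the string-memory point above, here is a hedged sketch (the column name, values, and row count are invented, not taken from the benchmark) comparing how much RAM pandas uses for an object-dtype string column versus the same data stored as a categorical:

```python
# Hedged sketch (not from the benchmark): compare the memory footprint of an
# object-dtype string column in pandas with the same data as a categorical.
# The column values and row count are invented for illustration.
import numpy as np
import pandas as pd

n = 1_000_000
groups = pd.Series(np.random.choice(["alpha", "beta", "gamma", "delta"], size=n))

# deep=True counts the actual Python string objects, not just the array of pointers
print(f"object dtype:      {groups.memory_usage(deep=True) / 1e6:.1f} MB")
print(f"categorical dtype: {groups.astype('category').memory_usage(deep=True) / 1e6:.1f} MB")
```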

eric-czech commented 4 years ago

Hey Matt,

Let me know if you want to chat or be beta testers for cloud deployment products.

Will do, but deployment isn't a big concern quite yet.

They first load the entire dataset in RAM. Pandas doesn't store string data efficiently. As a result Dask is often spilling to disk during those benchmarks, which is why it's slow. We encouraged them to just include the time to read data from disk rather than starting from memory, but the maintainers of the benchmark said that that would be unfair.

Good to know! It will definitely be helpful to see how we could get to that conclusion with task stream monitoring. Performance with .persist() (I assume that's what they're doing based on your description) isn't particularly interesting for us, so I'm not worried about the actual times so much as becoming a better user. Do you happen to know if there is a Dask performance report for what they did somewhere?
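For reference, the distributed scheduler can write such a report itself via its performance_report context manager. A minimal sketch, with a made-up input path and a stand-in groupby in place of whatever the benchmark actually computes, might look like:

```python
# Minimal sketch of generating a Dask performance report; the input path and
# the groupby columns are stand-ins, not the benchmark's actual workload.
import dask.dataframe as dd
from dask.distributed import Client, performance_report

client = Client()  # local cluster with default settings

with performance_report(filename="dask-report.html"):
    ddf = dd.read_csv("data/*.csv")               # hypothetical input files
    ddf.groupby("id1")["v1"].sum().compute()      # hypothetical columns
```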

mrocklin commented 4 years ago

Not to my knowledge. The "loading from disk" behavior would be evident in the dashboard by orange/red memory bars in the memory plot as well as lots of orange bars showing up in the task stream (red is network transfer, orange is disk transfer).
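One way to reproduce that view locally, assuming nothing about the benchmark's setup, is to start a LocalCluster with a deliberately small per-worker memory limit and watch the dashboard while a computation runs:

```python
# Hedged sketch: start a LocalCluster with a small per-worker memory limit so
# that spilling shows up on the dashboard; the worker count and 2 GB limit are
# arbitrary choices, not the benchmark configuration.
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=4, threads_per_worker=1, memory_limit="2GB")
client = Client(cluster)
print(client.dashboard_link)  # open this URL to watch the task stream and memory plot
```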
