pyr / cyanite

cyanite stores your metrics
http://cyanite.io
Other
446 stars 79 forks source link

Avoid fetching id since we always know it in advance #261

Closed ifesdjeen closed 7 years ago

ifesdjeen commented 7 years ago

Should address #257

I'm not completely happy with the code just yet since we have two levels of mapcats now, so posting it only for information for now.

pyr commented 7 years ago

The pmap approach is supposed to be faster than an IN query?

ifesdjeen commented 7 years ago

It may be slightly faster, but it was rather unintended. I've completely refactored it, so will push soon. The main advantage is that we're not downloading the id for each column. I was just unsure if not using pmap will result into sequential access, but I still haven't confirmed that. Cassandra returns resultsets that are futures, but I'm not sure what exactly alia does to it further and if we're getting a seq future or block/wait, will have to dig in.

mpenet commented 7 years ago

alia doesn't do anything special regarding async/futures, it just merely wraps the driver, so you can do both: either block and get results sequentially via execute (buffered/streamed via laziness), or do scather/gather like queries with separate execute-async (via simple callbacks or the core.async variant using alts!), there's an example of that with core.async in the readme, it's quite straightforward.

That said pmap might yield better results in synthetic benchmark but it runs queries in an unbounded threadpool (it limits parallelism per call, but not for all calls globally, since it calls clojure.core/future internally), so it might actually be dangerous imho. Usually the rule of thumb for pmap is not to use it unless for playing in the repl, at least that's my intuition.

That said I wonder what's the cost of fetching the ID + IN. I guess if you could try partitioning the IDs in small batches and run parallel queries, measure and see if that's a good option, but I doubt it's far from the original approach.

ifesdjeen commented 7 years ago

Thanks for the details Max! I haven't done actual benchmarks, either. I hope to get around to do it tonight. I'll then implement it via call that'd wait for all the results and then merge them together. Gut feeling tells me it has to be much better than the current (master) approach, but the numbers will show the truth. I'll post the benchmark details and improved code.

pyr commented 7 years ago

I'm uncomfortable with a pmap based approach.

ifesdjeen commented 7 years ago

I agree, pmap was a bad idea from the beginning. I should've been less lazy and just have implemented it the right way through execute-async. @pyr you think that might be better?

ifesdjeen commented 7 years ago

I'll close this PR and redo it as @mpenet suggested after reworking engine.