kakao / s2graph

This code base is retained for historical interest only, please visit Apache Incubator Repo for latest one
https://github.com/apache/incubator-s2graph
Other
250 stars 32 forks source link

support random sampling query results #190

Closed HyunsungJo closed 8 years ago

HyunsungJo commented 8 years ago

Taking suggestions from @SteamShon and @ummae into account, I've ran benchmark tests on three variations of sampling, and decided to go with RNG Sampling. (Although I doubt that shuffling and taking n will have any realistic affect given that the number of queried edges is usually capped to thousands for service use.)

Please refer to SamplingBenchmarkSpec.scala for some of the sampling implementations that were tested.

daewon commented 8 years ago

:+1: @HyunsungJo

HyunsungJo commented 8 years ago

@ummae

Please say YES!!!!

ummae commented 8 years ago

@HyunsungJo :satisfied: o..ok!

djfwan commented 8 years ago

👍

ummae commented 8 years ago

@HyunsungJo closed. blazingly fast... lol I'm so sorry but I lov ... okay, let me give a chance once more to tackle you(...) I'd write Python code, Just for fun.. ;)

import time
import random

N = 10000
edges = [i for i in xrange(N)]

def sample(edges, n):
    idxs = set()
    while len(idxs) < n:
        idx = random.randint(0, n)
        if idx not in idxs:
            idxs.add(idx)
    return [edges[i] for i in idxs]

print 'RNG'
for n in [10, 100, 500, 1000, 5000, 8000]:
    start_t = time.time()
    sample(edges, n)
    end_t = time.time() - start_t
    print 'Draw %s elems from %s array: %s elapsed' % (n, N, end_t)

print 'SHUFFLE-TAKE'
for n in [10, 100, 500, 1000, 5000, 8000]:
    start_t = time.time()
    random.shuffle(edges)
    edges[:n]
    end_t = time.time() - start_t
    print 'Draw %s elems from %s array: %s elapsed' % (n, N, end_t)

RNG
Draw 10 elems from 10000 array: 5.60283660889e-05 elapsed
Draw 100 elems from 10000 array: 0.00070595741272 elapsed
Draw 500 elems from 10000 array: 0.00501108169556 elapsed
Draw 1000 elems from 10000 array: 0.0201549530029 elapsed
Draw 5000 elems from 10000 array: 0.107923030853 elapsed
Draw 8000 elems from 10000 array: 0.156128168106 elapsed
SHUFFLE-TAKE
Draw 10 elems from 10000 array: 0.00440907478333 elapsed
Draw 100 elems from 10000 array: 0.00439596176147 elapsed
Draw 500 elems from 10000 array: 0.00436115264893 elapsed
Draw 1000 elems from 10000 array: 0.0101671218872 elapsed
Draw 5000 elems from 10000 array: 0.005038022995 elapsed
Draw 8000 elems from 10000 array: 0.00643491744995 elapsed

What if user want to draw many from many? (or accidentally, anyway even though it is not intended usages of this feature..)

In my coarse experiment and you did, we can see that RNG sampler is faster than others only if N is small enough. I think in most of most case sampling size will be small, but there is no reason to preventing such corner case. what do you think?