Open pvzed opened 8 years ago
for most skeches, yes. But some sketches (e.g. CountMin) require the weight to be positive.
for stream of 5000 elements where there are 1000 distinct. what should be w for zero mistake? I want to understand how does the mistake is less when w grows.
Are you trying to count the number of distinct elements in a data stream? Sorry but I haven't implemented distinct counter in this project (will add later).
You may want to take a look at my C++ implementation,
basically, if you set the buffer size as w, you will get an estimation of the true number with relative error 1/sqrt{w}. See section 3 of this tutorial for detail (only if you are interested in).
BTW, I am considering re-implement this project using Cython while keep the api as similar as the current one, not sure if I have time recently.
I don't want to have count of distinct elements, I know I have 1000 different elements in my stream. I want to know what is the total weight(count) of each element in my stream, which includes 5000 points and 1000 different keys.
what would be the right w for zero mistake? it seems that for w = 1000 there are mistakes. Is that possible?
2016-02-05 2:01 GMT+02:00 jiecchen notifications@github.com:
Are you trying to count the number of distinct elements in a data stream? Sorry but I haven't implemented distinct counter in this project (will add later).
You may want to take a look at my C++ implementation,
- project page: StreamingCC https://github.com/jiecchen/StreamingCC
- api docs for distinct counter: DistinctCounter http://xmerge.me/StreamingCC-api/classScc_1_1DistinctCounter.html
basically, if you set the buffer size as w, you will get an estimation of the true number with relative error 1/sqrt{w}. See section 3 of this tutorial http://www.cs.dartmouth.edu/%7Eac/Teach/CS49-Fall11/Notes/lecnotes.pdf for detail (only if you are interested in).
BTW, I am considering re-implement this project using Cython while keep the api as similar as the current one, not sure if I have time recently.
— Reply to this email directly or view it on GitHub https://github.com/jiecchen/StreamLib/issues/1#issuecomment-180111727.
Thanks alot for the reply!
2016-02-05 2:08 GMT+02:00 Boyko Anton antoooonnn@gmail.com:
I don't want to have count of distinct elements, I know I have 1000 different elements in my stream. I want to know what is the total weight(count) of each element in my stream, which includes 5000 points and 1000 different keys.
what would be the right w for zero mistake? it seems that for w = 1000 there are mistakes. Is that possible?
2016-02-05 2:01 GMT+02:00 jiecchen notifications@github.com:
Are you trying to count the number of distinct elements in a data stream? Sorry but I haven't implemented distinct counter in this project (will add later).
You may want to take a look at my C++ implementation,
- project page: StreamingCC https://github.com/jiecchen/StreamingCC
- api docs for distinct counter: DistinctCounter http://xmerge.me/StreamingCC-api/classScc_1_1DistinctCounter.html
basically, if you set the buffer size as w, you will get an estimation of the true number with relative error 1/sqrt{w}. See section 3 of this tutorial http://www.cs.dartmouth.edu/%7Eac/Teach/CS49-Fall11/Notes/lecnotes.pdf for detail (only if you are interested in).
BTW, I am considering re-implement this project using Cython while keep the api as similar as the current one, not sure if I have time recently.
— Reply to this email directly or view it on GitHub https://github.com/jiecchen/StreamLib/issues/1#issuecomment-180111727.
@pvzed Hi, the randomized sketches (such as CountMin, CountSketch) are designed for processing large data stream. Say you have 1 billion distinct elements, and the length of your data stream can be as long as 100 billion. Maintaining a count for each element is not practical and you only care about the estimation of count (say you are monitoring the network, really you do not care about the exact number of visitors, a rough estimation is enough).
The sketch gives you a way to use a very small space (say 10000 bytes), given any query (an element), it returns the estimation of its total count.
As for your question, they can not guarantee zero mistake, they will introduce small relative error. If you care about the exact count, randomized sketch is not a good choice.
Yeah I should write some tutorial for people who are not familiar with randomized sketches. Those data structures were developed in recent 10 years and haven't be widely deployed in industry products.
You may want to take a look at the wiki of Count-Min, but many other sketches do not have wiki pages.
Thanks again!! 2 Questions: 1)What about MG algorithm? I see that you have implemented this but it's not DONE>
2)I'm running tests on your implementation to see the mistake. I define "diff": a vector of differences between the sum of the sketch result and the real result. The mistake(or cost) will be norm2 of this vector. However, what is "buffer size"? how many numbers are stored in memory if the buffer size is w? BTW I changed typecode to be 'f' instead of 'i'
I'm doing of all this because I'm testing an algorithm which should give better results than the sketches,the problem is that I'm not familiar with the theory of the field of sketches to compare it well. It guarantees a better accuracy and less memory.
Here is an example for a graph of 3 of your DONE algs(F2 is not included): x axis is w, y is the cost/mistake. Data set is 1000 elements,where there are 1000 distinct. each element has weight from 0 to 1. Do you think the result are good? http://srv2.jpg.co.il/2/56b50b6a655f0.png
@pvzed For you questions:
BTW, this lecture notes also include theory of many other sketches/streaming algorithms. If you set buffer size as w, the total space will be O(w log n) where n can be considered as the possible length of the data stream. typically log n < 30.
I am very interested in what you were saying
" I'm testing an algorithm which should give better results than the sketches,"
are you writing a research paper? I know some results improved the CountMin, CountSketch, e.g. STOC'16. If what you claimed is true, it could be a excellent result.
It guarantees a better accuracy and less memory.
Can you give me your email?
2016-02-05 23:42 GMT+02:00 jiecchen notifications@github.com:
For you questions:
- MG algorithm is used to find the frequent items in a data stream. See Sec 1.2 of this http://www.cs.dartmouth.edu/%7Eac/Teach/CS49-Fall11/Notes/lecnotes.pdf lecture notes.
BTW, this lecture notes also include theory of many other sketches/streaming algorithms. If you set buffer size as w, the total space will be O(w log n) where n can be considered as the possible length of the data stream. typically log n < 30.
- This figure looks reasonable. But please keep in mind only when the data stream is very long and the number of distinct elements are reasonable large, skeches can outperform traditional algorithms.
I am very interested in what you were saying
" I'm testing an algorithm which should give better results than the sketches,"
are you writing a research paper? I know some results improved the CountMin, CountSketch, e.g. STOC'16 http://arxiv.org/abs/1511.00661. If what you claimed is true, it could be a excellent result.
It guarantees a better accuracy and less memory.
— Reply to this email directly or view it on GitHub https://github.com/jiecchen/StreamLib/issues/1#issuecomment-180571279.
@pvzed jiecchen AT indiana.edu
Thanks.