Open Smallhi opened 4 years ago
The current repository is only a demonstration, not really meant to be used in production. That is, someone could start from this code, or other related code, and make something general useful out of it... and bring it to the core library.
In the Go version of Roaring bitmaps... BSI was included as part of the core library. The same could be done within the core Roaring library.
I think that several folks have worked quite a bit on BSI in the Java/Roaring context (me included as you have observed). But before we bring it to the core library, we have to make sure that the code is sound and can be deployed in production. It implies some care with respect to testing and documentation.
What it takes is some entreprising programmers...
cc @richardstartin
Thanks very much for you reply!
I've implemented rangeEq/range and sum for BSI. TopK not yet! And I'm not sure the implemention can be use generally
In our prod env, we split RoraringBitmap by fixed length width for distributed computing, since the sinlge RoaringBitmap is more than 100M(the input id is not continuous). In this scenario, if we merge topK when reduce, I think the result might not be precise. but I have no way to solve this problem.
And for OLAP storage eg Arrow, druid, parquet etc. They all use vector computing(sometimes use SIMD), but for BSI storage, I think the Iterative calculation after vectorization cost is huge. I've researched this question for a long time, but get nothing....
@Smallhi If you can, I would encourage you to try and prepare a pull request. Keep things small (at first) and include tests and documentation.
@lemire I don't think I can find time to do this myself
I don't think I can find time to do this myself
That is not what I was proposing. Rather I was pinging you because you have relevant expertise and experience.
@lemire I'll do this. And I'd like pull request after I deploy it on our prod env and make sure the performace is ok. So I might need one month. Is that ok?
Sure.
@lemire I'll do this. And I'd like pull request after I deploy it on our prod env and make sure the performace is ok. So I might need one month. Is that ok?
最近也在看BSI,不知道你们那里进展咋样?
@Smallhi can you share an update on your progress?
@lemire I've finished the basic functions for BSI in java according to Golang RoaringBitmap lib.
so, until now, I didn't commit the pr yet.
@qiuqingyu if you need, I can shard the experimental code for you.
@lemire I've finished the basic functions for BSI in java according to Golang RoaringBitmap lib.
- construct a BSI by min/max int value
- parallel compare ( EQ/NEQ/GreatThan/GreatEqual/LessThan/LessThan)
- transpose
- merge But I haven't got chance to validate these functions in production env for some reasons....
so, until now, I didn't commit the pr yet.
@qiuqingyu if you need, I can shard the experimental code for you.
太感谢了,如果可以的话非常感谢能把代码分享给我,希望能通过代码来更好的理解BSI的原理
@lemire I'm gonna commit a pr to RoaringBitmap lib. Could u help me review the design doc?
How do we express bit slice index structure? BitSliceIndex can be use Pair<Integer,Integer> compression for real dataset. Given that we hava a Table T, which have columns(rowid,c1,c2...cn), we can compress any (rowId,c1) which satify rowId and c1 are Integer. For Integer c1 (32 bits),we can check each bit component. if the bit turn on, we put the index to slice which store using RoaringBitmap. All slices, we use an array to store; And aslo, if rowId hava value, we put this rowId into ebM( exist bitmap) .
Why do we need three implements for bitsliceindex inferface? As the UML figure show,we have 3 different implements for bitsliceindex.
In real production env, Given that we have 10 Billion users (columnId, cid), each user hava an integer value. when we use BitsliceIndex to store these key-value data. Each slice will be more than 120MB.If the max value is Integer.MAX, BitSliceIndex could be more than 120MB*33 = 3960MB, the serialization and deserialization will be expensive. On the other hand, for OLAP,we often use compare/sum/topK API,which needn't update the BitsliceIndex in memory. So we can only map the slice data to memory using ImmutableRoaringBitmap, which is more efficient. But for OLTP, we can use MutableBitSliceIndex to update, but I'm not sure how can we handle OOM problem. In our prod env, we shard and use ditributed system to deal with this problem.
but for function in() which given a Set
@Smallhi This looks good.
If you have unsigned arithmetic, you can always 'almost' make signed arithmetic work if you take the two complement's path. It is the great thing about two complement's.
BSI is a great way when we use to store high cardinality column for OLAP. the bitmap index such as druid, arrow, parquet is limited .... BSI can be use for point/range query and sum/topK