lemire / BitSliceIndex

Experiments on bit-slice indexing
Apache License 2.0
13 stars 8 forks source link

could you merge this lib to RoraringBtimap? #1

Open Smallhi opened 4 years ago

Smallhi commented 4 years ago

BSI is a great way when we use to store high cardinality column for OLAP. the bitmap index such as druid, arrow, parquet is limited .... BSI can be use for point/range query and sum/topK

lemire commented 4 years ago

The current repository is only a demonstration, not really meant to be used in production. That is, someone could start from this code, or other related code, and make something general useful out of it... and bring it to the core library.

In the Go version of Roaring bitmaps... BSI was included as part of the core library. The same could be done within the core Roaring library.

I think that several folks have worked quite a bit on BSI in the Java/Roaring context (me included as you have observed). But before we bring it to the core library, we have to make sure that the code is sound and can be deployed in production. It implies some care with respect to testing and documentation.

What it takes is some entreprising programmers...

cc @richardstartin

Smallhi commented 4 years ago

Thanks very much for you reply!

I've implemented rangeEq/range and sum for BSI. TopK not yet! And I'm not sure the implemention can be use generally

In our prod env, we split RoraringBitmap by fixed length width for distributed computing, since the sinlge RoaringBitmap is more than 100M(the input id is not continuous). In this scenario, if we merge topK when reduce, I think the result might not be precise. but I have no way to solve this problem.

And for OLAP storage eg Arrow, druid, parquet etc. They all use vector computing(sometimes use SIMD), but for BSI storage, I think the Iterative calculation after vectorization cost is huge. I've researched this question for a long time, but get nothing....

lemire commented 4 years ago

@Smallhi If you can, I would encourage you to try and prepare a pull request. Keep things small (at first) and include tests and documentation.

richardstartin commented 4 years ago

@lemire I don't think I can find time to do this myself

lemire commented 4 years ago

I don't think I can find time to do this myself

That is not what I was proposing. Rather I was pinging you because you have relevant expertise and experience.

Smallhi commented 4 years ago

@lemire I'll do this. And I'd like pull request after I deploy it on our prod env and make sure the performace is ok. So I might need one month. Is that ok?

lemire commented 3 years ago

Sure.

qiuqingyu commented 3 years ago

@lemire I'll do this. And I'd like pull request after I deploy it on our prod env and make sure the performace is ok. So I might need one month. Is that ok?

最近也在看BSI,不知道你们那里进展咋样?

lemire commented 3 years ago

@Smallhi can you share an update on your progress?

Smallhi commented 3 years ago

@lemire I've finished the basic functions for BSI in java according to Golang RoaringBitmap lib.

so, until now, I didn't commit the pr yet.

@qiuqingyu if you need, I can shard the experimental code for you.

qiuqingyu commented 3 years ago

@lemire I've finished the basic functions for BSI in java according to Golang RoaringBitmap lib.

  • construct a BSI by min/max int value
  • parallel compare ( EQ/NEQ/GreatThan/GreatEqual/LessThan/LessThan)
  • transpose
  • merge But I haven't got chance to validate these functions in production env for some reasons....

so, until now, I didn't commit the pr yet.

@qiuqingyu if you need, I can shard the experimental code for you.

太感谢了,如果可以的话非常感谢能把代码分享给我,希望能通过代码来更好的理解BSI的原理

Smallhi commented 3 years ago

@lemire I'm gonna commit a pr to RoaringBitmap lib. Could u help me review the design doc?

image

In real production env, Given that we have 10 Billion users (columnId, cid), each user hava an integer value. when we use BitsliceIndex to store these key-value data. Each slice will be more than 120MB.If the max value is Integer.MAX, BitSliceIndex could be more than 120MB*33 = 3960MB, the serialization and deserialization will be expensive. On the other hand, for OLAP,we often use compare/sum/topK API,which needn't update the BitsliceIndex in memory. So we can only map the slice data to memory using ImmutableRoaringBitmap, which is more efficient. But for OLTP, we can use MutableBitSliceIndex to update, but I'm not sure how can we handle OOM problem. In our prod env, we shard and use ditributed system to deal with this problem.

but for function in() which given a Set values, we will return all coulmnIds using Roaringbitmap. We only have multi-thread version now.

lemire commented 3 years ago

@Smallhi This looks good.

If you have unsigned arithmetic, you can always 'almost' make signed arithmetic work if you take the two complement's path. It is the great thing about two complement's.