near / queryapi

Near Indexing as a Service
17 stars 3 forks source link

Replace Index File Backfill with Bitmap Indexer Backfill #744

Closed darunrs closed 2 months ago

darunrs commented 4 months ago

Fully replace Index file backfill with Bitmap query backfill. This involves switching out the delta lake get_matching_block_heights function call with the BlockHeightStream function of the same name. The yield of the BlockHeightStream call is a single block height instead of a full list, so some behavior of backfill could change, potentially for the better. In addition, the first iteration of the new backfill may have some performance problems. Thus, the performance needs to be evaluated and any less complex improvements should be made (Such as preloading the next day's bitmaps while yielding the current day).

darunrs commented 3 months ago

Leaving Context here regarding https://github.com/near/queryapi/pull/803 before I go on a 2 week Leave.

I've migrated the code to use the new production indexer. I reran tests and it all still passes. There's a couple things I have not done explicitly which I'd like to do for manual testing, before releasing to dev.

There's various callouts of unused code by Cargo, which are definitely false. For example, that the new() constructor for CompressedBitmap, or StreamExt. Not sure what to do about those.

For next steps, the main performance problem is that we only query graphql next when we exhaust the current stream of block heights. I would like to prefetch the next stream while yielding the current one. In many cases, the yielding of the block will happen extremely fast, so there's wait time between block dates. Otherwise, it's mainly optimizing the indexer for space or query speed.

For a quick explanation of the algorithm: We store base64 encoded bitmaps in the indexer. Decoded, they look like this: [Sign Bit][EG of SIgn Bit][EG of Other Bit][EG of Sign Bit]... An example would be: 1 00101 0001110

We encode the EG for compressing consecutive bits down. Essentially encoding X number of Y value bits instead of having Y bits X times. The Elias Gamma of a single bit is the bit itself. So no compression is done. Otherwise, it takes the following form: [N number of 0s]1[binary of remainder, of length N]. If there are 10 consecutive 1s for example, we find the largest power of 2 which fits inside the number. In this case its 8 so the power is 3. We write three zeros: 000. Then we write a 1, to indicate the 0s portion of the EG is done: 0001. Finally, we enncode the remainder (10 - 8 = 2) as binary, padded left such that the length of the binary is N. So, in this case, its 010. Combined, we get 000 1 010. This is done repeatedly for the other bit value until the end is reached.

The first bit in the bitmap is special as it merely encodes what sign the first elias gamma encodes. The remaining EGs can be guessed as to what they encode by counting them. We do explicitly write what bit value the last EG is encoding as when we want to append a bit to the end of the bitmap, we can decompress only the last EG instead of all of them.