Closed: steven-sheehy closed 4 months ago
Did a simple PoC with Solr, using the following as the schema:

```json
{
  "id": "123456789000000000",
  "timestamp": 123456789000000000,
  "entity_ids": [3, 98, 1567890]
}
```
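Such fields could be registered through the Solr Schema API with a payload roughly like the following. This is a sketch based on the description in this issue, assuming the default configset's `string`, `plong`, and `plongs` field types; it is not the exact schema used in the PoC:

```json
{
  "add-field": [
    {"name": "id", "type": "string", "indexed": true, "stored": true},
    {"name": "timestamp", "type": "plong", "indexed": true, "stored": true},
    {"name": "entity_ids", "type": "plongs", "indexed": true, "stored": true}
  ]
}
```

The payload would be POSTed to the collection's `/schema` endpoint, with `id` declared as the `uniqueKey` in the schema.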
`id` is the unique key. Note there is a limitation in Solr that the unique key can't be of a point type (for instance, `LongPointField`), so we have to resort to adding a string `id` field. The `entity_ids` field type is `plongs`, i.e., an array of longs. `timestamp` and `entity_ids` are indexed; `timestamp` has to be indexed so we can sort the result on it.

The initial testing was done with 3 shards, each with 2 replicas. The cluster has 3 Solr nodes, each with up to 8 cores and 12G of memory.
Both ingestion at 10ktps and the top-k query for a specific entity id worked well when the collection size was relatively small, in the low hundreds of millions of documents: ingestion averaged around 700ms, and the top-k query around 300ms.
However, there is a clear trend of increasing query time as the collection grows: at around 700 million documents, it increased to ~8 seconds.
The same test was redone with 6 Solr nodes, 6 shards, and 3 replicas; while the query time is better, it still shows the same slowdown trend as the collection size grows.
The conclusion is that Solr is not a scalable solution for tens of billions of documents.
Problem
In https://github.com/hashgraph/hedera-mirror-node/issues/7860, the pg_roaringbitmap extension was explored. We should also look at standalone search databases to see if they provide a better off-the-shelf experience.
Solution
Alternatives
No response