hashgraph / hedera-mirror-node

Hedera Mirror Node archives data from consensus nodes and serves it via an API
Apache License 2.0

Research database for transaction filters #8101

Closed steven-sheehy closed 4 months ago

steven-sheehy commented 5 months ago

Problem

In https://github.com/hashgraph/hedera-mirror-node/issues/7860, the pg_roaringbitmap extension was explored. We should also look at standalone search databases to see if they provide a better off-the-shelf experience.

Solution

Alternatives

No response

xin-hedera commented 4 months ago

Did a simple PoC with Solr, using the following as the schema:

{
  "id": "123456789000000000",
  "timestamp": 123456789000000000,
  "entity_ids": [3, 98, 1567890]
}
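For illustration, a minimal Python sketch of how one transaction could be mapped to a document of this shape. The function name `to_solr_doc` and the de-duplication of entity IDs are assumptions, not part of the PoC; the only thing taken from the schema above is that `id` appears to be the timestamp rendered as a string.

```python
import json


def to_solr_doc(consensus_timestamp: int, entity_ids: list[int]) -> dict:
    """Build one document in the shape of the PoC schema (hypothetical helper)."""
    return {
        # Unique key: the timestamp rendered as a string, as in the sample.
        "id": str(consensus_timestamp),
        # Numeric copy of the timestamp, usable for sorting and range filters.
        "timestamp": consensus_timestamp,
        # Multi-valued field; de-duplicating and sorting is an assumption.
        "entity_ids": sorted(set(entity_ids)),
    }


doc = to_solr_doc(123456789000000000, [98, 3, 1567890, 3])
print(json.dumps(doc))
```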

The initial testing was done with 3 shards, each with 2 replicas. The cluster has 3 Solr nodes, each with up to 8 cores and 12G of memory.

Both ingestion at 10k TPS and top-k queries for a specific entity ID worked well while the collection was relatively small, in the low hundreds of millions of documents: ingestion averaged around 700 ms and top-k queries around 300 ms.
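The exact query used in the PoC is not shown. As a rough sketch, a top-k lookup for one entity ID could be expressed with Solr's standard `q`, `sort`, `rows`, and `fl` query parameters. It is assumed here that "top-k" means the k most recent timestamps, and the collection name `transactions` is hypothetical.

```python
from urllib.parse import urlencode


def top_k_query(entity_id: int, k: int) -> str:
    """Return the query string for the k most recent documents of an entity."""
    params = {
        "q": f"entity_ids:{entity_id}",  # match documents listing this entity
        "sort": "timestamp desc",        # newest first (assumed ordering)
        "rows": k,                       # limit the result to k documents
        "fl": "timestamp",               # return only the timestamp field
    }
    return urlencode(params)


qs = top_k_query(98, 25)
# Hypothetical collection name; the real one is not given in the thread.
print(f"/solr/transactions/select?{qs}")
```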

However, query time clearly trends upward as the collection grows: at around 700 million documents, it increased to ~8 seconds.

The same test was repeated with 6 Solr nodes, 6 shards, and 3 replicas. Query time was better, but it still showed the same slowdown trend as the collection grew.

The conclusion is that Solr is not a scalable solution for tens of billions of documents.
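To put these collection sizes in perspective, here is a back-of-the-envelope calculation. It assumes one document per transaction and a sustained 10k TPS, and it uses 10 billion as the lower bound of "tens of billions"; none of these assumptions were measured in the PoC.

```python
def hours_to_reach(doc_count: int, docs_per_second: int) -> float:
    """Hours of sustained ingestion needed to accumulate doc_count documents."""
    return doc_count / docs_per_second / 3600


# Size at which query time reached ~8 seconds.
hours_700m = hours_to_reach(700_000_000, 10_000)
# Lower bound of "tens of billions", expressed in days.
days_10b = hours_to_reach(10_000_000_000, 10_000) / 24

print(f"700 million documents: {hours_700m:.1f} hours")  # 19.4 hours
print(f"10 billion documents: {days_10b:.1f} days")      # 11.6 days
```

Under these assumptions, the collection would reach the size where queries slowed to ~8 seconds in under a day of ingestion at full rate.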