Closed: steven-sheehy closed 4 months ago
Did a simple PoC with Solr, using the following as the schema:

```json
{
  "id": "123456789000000000",
  "timestamp": 123456789000000000,
  "entity_ids": [3, 98, 1567890]
}
```
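Such fields could be registered through the Solr Schema API with a payload roughly like the following. This is a sketch based on the description in this issue, assuming the default configset's `string`, `plong`, and `plongs` field types; it is not the exact schema used in the PoC:

```json
{
  "add-field": [
    {"name": "id", "type": "string", "indexed": true, "stored": true},
    {"name": "timestamp", "type": "plong", "indexed": true, "stored": true},
    {"name": "entity_ids", "type": "plongs", "indexed": true, "stored": true}
  ]
}
```

The payload would be POSTed to the collection's `/schema` endpoint, with `id` declared as the `uniqueKey` in the schema.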
`id` is the unique key. Note there is a limitation in Solr that the unique key can't be of a point type (for instance, `LongPointField`), so we have to resort to adding a string `id` field. The `entity_ids` field type is `plongs`, i.e., an array of longs. `timestamp` and `entity_ids` are indexed; `timestamp` has to be indexed so we can sort the result on it.

The initial testing was done with 3 shards, each with 2 replicas. The cluster has 3 Solr nodes, each with up to 8 cores and 12G of memory.
Both ingestion at 10ktps and the top-k query for a specific entity id worked well when the collection size was relatively small, in the low hundreds of millions of documents: ingestion averaged around 700ms, and the top-k query around 300ms.
However, there is a clear trend of increasing query time as the collection grows: at around 700 million documents, it increased to ~8 seconds.
The same test was redone with 6 Solr nodes, 6 shards, and 3 replicas; while the query time is better, it still shows the same slowdown trend as the collection size grows.
The conclusion is that Solr is not a scalable solution for tens of billions of documents.
Problem
In https://github.com/hashgraph/hedera-mirror-node/issues/7860, the pg_roaringbitmap extension was explored. We should also look at standalone search databases to see if they provide a better off-the-shelf experience.
Solution
Alternatives
No response