valeriansaliou / sonic

🦔 Fast, lightweight & schema-less search backend. An alternative to Elasticsearch that runs on a few MBs of RAM.
https://crates.io/crates/sonic-server
Mozilla Public License 2.0

Bulk index speed is slow #290

Open amirEBD opened 2 years ago

amirEBD commented 2 years ago

Hi everyone at @sonic, I was looking for alternatives to Elasticsearch because of resource usage issues on our ES cluster, and Sonic looked very useful as an alternative. So I decided to run some benchmarks and see how it would perform. The problem I'm now facing is that bulk index insertion is much slower than in other search engines like Elasticsearch/Zinc. I'm using the go-sonic client to bulk-push data into Sonic, and it took about 1-2 hours to index the data below! Should I switch to the NodeJS client, for example?

Data size: about 50 MB
Doc count: 2M (strings pushed as the text field)

Note: I used the same config for Sonic as shown on the GitHub page, and ran Sonic in Docker.

Thanks for any help :)

PS: As a comparison, Elasticsearch indexed about 5 GB of data in 1 hour in my tests.

valeriansaliou commented 2 years ago

You could try the NodeJS client: https://github.com/valeriansaliou/node-sonic-channel, which is official and for which I've measured performance to be rather good, yes.
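For reference, a minimal ingestion sketch with node-sonic-channel could look like the following. The host, port, password, and the collection/bucket/object names are placeholders, and the channel setup follows the library's README; treat it as a sketch rather than code from this thread:

```ts
import { Ingest as SonicChannelIngest } from "sonic-channel";

// Open an ingest channel to the Sonic server (connection details are placeholders).
const ingest = new SonicChannelIngest({
  host: "127.0.0.1",
  port: 1491,
  auth: "SecretPassword"
}).connect({
  connected() {
    // Push one object's text into collection "messages", bucket "default".
    ingest
      .push("messages", "default", "doc:1", "some text to index")
      .then(() => console.log("pushed"))
      .catch((err: Error) => console.error("push failed", err));
  },
  disconnected() { console.warn("disconnected"); },
  timeout()      { console.warn("timed out"); },
  retrying()     { console.warn("retrying"); },
  error(err: Error) { console.error("connection error", err); }
});
```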

amirEBD commented 2 years ago

Thanks for your suggestion. I tried the NodeJS client, but it didn't meet my expectations either!

Is there any way to push lots of data into Sonic for testing purposes? There is no bulk example code in the node client's GitHub repo, just one ingest.js which sends a single push to the server.

valeriansaliou commented 2 years ago

The NodeJS library splits the text data into sub-command chunks, so that definitely works. Though you should maybe pre-split your data before pushing.

Sonic was built for chat message indexing + email indexing at first, which is why everything is centered around small chunks of data.

In other words, it is intended that inserting 1M messages results in 1M+ commands (a bit more, considering some messages are larger than the max chunk size, but the NodeJS library handles splitting for you, based on the server's dynamically-provided buffer size).
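To illustrate that one-PUSH-per-message model, a bulk loop can be as simple as the sketch below. The collection/bucket names and record shape are made up for the example, and the `IngestChannel` type is just a minimal stand-in for the library's ingest channel:

```ts
// Minimal structural type for what this sketch needs from the ingest channel:
// a push() that resolves once the server has acknowledged the PUSH command.
type IngestChannel = {
  push(collection: string, bucket: string, object: string, text: string): Promise<void>;
};

interface DocRecord {
  id: string;
  text: string;
}

// Push records one by one: N records means N (or slightly more) PUSH commands,
// since the library splits any text larger than the server-advertised buffer.
async function bulkPush(ingest: IngestChannel, records: DocRecord[]): Promise<void> {
  for (const record of records) {
    await ingest.push("messages", "default", record.id, record.text);
  }
}
```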

valeriansaliou commented 2 years ago

To maximize speed, note that you should split the work between multiple NodeJS instances, each running the ingestion on its own split of your data. Say you have 4 cores on the server running the ingestion script: you would split your data in 4 and run 4 NodeJS instances to push that data to Sonic. This is because each ingestion channel can be seen as a synchronous command channel, blocking for a few microseconds on each PUSH command (see the sketch after this comment).

On the Sonic server end (on another server), to maximize ingestion speed, you should also ensure you have as many CPUs as there are data-producer NodeJS instances (as Sonic spawns 1 thread per Sonic Channel opened over TCP by clients), plus some spare CPUs for the RocksDB internal threads to do their work.

Also adjust your config.cfg accordingly to make full use of your Sonic server's resources.

That way you can max out both your importer server's and the Sonic server's capacity. Also make sure everything is running on fast SSDs.
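A rough sketch of this split-and-parallelize setup, assuming one NodeJS process per data shard and newline-delimited JSON shard files. The file layout, collection/bucket names, and connection details are assumptions for the example, not something specified in this thread:

```ts
// ingest-shard.ts: compile and run one instance per shard (one per CPU core), e.g.:
//   node ingest-shard.js shards/part-0.ndjson
//   node ingest-shard.js shards/part-1.ndjson
import { createReadStream } from "node:fs";
import { createInterface } from "node:readline";
import { Ingest as SonicChannelIngest } from "sonic-channel";

const shardPath = process.argv[2];

const ingest = new SonicChannelIngest({
  host: "sonic.internal", // placeholder Sonic host
  port: 1491,
  auth: "SecretPassword"
}).connect({
  connected: async () => {
    // Each line of the shard is assumed to be `{"id": "...", "text": "..."}`.
    const lines = createInterface({ input: createReadStream(shardPath) });

    for await (const line of lines) {
      const { id, text } = JSON.parse(line);
      // One synchronous channel per process: each PUSH blocks briefly,
      // which is why parallelism comes from running several processes.
      await ingest.push("messages", "default", id, text);
    }

    // End the channel once this shard has been fully pushed.
    await ingest.close();
  },
  disconnected: () => {},
  timeout: () => {},
  retrying: () => {},
  error: (err: Error) => console.error("channel error", err)
});
```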

amirEBD commented 2 years ago

Thanks for your detailed explanation. I tested go-sonic with simple pushes, which showed better performance.