gigablast / open-source-search-engine

Nov 20 2017 -- A distributed open source search engine and spider/crawler written in C/C++ for Linux on Intel/AMD. From gigablast dot com, which has binaries for download. See the README.md file at the very bottom of this page for instructions.
Apache License 2.0
1.54k stars 441 forks

750k search results in 16 hours but disk activity is too high + ~ 50GB total disk usage on 5 instances #122

Open unisqu opened 7 years ago

unisqu commented 7 years ago

750k search results in 16 hours, but disk activity is too high, with ~50GB total disk usage across 5 instances on the same computer, distributed over 2 HDDs...

Is this normal? The disk activity is too high. I do have 24GB of RAM available, though, on an Ubuntu desktop.

The disk activity keeps the CPU load average at around 50, on a 4-core CPU.

turbo commented 7 years ago

This is normal disk activity. The privacore fork of this engine uses a different method of merging data which reduces IOPS, but not by much. A search engine is basically a glorified map-reduce task, and these need high IOPS.

Your best bet is to get a few enterprise SAS SSDs that don't do excessive garbage collection. You can either rent or buy such a server. The minimum would be a few SATA SSDs; the best would be a RAID of NVMe PCIe SSDs (like these servers). However, for scalability it would be better to use GCE, AWS, or Linode/SSDNodes virtual servers.
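To illustrate the point about merging (this is a minimal sketch of the general k-way merge pattern, not Gigablast's actual RdbMerge code): every merge pass re-reads and re-writes each record, so total disk traffic ends up being a multiple of the index size.

```cpp
// Sketch: k-way merge of sorted runs, the access pattern behind the heavy
// disk activity. In memory here; on disk, every heap pop would be a buffered
// read from one run file and every output record a buffered write.
#include <cstdint>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Merge k sorted runs into one sorted run using a min-heap keyed on the
// smallest unconsumed record of each run.
std::vector<int64_t> kWayMerge(const std::vector<std::vector<int64_t>>& runs) {
    // Item = (key, (run index, position within run)).
    using Item = std::pair<int64_t, std::pair<size_t, size_t>>;
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> heap;
    for (size_t r = 0; r < runs.size(); ++r)
        if (!runs[r].empty())
            heap.push({runs[r][0], {r, 0}});
    std::vector<int64_t> out;
    while (!heap.empty()) {
        auto [key, pos] = heap.top();
        heap.pop();
        out.push_back(key);                       // one "write" per record
        auto [r, i] = pos;
        if (i + 1 < runs[r].size())
            heap.push({runs[r][i + 1], {r, i + 1}}); // one "read" per record
    }
    return out;
}
```

Since every record passes through the merge on each pass, k runs of total size N cost roughly N reads plus N writes per pass, which is why merge-heavy engines hammer disks regardless of how fast the CPU is.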

unisqu commented 7 years ago

How can I reduce IOPS on this? What's a good way to ensure IOPS is reduced? I've cut the number of crawlers down to 50. How do I calculate IOPS usage? What is actually happening...

turbo commented 7 years ago

You can't reduce the IOPS without rewriting the code that uses the disk. The admin panel should give you an extensive overview of the timing of each operation (write, fetch, merge). This allows you to optimize your settings, but it isn't a cure-all or even guaranteed to have any effect whatsoever. Check out the privacore branch for an example of how to rewrite for lower IOPS. You could also inquire about the Pro version of Gigablast.
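To actually measure IOPS outside the admin panel, you can sample the kernel's per-device counters. A minimal sketch, assuming Linux's `/proc/diskstats` layout (field 4 = reads completed, field 8 = writes completed); the helper names `sampleDiskOps` and `measureIops` are made up for this example, not part of Gigablast:

```cpp
// Estimate per-device IOPS by sampling /proc/diskstats twice and taking the
// delta of completed reads + writes over the interval.
#include <chrono>
#include <fstream>
#include <map>
#include <sstream>
#include <string>
#include <thread>

// One snapshot: cumulative completed reads + writes per block device.
std::map<std::string, unsigned long long> sampleDiskOps() {
    std::map<std::string, unsigned long long> ops;
    std::ifstream f("/proc/diskstats");
    std::string line;
    while (std::getline(f, line)) {
        std::istringstream ss(line);
        unsigned long long major, minor, reads, rMerged, rSectors, rMs, writes;
        std::string dev;
        if (ss >> major >> minor >> dev >> reads >> rMerged >> rSectors
               >> rMs >> writes)
            ops[dev] = reads + writes;  // completed I/O operations so far
    }
    return ops;
}

// Average IOPS per device over `secs` seconds.
std::map<std::string, double> measureIops(int secs) {
    auto before = sampleDiskOps();
    std::this_thread::sleep_for(std::chrono::seconds(secs));
    auto after = sampleDiskOps();
    std::map<std::string, double> iops;
    for (const auto& [dev, count] : after) {
        auto it = before.find(dev);
        if (it != before.end())
            iops[dev] = double(count - it->second) / secs;
    }
    return iops;
}
```

The same numbers are what `iostat -x 1` (from the sysstat package) reports in its r/s and w/s columns, which is the quicker way to watch this live while the spider runs.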

unisqu commented 7 years ago

I've read the privacore code; its merging approach is not very elegant. It's too complicated.

Where can I find out about the Pro version of Gigablast?

unisqu commented 7 years ago

The way my current disks are being hammered, how long will they last running open source Gigablast?

unisqu commented 7 years ago

I'm almost hitting 1 million records in a day now. 1 million a day is very slow, but it makes me wonder how many crawlers Google uses daily.