Horizontal scaling across multiple nodes

jaeksoft / opensearchserver

Open-source Enterprise Grade Search Engine Software

http://www.opensearchserver.com

Apache License 2.0


Horizontal scaling across multiple nodes #1916

Open lausycampari opened 5 years ago

lausycampari commented 5 years ago

Is it possible to scale the crawler module and/or the search module across multiple computers, all operating concurrently on the same data set (similar to Elasticsearch, for example)? If not, a workaround would be to mount a networked file system and set that as the data path. Would this cause any problems with the software that you're aware of, besides the obvious increase in read/write latency?

ROBERT-MCDOWELL commented 5 years ago

I'm also interested in this question.

jelutz77 commented 5 years ago

I'm pretty sure that this would be compatible with the idea of a Federated search, such as Elasticsearch. The biggest challenge to this approach is developing a protocol to share the results of crawling without having to essentially do the work of crawling again. There are a couple of protocols out there that fail to do this effectively, or fail to assign weights to different aspects of a page, losing much of the information in HTML.

jelutz77 commented 5 years ago

Another approach to this issue would be to separate servers based on their functionality. The part of the system that absolutely must be kept together is the website metadata, so a separate database server would be the first part of this solution. One or more other servers could do the crawling and feed the database over the network. Yet another server could handle web functions, such as serving a web interface for users (possibly on a shared server) or accessing the database via an API.
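To make the split concrete, here is a minimal sketch of that role separation. It is not OpenSearchServer code: `MetadataStore`, `CrawlerNode`, `WebNode`, and `fake_fetch` are hypothetical names, and the "database server" is just an in-memory dictionary standing in for a real networked database.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataStore:
    """Stand-in for the dedicated database server (hypothetical, in-memory)."""
    pages: dict = field(default_factory=dict)

    def upsert(self, url, title, text):
        # Insert or overwrite the metadata for one crawled page.
        self.pages[url] = {"title": title, "text": text}

    def search(self, term):
        # Naive substring match; a real engine would use an inverted index.
        term = term.lower()
        return sorted(
            url for url, page in self.pages.items()
            if term in page["title"].lower() or term in page["text"].lower()
        )


class CrawlerNode:
    """One of possibly many crawl workers; it only writes to the store."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def crawl(self, url, fetch):
        # fetch is injected so the sketch runs without network access.
        title, text = fetch(url)
        self.store.upsert(url, title, text)


class WebNode:
    """Front end that only reads from the store."""

    def __init__(self, store):
        self.store = store

    def query(self, term):
        return self.store.search(term)


def fake_fetch(url):
    # Hypothetical page contents used instead of real HTTP requests.
    pages = {
        "http://example.com/a": ("Scaling", "Horizontal scaling across nodes"),
        "http://example.com/b": ("Crawling", "How the crawler feeds the index"),
    }
    return pages[url]


store = MetadataStore()
CrawlerNode("crawler-1", store).crawl("http://example.com/a", fake_fetch)
CrawlerNode("crawler-2", store).crawl("http://example.com/b", fake_fetch)
results = WebNode(store).query("crawler")
```

The point of the sketch is only that crawlers and web nodes never talk to each other: both depend on the shared store, so either tier could be scaled out independently as long as the store keeps up.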

ROBERT-MCDOWELL commented 5 years ago

"The biggest challenge to this approach is developing a protocol to share the results of crawling without having to essentially do the work of crawling again"
Why not SharedObjects?

ROBERT-MCDOWELL commented 5 years ago

"The biggest challenge to this approach is developing a protocol to share the results of crawling without having to essentially do the work of crawling again"
Why not SharedObjects?
"And another server could perform web functions, such as supply a web interface for users (possibly a shared server), or access the database via API."
Maybe the concept of a cluster would be more efficient: use a UDP-based protocol (like a DNS server does) to share everything new or modified instantly. The SharedObjects would work out which part has changed and pass only the new or modified bytes to the stream.
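As a rough illustration of "pass only the new or modified bytes", here is a small sketch of a byte-level delta. It is not SharedObjects itself and not anything OpenSearchServer provides: `make_delta` and `apply_delta` are hypothetical helpers built on Python's standard `difflib`, and the transport is left out entirely. UDP gives no delivery or ordering guarantees, so a real cluster would need acknowledgements on top of it.

```python
from difflib import SequenceMatcher


def make_delta(old: bytes, new: bytes):
    """Describe `new` as copies from `old` plus the literal bytes that changed."""
    ops = []
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # The receiver already has these bytes; send only the range.
            ops.append(("copy", i1, i2))
        elif j2 > j1:
            # "replace" or "insert": these bytes must actually be sent.
            ops.append(("data", new[j1:j2]))
        # "delete" needs no entry: the bytes are simply not copied.
    return ops


def apply_delta(old: bytes, ops) -> bytes:
    """Rebuild the new version on the receiving node from its old copy."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += old[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)


old_doc = b"Horizontal scaling across multiple nodes"
new_doc = b"Horizontal scaling across many nodes"

delta = make_delta(old_doc, new_doc)
rebuilt = apply_delta(old_doc, delta)
bytes_sent = sum(len(op[1]) for op in delta if op[0] == "data")
```

Here `bytes_sent` counts only the literal payload, which is smaller than the full new document whenever the two versions share content. For large indexes a block-hash scheme in the style of rsync would likely scale better than a full diff.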

jelutz77 commented 5 years ago

Sharedobjects? This is the first I’ve heard of that. It’s generally better to use simpler, or more efficient, or more mainstream software rather than the more novel idea unless there is some new feature of the newer idea that adds measurable value. I’m not familiar with this structure so I don’t have a reason to use Sharedobjects.

jelutz77 commented 5 years ago

A cluster? Databases can be clustered, and copy data between nodes synchronously or asynchronously, sort of how I understand Sharedobjects work. However, the amount of data involved would make keeping a copy on each search or web server impractical. Besides, a single dedicated database server would easily be able to handle the transaction load by itself for a sizable cluster of web servers. Existing clustering configurations for a dedicated database cluster can further expand scalability to dozens or hundreds of web servers.

ROBERT-MCDOWELL commented 5 years ago

Well, nothing is impossible in a digital world. How do you think FB and others manage their DBs among hundreds of thousands of servers? Shared Objects is a 10-year-old feature; it is more recent in JS, but it exists in Java, ActionScript, etc.

ROBERT-MCDOWELL commented 5 years ago

I also forgot the torrent protocol, which could also be interesting to explore.