lausycampari opened this issue 5 years ago
I'm also interested in this question...
I'm pretty sure this would be compatible with the idea of federated search, such as Elasticsearch. The biggest challenge with this approach is developing a protocol to share the results of crawling without having to essentially redo the crawling work. There are a couple of protocols out there, but they either fail to do this effectively or fail to assign weights to different aspects of a page, losing much of the information in the HTML.
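To make that concrete, here is a minimal Python sketch of the kind of record a crawler could publish so a peer can index a page without re-fetching it. The field names are made up for illustration, not an existing protocol; the point is that per-element weights from the HTML survive the exchange:

```python
from dataclasses import dataclass, field


@dataclass
class CrawlResult:
    """One shared crawl result (hypothetical format, not a real protocol)."""
    url: str
    fetched_at: str        # ISO 8601 timestamp of the crawl
    content_hash: str      # lets peers detect duplicates without re-fetching
    # Terms keyed by the HTML element they came from, with weights, so
    # structural signals (title vs. heading vs. body) are not lost.
    weighted_terms: dict = field(default_factory=dict)


page = CrawlResult(
    url="https://example.org/",
    fetched_at="2020-01-01T00:00:00Z",
    content_hash="sha256:...",
    weighted_terms={
        "title": {"example": 3.0},
        "h1": {"domain": 2.0},
        "body": {"reserved": 1.0},
    },
)
```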
Another approach to this issue would be to separate servers by functionality. The part of the system that is absolutely critical to keep together is the website metadata, so a separate database server would be the first part of this solution. Another server, or multiple servers, could do the crawling and feed the database over the network. Yet another server could handle web functions, such as supplying a web interface for users (possibly a shared server) or accessing the database via an API.
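A rough Python sketch of that split, assuming a hypothetical HTTP API sitting in front of the metadata database (the host name and endpoints below are placeholders): crawler boxes push page metadata in, and the web tier only reads it back out.

```python
import json
import urllib.parse
import urllib.request

DB_API = "http://db-server.internal:8080"  # hypothetical metadata service


def submit_page(metadata: dict) -> None:
    """Crawler side: send one crawled page's metadata to the database server."""
    req = urllib.request.Request(
        f"{DB_API}/pages",
        data=json.dumps(metadata).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(req)


def search(query: str) -> list:
    """Web front-end side: query the same service instead of holding any data locally."""
    url = f"{DB_API}/search?q={urllib.parse.quote(query)}"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)
```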
Sharedobjects? This is the first I’ve heard of that. It’s generally better to use simpler, more efficient, or more mainstream software rather than a more novel idea, unless some new feature of the newer idea adds measurable value. I’m not familiar with this structure, so I don’t have a reason to use Sharedobjects.
A cluster? Databases can be clustered and copy data between nodes synchronously or asynchronously, similar to how I understand Sharedobjects to work. However, the amount of data involved would make keeping a copy on each search or web server impractical. Besides, a single dedicated database server could easily handle the transaction load by itself for a sizable cluster of web servers, and existing clustering configurations for a dedicated database cluster can further expand scalability to dozens or hundreds of web servers.
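For illustration, here is what that looks like from the web tier, assuming a PostgreSQL back end with the psycopg2 driver (host names are placeholders): the web servers keep no local copy of the data, they just send writes to the primary and spread reads across replicas.

```python
import itertools
import psycopg2

PRIMARY_DSN = "host=db-primary dbname=search user=web"
REPLICA_DSNS = [
    "host=db-replica1 dbname=search user=web",
    "host=db-replica2 dbname=search user=web",
]

_replicas = itertools.cycle(REPLICA_DSNS)


def write_conn():
    """All writes (crawler inserts and updates) go to the single primary."""
    return psycopg2.connect(PRIMARY_DSN)


def read_conn():
    """Web-tier reads round-robin across the replicated nodes."""
    return psycopg2.connect(next(_replicas))
```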
Well, nothing is impossible in a digital world. How do you think FB or others manage their DB among hundreds of thousands of servers? Shared Objects is a feature that's about 10 years old; it's more recent in JS, but it exists in Java, ActionScript, etc.
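For what it's worth, the usual trick at that scale is partitioning (sharding) the data across servers rather than keeping a full copy everywhere. A toy Python sketch of hash-based sharding, with made-up shard names:

```python
import hashlib

SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2", "db-shard-3"]


def shard_for(url: str) -> str:
    """Map a URL to the single shard responsible for storing its metadata."""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]


print(shard_for("https://example.org/"))  # always resolves to the same shard
```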
I also forgot the torrent protocol, which could also be interesting to explore.
Is it possible to scale the crawler module and/or search module across multiple computers, all concurrently operating on the same data set? (similar to Elasticsearch, for example). If not, a work-around would be to mount a networked file-system, and set that as the data-path, but would this cause any problems with the software that you're aware of (besides the obvious increase in read/write latency)?