nlnwa / gowarcserver

Apache License 2.0

Allow distribution of gowarcservers with a "parent->child" relationship #4

Closed Avokadoen closed 3 years ago

Avokadoen commented 3 years ago

Based on a meeting with @maeb, who had an idea for a potential direction to improve gowarcserver.

Is your feature request related to a problem? Please describe.

This will solve two problems.

  1. In the Loke GUI, all collections are listed on the main page. With a series of collections it becomes cumbersome to find a given WARC record, because you either have to know which collection holds the record or search each collection manually
  2. Optimize gowarcserver by distributing indexing and searching

Describe the solution you'd like

Gowarcserver network diagram

We can structure gowarcservers like a tree. Each node in the tree can hold records and N child nodes. Command-line arguments or the config file should let you point the gowarcserver being started at its child nodes. When the server receives a query, it should process the query itself while also asking all of its children to do the same. How it should handle results is left undefined for now, e.g. cancelling the requests to children and returning the first item found, or waiting for all children to answer before aggregating the result. It's important to note that, based on the diagram, the only difference between a parent node and a leaf node is that the leaf node has no registered children. Programmatically they should be identical.
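To illustrate the "parent and leaf are identical" point, here is a minimal in-memory sketch. The `Node` type, its fields and `Find` are hypothetical names made up for this example and are not part of gowarcserver; a real node would talk to its children over the network rather than through pointers.

```go
package main

import "fmt"

// Node is a hypothetical in-memory stand-in for one gowarcserver instance.
// A leaf is simply a Node whose Children slice is empty.
type Node struct {
	Name     string
	Records  map[string]string // record ID -> payload; illustrative only
	Children []*Node
}

// Find looks in this node's own records and then asks every child to do the
// same. It returns the names of all nodes that hold the record.
func (n *Node) Find(id string) []string {
	var holders []string
	if _, ok := n.Records[id]; ok {
		holders = append(holders, n.Name)
	}
	for _, child := range n.Children {
		holders = append(holders, child.Find(id)...)
	}
	return holders
}

func main() {
	news := &Node{Name: "news", Records: map[string]string{"rec-1": "..."}}
	blogs := &Node{Name: "blogs", Records: map[string]string{"rec-2": "..."}}
	root := &Node{Name: "root", Children: []*Node{news, blogs}}

	// The client only ever talks to the root node.
	fmt.Println(root.Find("rec-2"))
}
```

This version waits for every child and aggregates everything, which is only one of the result-handling options mentioned above.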

Problem 1 will be solved by introducing the concept of a parent-child relation. It will allow us to set up a network of servers where a root instance aggregates queries across the whole gowarcserver network. Loke will only have to know about the root, so the end user no longer has to care about which collection contains the target record.

Problem 2 will be solved by sending each query to the node itself and to its children concurrently, using goroutines, and aggregating the answers, which should let queries scale as the amount of data grows. Indexing of records will also be distributed without tying it to a topic or area (e.g. all indexing of newspapers having to be central).

It's worth noting that this will introduce greater complexity to the codebase, and that abusing the tree structure might lead to slower results, since requests are chained and latency grows with tree depth.

This will also open up future optimizations. Examples could be caching common queries when no changes have been made in the db, or skipping nodes when we already know the target node for a query.

Additional context

Google's talk about Go servers (mainly from slide 33 onward)

Potential API: http://timetravel.mementoweb.org/guide/api/

Avokadoen commented 3 years ago

Implementation idea: To avoid implementations that lock containers into a single pod, the containers should communicate using HTTP(S) requests. This also helps with security: a container does not need to know more than its children's URLs, which allows for sandboxing if required.

I'm not very familiar with Kubernetes, so my initial idea for implementing this is to have a static URL that points to an API describing the whole deployed hierarchy. Kubernetes would have to start by spinning up this service first. The nodes should handle being spawned before the service, e.g. by polling or by using some mechanism in Kubernetes. When a node is spawned, it requests its children from the service, and the service then notifies the node's parent about the new child (the service knows the API of the parent node and so can just send the child URL to the parent node). Authentication is important here to avoid hijacking attacks. For our use the risk of this is low, but it should be accounted for anyway. This would be a single-point-of-failure design though ...

As stated above, my knowledge of Kubernetes is limited, so Kubernetes might already provide all or some of this functionality.

Resources to learn more:

Avokadoen commented 3 years ago

Standup MVP: The node network only has to be configured through Kubernetes configs. The simplest solution would then be to use an environment variable with the child URLs.
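A minimal sketch of reading the children from the environment. The variable name `CHILD_URLS` and the comma-separated format are assumptions for this example, not an existing gowarcserver setting; `parseChildURLs` is likewise a made-up helper.

```go
package main

import (
	"fmt"
	"net/url"
	"os"
	"strings"
)

// parseChildURLs splits a comma-separated list and validates each entry.
// An empty string yields no children, which makes the node a leaf.
func parseChildURLs(raw string) ([]*url.URL, error) {
	var children []*url.URL
	for _, part := range strings.Split(raw, ",") {
		part = strings.TrimSpace(part)
		if part == "" {
			continue
		}
		u, err := url.Parse(part)
		if err != nil {
			return nil, fmt.Errorf("invalid child URL %q: %w", part, err)
		}
		if (u.Scheme != "http" && u.Scheme != "https") || u.Host == "" {
			return nil, fmt.Errorf("child URL %q must be an absolute http(s) URL", part)
		}
		children = append(children, u)
	}
	return children, nil
}

func main() {
	// CHILD_URLS is a hypothetical variable name chosen for this sketch.
	children, err := parseChildURLs(os.Getenv("CHILD_URLS"))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("configured %d child node(s)\n", len(children))
}
```

In a Kubernetes deployment the variable would be set per container in the pod spec, so the tree shape lives entirely in the deployment config.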

Also: veidemann-cache has similar behavior to what is needed

Avokadoen commented 3 years ago

Another resource: https://matthewpalmer.net/kubernetes-app-developer/articles/kubernetes-networking-guide-beginners.html