
Optimize contender selection #75

Closed vjabuilds closed 3 days ago

vjabuilds commented 6 days ago

Overview

Resolves https://github.com/namoray/nineteen/issues/70

The main idea behind the employed approach is to begin leveraging historical data to serve the best possible contenders for a given task. To achieve this goal, a dual approach was adopted, using both historical data about the given contender itself and data about similar contenders. We also wanted to add some seasoning to the system.

Approach & Methodology

Before explaining how the process itself works, we have to consider the underlying problem we are trying to solve: data scarcity, i.e., how to make data-driven decisions in the potential absence of high-quality data. This effectively means we need to develop strategies which allow us to use the data immediately available for a given contender, while still being able to give sane recommendations in situations where such data isn't available.

When we initially boot up the system and no data is available, we cannot, by definition, make truly data-driven decisions. In this case, the newly implemented approach behaves exactly as before: it "load balances" tasks across contenders based solely on the number of requests previously processed by each contender.
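As a rough illustration of this fallback, here is a minimal sketch; the `Contender` type and `processed_requests` field are hypothetical names for illustration, not the repository's actual ones:

```python
from dataclasses import dataclass

@dataclass
class Contender:
    node_id: str
    task: str
    processed_requests: int  # requests this contender has already served

def rank_cold_start(contenders: list[Contender]) -> list[Contender]:
    # With no historical data, fall back to plain load balancing:
    # contenders that have processed fewer requests are ranked first.
    return sorted(contenders, key=lambda c: c.processed_requests)
```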

If we have historical data for a given contender, we aggregate it into a metric of the contender's behavior. The data is aggregated as an exponentially decaying sum, with the terms falling off by a given factor across specified time intervals. If a contender was absent during certain intervals, its score is treated as 0 for those intervals. The resulting metric is then multiplied by a tunable weight factor and subtracted from the original rank (the number of processed requests), resulting in a potential re-ranking of the contenders.
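A sketch of this aggregation follows; the `decay_factor` and `weight` defaults below are illustrative assumptions, not the actual tuned values:

```python
def decayed_metric(interval_scores: list[float], decay_factor: float) -> float:
    # Exponentially decaying sum over per-interval scores.
    # interval_scores[0] is the most recent interval; intervals in which
    # the contender was absent should be passed in as 0.0.
    return sum(score * decay_factor ** i for i, score in enumerate(interval_scores))

def adjusted_rank(processed_requests: int, interval_scores: list[float],
                  decay_factor: float = 0.5, weight: float = 10.0) -> float:
    # The weighted historical metric is subtracted from the original rank
    # (the number of processed requests), so a strong history ranks a
    # contender earlier when sorting ascending.
    return processed_requests - weight * decayed_metric(interval_scores, decay_factor)
```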

Finally, this PR assumes that there is a degree of similarity between models. Consider, for instance, models belonging to a family such as Llama. There are many different models available in a given family, many of which vary only in their weights (e.g., censored vs. uncensored variants). It is safe to assume that, if a given miner node is effective at serving requests for one of these models, it is most likely capable of serving most (if not all) of them. And while we can assume that nodes running more powerful models can safely run smaller ones as well, we can't assume the opposite. To capture this use case, the similarity metrics are designed to be potentially asymmetric.
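One way to represent such an asymmetric relation is a mapping keyed by ordered task pairs; the table below is purely illustrative, including the task names and values:

```python
# Keys are (source_task, target_task) pairs, so similarity need not be
# symmetric: history from a node serving a large model says a lot about its
# ability to serve a smaller sibling, but not the other way around.
TASK_SIMILARITY: dict[tuple[str, str], float] = {
    ("chat_llama_3_1_70b", "chat_llama_3_2_3b"): 0.9,  # large -> small: strong
    ("chat_llama_3_2_3b", "chat_llama_3_1_70b"): 0.2,  # small -> large: weak
}

def similarity(source_task: str, target_task: str) -> float:
    # An absent entry is treated as 0, as described below.
    return TASK_SIMILARITY.get((source_task, target_task), 0.0)
```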

Now, assume that a node is serving multiple models in parallel, i.e., there are multiple contenders running on a single node. The system now allows admins to register "task similarities", expressed as a scalar value between 0 and 1. The absence of a task similarity metric between two tasks is treated as 0. When calculating the overall metric for a given contender, the system takes into account not only the metrics associated with the contender itself but also those of all similar contenders running on the same node. The final, global metric of a contender is calculated as the sum of the metric generated from the contender's own data and the weighted metrics of all similar contenders.
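Continuing the sketch above (and reusing its hypothetical `similarity` helper), the global metric might be computed along these lines:

```python
def global_metric(task: str, same_node_tasks: list[str],
                  metrics: dict[str, float]) -> float:
    # Own decayed metric plus the similarity-weighted metrics of all other
    # contenders on the same node; missing data counts as 0.
    total = metrics.get(task, 0.0)
    for other in same_node_tasks:
        if other != task:
            total += similarity(other, task) * metrics.get(other, 0.0)
    return total

# Cold start: a task with no history of its own still gets a non-zero metric
# through a 50%-similar sibling, e.g. with metrics = {"sibling": 0.8} and
# similarity("sibling", "new_task") == 0.5, the global metric is 0.4.
```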

Edge Cases

This strategy involves the following edge cases:

- No historical data at all (a fresh boot): the system falls back to the previous behavior, load balancing purely on processed request counts.
- A contender absent during some intervals: its score is treated as 0 for those intervals.
- No registered similarity between two tasks: the similarity is treated as 0.
- A brand-new contender with no history of its own but with data for similar contenders on the same node: it inherits a similarity-weighted metric (see the cold start scenario in the test below).

Test examples

In order to test the system, a new Dockerfile, docker-compose file and orchestration script were implemented. To run the test, simply execute:

sh run_test.sh

Before running the test itself, please make sure you have an empty database, since the test container will delete all rows in the tables it uses.

This test will create all infrastructure services and start the test container. The container follows the flow of the actual query_node itself, simulating the process of creating and storing new nodes, tasks and contenders. To effectively unit test this functionality, database seeding and destruction operations were implemented.

In each epoch, the system will create one new contender for each of the provided tasks, each running on a separate node. Finally, in the last epoch, the system will create a new type of contender to simulate the cold start scenario, where no historical data is available for the contender itself, but data for similar contenders is.

At the end, the program will print out the ranked contenders, along with the values of the global aggregated metrics. Please note that even though the model CHAT_LLAMA_3_1_800B was added in the very last iteration (and thus has no historical data), it still has a non-zero metric, owing to the fact that the related model CHAT_LLAMA_3_2_3B has available metrics and is considered 50% similar to the new model. In fact, the resulting metric should be close to half of the CHAT_LLAMA_3_2_3B metric.