spacemeshos / post-rs

Rust implementation of POST proving

Feature request: remote compute server (proving k2pow) #343

Open tjb-altf4 opened 3 weeks ago

tjb-altf4 commented 3 weeks ago

Request to allow offloading of k2pow to a separate server. This problem was partially solved by the 1:N post-service feature; however, it still relies on compute being executed where the storage exists, or on network shares with suboptimal performance.

The idea of this feature is to allow separation of concerns between a low-power storage server, such as an Intel N100, and a high-power compute node, such as a gaming computer or a dedicated ex-enterprise server. This would keep providing security to the network while lowering electricity costs for smeshers.

As background information, this feature was added to h9-miner earlier in the year. I would like to see the feature introduced in the official software to reduce the incentive to move to h9-miner and to help support "free range" netspace.

poszu commented 3 weeks ago

Hi @tjb-altf4, thanks for sharing the idea. I'm adding ideas for a possible high-level design below.

Requirements:

  1. The default should remain unchanged, i.e. the k2pow should be calculated in the post-service if not configured otherwise. This is the most basic setup; it should not require any additional steps to use, and we should not break existing setups.
  2. It should be possible to configure the post-service (via config/CLI) to use an external service (possibly GRPC) for calculating the k2pow.
  3. The k2pow-service should persist calculated k2pows (e.g. in an embedded KV store) to avoid redoing the heavy work after crash/restart etc.
  4. The k2pow might take a considerable time to finish. The post-service should not keep a long-running GRPC connection to the k2pow-service, but should poll for the result at short intervals.
  5. The k2pow-service should use CPU resources optimally: many cores working on a single PoW, rather than multiple PoWs running in parallel.

The interface between the node and the post-service would remain unchanged. That is, the node requests PoST proof generation from the post-service and polls for the result at intervals. The post-service creates the k2pow according to its configuration (either by calculating it itself or by requesting it from the external service). It then continues with PoST proving.
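For illustration, a minimal sketch of the post-service side under these requirements; `K2powSource` and `wait_for_k2pow` are illustrative names and assumptions, not a committed API:

```rust
use std::time::Duration;

/// Where the post-service obtains the k2pow. The default stays
/// in-process (requirement 1); the remote GRPC service is opt-in
/// via config/CLI (requirement 2). Illustrative sketch only.
enum K2powSource {
    /// Compute in the post-service itself (current behavior).
    Local,
    /// Delegate to an external k2pow-service.
    Remote { url: String },
}

/// Requirement 4: instead of holding a long-lived GRPC connection,
/// ask the k2pow-service for the result at short intervals.
fn wait_for_k2pow(poll: impl Fn() -> Option<u64>, interval: Duration) -> u64 {
    loop {
        if let Some(pow) = poll() {
            return pow;
        }
        std::thread::sleep(interval);
    }
}
```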

tjb-altf4 commented 3 weeks ago

Thanks for responding @poszu, great to see requirements coming together already.

I’d suggest considering whether the k2pow service should manage and maintain its own work queue. If I understand correctly, you’re planning to manage k2pow state locally on this service; it would be beneficial to extend this state management to also queue incoming workloads, rather than running them all in parallel.

I believe this additional requirement would have the benefit of also resolving another Smesher UX improvements item (Simple orchestrator).

poszu commented 3 weeks ago

I’d suggest considering whether the k2pow service should manage and maintain its own work queue.

Yes, it totally makes sense to queue requests for calculating the k2pow, as it is a CPU-hungry task. I updated the requirements.

pigmej commented 6 days ago

I'd probably vote, though, for no queue at all and a sensible error returned instead.

Why?

Then it's easier to set up some "auto scaling" in front of the service itself (just based on requests). The requirement of keeping the state locally, just in case, is still valid.
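A minimal sketch of this "no queue" policy, assuming a single compute slot per worker (the names are illustrative):

```rust
use std::sync::Mutex;

/// A single compute slot guarded by try_lock. A request that arrives
/// while a k2pow is running gets an immediate "busy" error, letting a
/// load balancer or autoscaler in front react, instead of the worker
/// queueing internally. Illustrative only.
struct Worker {
    slot: Mutex<()>,
}

impl Worker {
    fn try_compute(&self, job: impl FnOnce() -> u64) -> Result<u64, &'static str> {
        match self.slot.try_lock() {
            Ok(_guard) => Ok(job()),                // we hold the only slot
            Err(_) => Err("busy: try again later"), // another k2pow is running
        }
    }
}
```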

acud commented 6 days ago

Hi there :wave:

@poszu I'm trying to figure out exactly which code needs to be pulled out into a separate service. Am I correct in understanding that the PoW implementation in src/pow is the one that needs to be encapsulated in a service?

The k2pow might take a considerable time to finish. The post-service should not keep a long-running GRPC connection to the k2pow-service, but should poll for the result at short intervals.

Does this mean that we should assign a GUID to every job that is queued up for calculation? We should then probably also support create/get/delete operations for every job, correct?

Also, Re: grpc:

  • could the new service just live as a cargo workspace?
  • could the grpc contracts live in this codebase? or do we need to go through the api repo?

The k2pow-service should use CPU resources optimally: many cores working on a single PoW, rather than multiple PoWs running in parallel.

Would it be possible to have a bit more specifics here? I'm not sure I fully understand this in practice. RandomX doesn't really give the ability to configure anything around the CPU AFAIU. How would you envision this optimal resource usage?

Then it's easier to set up some "auto scaling" in front of the service itself (just based on requests). The requirement of keeping the state locally, just in case, is still valid.

How would you know how much to scale the service? Also, if you assume that the service is load-balanced, all requests to execute anything must be blocking calls (otherwise, how would the caller know how to land the follow-up call on the same node?), and it isn't clear how to do that. Maybe some more specifics here would help.

pigmej commented 6 days ago

About scaling.

And then we gain two things: you can use existing tooling easily in clustered setups, as all you need to do in k8s, for example, is add a horizontal scaler; and the workers themselves stay simple.

If, however, we do the queueing on the worker side, then we need to implement all the signaling for a full queue, etc. IMO an unnecessary complication.

So to sum up, I think the best overall is:

  • client-side retry logic with retry & timeout (short intervals, many retries, small jitter)
  • a worker accepts only one client: it accepts the request, generates a gid, returns the gid, starts computing, and finishes with the result; it is then ready to serve the result for that gid while accepting a new computation request. So there will be only one computation and many "results" served at the same time.
  • workers should generate gids with some sensible prefix, so one can set up an LB that directs the requests for a particular gid based on the prefix.
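A minimal sketch of the prefixed-gid idea in the last bullet; the prefix scheme and names are illustrative assumptions:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Each worker stamps its own prefix on the gids it hands out, so a
// load balancer can route follow-up "get result" requests back to the
// right worker by prefix. WORKER_PREFIX and the counter are
// illustrative, not the actual scheme.
static COUNTER: AtomicU64 = AtomicU64::new(0);
const WORKER_PREFIX: &str = "worker-a"; // unique per worker instance

fn next_gid() -> String {
    format!("{}-{:08x}", WORKER_PREFIX, COUNTER.fetch_add(1, Ordering::Relaxed))
}
```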

pigmej commented 6 days ago

Would it be possible to have a bit more specifics here? I'm not sure I fully understand this in practice. RandomX doesn't really give the ability to configure anything around the CPU AFAIU. How would you envision this optimal resource usage?

We allow up to X CPU threads for the RandomX computation. They're called workers in this codebase; one worker is one CPU thread.

@poszu checked, and on some CPUs it makes no sense to use more than Y cores, as beyond that it's not any faster because of the CPU architecture. Plus, AFAIK that limit was also per NUMA group.
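A sketch of capping the worker count, using only the standard library; the per-CPU/per-NUMA limit mentioned above would be a config knob on top of this (the function and its signature are illustrative assumptions):

```rust
use std::num::NonZeroUsize;
use std::thread;

/// Cap the number of RandomX workers (one worker = one CPU thread, as
/// described above). available_parallelism() counts logical CPUs;
/// illustrative only.
fn effective_workers(configured: usize) -> usize {
    let logical = thread::available_parallelism()
        .map(NonZeroUsize::get)
        .unwrap_or(1);
    configured.min(logical)
}
```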

acud commented 5 days ago

So to sum up, I think the best overall is:

  • client-side retry logic with retry & timeout (short intervals, many retries, small jitter)
  • a worker accepts only one client: it accepts the request, generates a gid, returns the gid, starts computing, and finishes with the result; it is then ready to serve the result for that gid while accepting a new computation request. So there will be only one computation and many "results" served at the same time.
  • workers should generate gids with some sensible prefix, so one can set up an LB that directs the requests for a particular gid based on the prefix.

Right, so my understanding here is that the solution is already opinionated about how it will be used with a load balancer. I'm just trying to make sure, because it sounds like it's not going to be self-contained. I.e., if you want to use the k2pow in an external-service configuration with more than one instance/worker, then you'll have to build an external service with a specific load-balancer configuration, and maybe other levels of tooling that would get the results and store them in redis. I guess the expectation is that the users would build that tooling? Or are we going to offer a complete solution?

pigmej commented 5 days ago

I think for now we can assume that it's good enough to have one instance that knows all.

If a queue makes it simpler, we can try it (we can always make a queue with size 1) and reply "queue full".

But in general, running more than one k2pow (RandomX) per CPU (not per thread, not per core) will not make it faster; it will likely even be slower.

Even with the queue, we don't need a delete operation, I guess.

The other possibility would be to build some MQ logic, or take something off the shelf like NATS, and do a proper fan-in/fan-out. But that sounds like overkill for the first iteration :)
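A sketch of the "queue with size 1" fallback using a bounded channel; `Job` and `submit` are illustrative placeholders:

```rust
use std::sync::mpsc;

struct Job; // placeholder for the k2pow input parameters

/// A bounded channel of capacity 1: try_send on a full queue fails
/// immediately, which maps straight to a "queue full" reply.
fn submit(tx: &mpsc::SyncSender<Job>, job: Job) -> Result<(), &'static str> {
    tx.try_send(job).map_err(|_| "queue full")
}

fn main() {
    let (tx, _rx) = mpsc::sync_channel::<Job>(1);
    assert!(submit(&tx, Job).is_ok());  // first job occupies the slot
    assert!(submit(&tx, Job).is_err()); // second is rejected: queue full
}
```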

poszu commented 3 days ago

@poszu I'm trying to figure out exactly which code needs to be pulled out into a separate service. Am I correct in understanding that the PoW implementation in src/pow is the one that needs to be encapsulated in a service?

Yes, probably into a separate crate, to avoid pulling GRPC- and CLI-related dependencies into the library; similarly to how the certifier and post services are done.

The k2pow might take a considerable time to finish. The post-service should not keep a long-running GRPC connection to the k2pow-service, but should poll for the result at short intervals.

Does this mean that we should assign a GUID to every job that is queued up for calculation? We should then probably also support create/get/delete operations for every job, correct?

No need for a GUID. The set of input parameters (nonce_group, challenge, difficulty, miner_id) can be used to identify the proving request. I think a simple API to start proving, returning IN PROGRESS|FINISHED(u64)|FAILED(message), would suffice.
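A sketch of that scheme; the field widths are assumptions for illustration, and the same key could index the embedded KV store from requirement 3:

```rust
/// The input parameters themselves identify a proving job, so no GUID
/// is needed; the same key can index the embedded KV store that
/// persists finished k2pows across restarts. Field widths are
/// assumptions, not the actual types.
#[derive(Clone, PartialEq, Eq, Hash)]
struct JobKey {
    nonce_group: u8,
    challenge: [u8; 8],
    difficulty: [u8; 32],
    miner_id: [u8; 32],
}

/// The three states of the simple start/poll API described above.
enum JobStatus {
    InProgress,
    Finished(u64),
    Failed(String),
}
```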

Also, Re: grpc:

  • could the new service just live as a cargo workspace?

Yes, a separate crate (see above for why).

  • could the grpc contracts live in this codebase? or do we need to go through the api repo?

I think we should keep the proto files in the api repo, similarly to the post-service (https://github.com/spacemeshos/api/blob/master/post/v1/service.proto).

The k2pow-service should use CPU resources optimally: many cores working on a single PoW, rather than multiple PoWs running in parallel.

Would it be possible to have a bit more specifics here? I'm not sure I fully understand this in practice. RandomX doesn't really give the ability to configure anything around the CPU AFAIU. How would you envision this optimal resource usage?

The k2pow prover uses rayon to compute RandomX hashes for multiple nonces in parallel. This is a CPU-heavy task, and there is no point in using more threads than there are CPU cores (or a configurable value). I think the best approach would be to run one PoW at a time using all cores, and decide whether to queue the other incoming requests or reject them with "try again later" (an UNAVAILABLE status perhaps? See: https://grpc.io/docs/guides/status-codes).
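A sketch of that rejection path, assuming the service ends up built with tonic (an assumption; only the choice of the UNAVAILABLE status code comes from the discussion above):

```rust
// Assumes the tonic crate as a dependency. UNAVAILABLE tells
// well-behaved clients "retry later", matching the client-side
// retry/poll logic discussed above. Illustrative only.
fn try_start(busy: bool) -> Result<(), tonic::Status> {
    if busy {
        return Err(tonic::Status::unavailable(
            "a k2pow is already being computed; try again later",
        ));
    }
    Ok(())
}
```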