jtgrassie / monero-pool

A Monero mining pool server written in C
BSD 3-Clause "New" or "Revised" License

Server stops responding for 10+ minutes each time there is a network seed hash upgrade (~2.8 days) #101

Closed · gzz2000 closed this issue 3 years ago

gzz2000 commented 3 years ago

As the title suggests, I have observed this instability pattern on xmrvsbeast.com. From a quick look at the code, I suspect it is because the RandomX implementation keeps switching the current seed hash between the old one and the new one, due to mixed share submissions in the short period after the upgrade; each switch costs seconds to finish and renders the server unusable. But I'm not sure this is the right cause.

jtgrassie commented 3 years ago

Cannot reproduce. And as the pool uses the actual Monero implementation of rx_seedheight & rx_slow_hash, if you're sure (i.e. have isolated/confirmed it in gdb/lldb) that this is your problem (and not some other server [mis]configuration), report it in the Monero project. But as I said, I cannot reproduce what you're experiencing.

Although not reproduced, what you've explained makes some sense. E.g. on a slower machine with a high number of miners submitting shares across a seed height boundary, the rx algorithm will be toggling.
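
To make the suspected mechanism concrete, here is a minimal sketch of a single-cache design that re-initializes on every seed flip. It uses the stock RandomX API from randomx.h, not the pool's actual rx_slow_hash path, so treat it as an illustration of the toggling cost rather than this pool's code:

```c
/* Sketch of the suspected pathology (stock RandomX API, light mode):
 * one cached seed means every share from the other side of a seed
 * boundary forces a cache re-init, which takes on the order of seconds. */
#include <string.h>
#include "randomx.h"

static randomx_cache *cache;    /* one cache, keyed by current_seed */
static char current_seed[32];
static int seed_set;

static void hash_share(const char seed[32], const void *blob, size_t len,
                       char out[RANDOMX_HASH_SIZE])
{
    randomx_flags flags = randomx_get_flags();  /* light mode: no dataset */
    if (!cache)
        cache = randomx_alloc_cache(flags);
    if (!seed_set || memcmp(seed, current_seed, 32) != 0) {
        /* The expensive step. With shares arriving for both the old and
         * the new seed just after an upgrade, this runs on nearly every
         * share, so the server spends almost all its time re-initializing. */
        randomx_init_cache(cache, seed, 32);
        memcpy(current_seed, seed, 32);
        seed_set = 1;
    }
    randomx_vm *vm = randomx_create_vm(flags, cache, NULL);
    randomx_calculate_hash(vm, blob, len, out);
    randomx_destroy_vm(vm);
}
```

A cache re-init takes seconds; rebuilding a full dataset takes far longer, so flipping repeatedly across a boundary could plausibly stall the server for the duration described.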

gzz2000 commented 3 years ago

> Cannot reproduce. And as the pool uses the actual Monero implementation of rx_seedheight & rx_slow_hash, if you're sure (i.e. have isolated/confirmed it in gdb/lldb) that this is your problem (and not some other server [mis]configuration), report it in the Monero project. But as I said, I cannot reproduce what you're experiencing.
>
> Although not reproduced, what you've explained makes some sense. E.g. on a slower machine with a high number of miners submitting shares across a seed height boundary, the rx algorithm will be toggling.

I think we can just make sure submissions to the old height get dropped (maybe recognized as invalid shares? or just returning OK but ignoring them), as they no longer contribute to the mining process. This would also lower the number of orphaned blocks. I did not reproduce this myself; I have just observed it on the public pool many times. There's no need to modify the RandomX implementation in the Monero project, though, as its responsibility is to compute hashes correctly, even if that means toggling.

jtgrassie commented 3 years ago

> I think we can just make sure submissions to the old height get dropped (maybe recognized as invalid shares? or just returning OK but ignoring them), as they no longer contribute to the mining process. This would also lower the number of orphaned blocks.

I think that's a bad idea. You may be throwing away a valid block (more work) that causes a reorg. Thus you still want to check the hash of recent heights/shares.

Better options would be to run the pool on faster hardware, or possibly to run the hash function in light mode (dataset init will be quicker but the hash function overall will be slower, so careful measurement is needed), or to use multiple datasets.
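
One hedged sketch of that last direction, shown here with light-mode caches keyed by seed (the same two-slot pattern would apply to full datasets); this is illustrative, not the pool's code. Keeping one RandomX cache per seed lets shares on both sides of a boundary hash without any re-initialization:

```c
/* Sketch (stock RandomX API): keep two caches, one per seed, so shares
 * on either side of a seed-height boundary never force a re-init. */
#include <string.h>
#include "randomx.h"

struct seed_slot {
    char seed[32];
    randomx_cache *cache;
};

static struct seed_slot slots[2];   /* the old seed and the new seed */
static int next_evict;              /* which slot to recycle on a miss */

static randomx_cache *cache_for_seed(const char seed[32])
{
    for (int i = 0; i < 2; i++)
        if (slots[i].cache && memcmp(slots[i].seed, seed, 32) == 0)
            return slots[i].cache;              /* hit: no init cost */

    struct seed_slot *s = &slots[next_evict];
    next_evict ^= 1;
    if (!s->cache)
        s->cache = randomx_alloc_cache(randomx_get_flags());
    randomx_init_cache(s->cache, seed, 32);     /* paid once per seed,
                                                   not once per share */
    memcpy(s->seed, seed, 32);
    return s->cache;
}
```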

> I did not reproduce this myself

So this is just all hypothetical / guess work then?

gzz2000 commented 3 years ago

> > I think we can just make sure submissions to the old height get dropped (maybe recognized as invalid shares? or just returning OK but ignoring them), as they no longer contribute to the mining process. This would also lower the number of orphaned blocks.
>
> I think that's a bad idea. You may be throwing away a valid block (more work) that causes a reorg. Thus you still want to check the hash of recent heights/shares.
>
> Better options would be to run the pool on faster hardware, or possibly to run the hash function in light mode (dataset init will be quicker but the hash function overall will be slower, so careful measurement is needed), or to use multiple datasets.
>
> > I did not reproduce this myself
>
> So this is just all hypothetical / guess work then?

Yes, they are guesses; as you said, it takes a lot of miners submitting at once to reproduce, which might only happen on a large pool. I looked at nodejs-pool and they just throw away old-height shares, so this might not be a problem there. I'm closing this issue, as there is not enough evidence to act on.

jtgrassie commented 3 years ago

> I looked at nodejs-pool and they just throw away old-height shares, so this might not be a problem there.

Wrong. There is a circular buffer (pastBlockTemplates) of 4 past block templates. It throws away shares for which there is no template in the buffer at the job height. It would be wrong not to check recent shares against recent (not tip) heights, as you could be throwing away valid blocks.
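
For illustration, the pastBlockTemplates idea translated into a C sketch (nodejs-pool itself is JavaScript, and the types here are hypothetical, not this pool's): shares whose job height has aged out of the ring are dropped, while anything still in the ring is fully checked, since it could be a valid block on a recent height.

```c
#include <stddef.h>

#define PAST_TEMPLATES 4

struct block_template;              /* opaque here; stands in for the real type */

static struct {
    unsigned long height;
    struct block_template *tmpl;
} past[PAST_TEMPLATES];
static unsigned next_slot;

/* Called whenever a new template is fetched from monerod. */
static void remember_template(unsigned long height, struct block_template *t)
{
    past[next_slot].height = height;
    past[next_slot].tmpl = t;
    next_slot = (next_slot + 1) % PAST_TEMPLATES;
}

/* NULL means the share's job height has aged out of the ring: drop it.
 * A non-NULL template means the share is recent and must still be fully
 * validated, since it could be a valid block at a recent (non-tip) height. */
static struct block_template *template_for_height(unsigned long height)
{
    for (unsigned i = 0; i < PAST_TEMPLATES; i++)
        if (past[i].tmpl && past[i].height == height)
            return past[i].tmpl;
    return NULL;
}
```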

gzz2000 commented 3 years ago

> > I looked at nodejs-pool and they just throw away old-height shares, so this might not be a problem there.
>
> Wrong. There is a circular buffer (pastBlockTemplates) of 4 past block templates. It throws away shares for which there is no template in the buffer at the job height. It would be wrong not to check recent shares against recent (not tip) heights, as you could be throwing away valid blocks.

Ah yes, I see. Then why don't they face an outage problem at the seed hash boundary... it's weird. You are right that we should not throw away old shares, as they might become valuable. (Maybe it is favorable to only do so at the seed hash boundary, which does not happen frequently.) Anyway, thanks for your explanation.

jtgrassie commented 3 years ago

> Ah yes, I see. Then why don't they face an outage problem at the seed hash boundary... it's weird.

Well as we've established, you're guessing there's a problem with the pool implementation and guessing what's causing it.

> Maybe it is favorable to only do so at the seed hash boundary, which does not happen frequently.

Maybe it's best to do nothing until a problem is found, isolated, and thus addressable?

I manage 2 large private pools and do not experience what you describe. That's not to say your guesses are wrong, just that there's not enough information to act on.

gzz2000 commented 3 years ago

> > Ah yes, I see. Then why don't they face an outage problem at the seed hash boundary... it's weird.
>
> Well as we've established, you're guessing there's a problem with the pool implementation and guessing what's causing it.
>
> > Maybe it is favorable to only do so at the seed hash boundary, which does not happen frequently.
>
> Maybe it's best to do nothing until a problem is found, isolated, and thus addressable?
>
> I manage 2 large private pools and do not experience what you describe. That's not to say your guesses are wrong, just that there's not enough information to act on.

I agree.

jtgrassie commented 3 years ago

> Monerod remains responsive however.

Except the log you showed has the monerod RPC returning an HTTP failure.

jtgrassie commented 3 years ago

> Have a script running every minute to try to catch this condition; it always returns 200: `curl -o /dev/null -s -w "%{http_code}\n" http://127.0.0.1:18081/json_rpc`

That doesn't even call an RPC method, so it is a useless test.
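
For reference, a probe that actually invokes an RPC method, e.g. monerod's get_block_count, would look like `curl -s http://127.0.0.1:18081/json_rpc -H 'Content-Type: application/json' -d '{"jsonrpc":"2.0","id":"0","method":"get_block_count"}'`. A 200 status with a JSON result from that call is a meaningful health check; a bodyless GET to /json_rpc only tells you the HTTP server is answering.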