Reactive solution to DaggerHashimoto low hashrate on one of GPUs - interaction benchmark, & missing fallback server domains

scscgit commented 4 years ago

Hi, I've had this hashrate issue of DaggerHashimoto with GPUs like RX580 for a very long time, and I haven't seen it addressed by anyone yet. It occurs in both Phoenix and ClaymoreDual in a similar way.

When running multiple miners at once, they seem to cause some kind of conflict with each other, and one of the two GPUs usually only produces between 4-11 Mh/s, while the other one has the correct 22-26 Mh/s. If I start either of them individually, then the issue doesn't occur. The most straightforward workaround is to simply choose a different algorithm for a second GPU, so there's no conflict (though sadly, the DaggerHashimoto is currently way more profitable than an alternative). However, it's often a matter of luck, as sometimes the miners manage to both start properly and produce the max hashrate.

My suggestion is to add a NiceHash feature that compares the current hashrate against a baseline, and attempts various hotfixes. For example, there's usually only one instance of a miner running both GPUs, but there could be an attempt to periodically restart the "second instance" while the hashrate is obviously too low. (There may also be some other well known solution.) Note that some users may even prefer to have their computer restart in this scenario.
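To make the suggestion concrete, here is a minimal sketch of such a watchdog in Python. It is not NiceHash code: the names `WatchdogState` and `update_watchdog`, the 20% tolerance, and the "three low samples in a row" rule are all assumptions chosen for illustration. The caller would feed it one hashrate reading per polling interval and restart the miner instance whenever it returns `True`.

```python
from dataclasses import dataclass


@dataclass
class WatchdogState:
    """Counts consecutive low-hashrate samples for one device (hypothetical helper)."""
    low_samples: int = 0


def is_underperforming(current_mhs: float, baseline_mhs: float,
                       tolerance: float = 0.20) -> bool:
    """True when the live hashrate is more than `tolerance` below the baseline."""
    return current_mhs < baseline_mhs * (1.0 - tolerance)


def update_watchdog(state: WatchdogState, current_mhs: float, baseline_mhs: float,
                    tolerance: float = 0.20, samples_before_restart: int = 3) -> bool:
    """Record one sample; return True when a miner restart should be attempted.

    A restart is requested only after several consecutive low samples, so a
    single noisy reading does not trigger it. The counter resets afterwards.
    """
    if is_underperforming(current_mhs, baseline_mhs, tolerance):
        state.low_samples += 1
    else:
        state.low_samples = 0

    if state.low_samples >= samples_before_restart:
        state.low_samples = 0
        return True
    return False


# Example: baseline of 24 MH/s, then the GPU drops into the 4-11 MH/s range.
state = WatchdogState()
decisions = [update_watchdog(state, reading, 24.0) for reading in (24.0, 8.0, 9.0, 7.0)]
```

Requiring several low samples before acting is a design choice to avoid restart loops during normal DAG generation or short-lived dips; the right window length would have to be tuned per miner.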

Alternatively, NiceHash should also be able to handle such situations by automatically measuring if any two algorithms conflict with each other's hash rate, and switch algos by re-evaluating the benchmark, e.g. if the second instance provides only 0.012 instead of 0.035 mBTC/day, and the second best benchmark does 0.025, then it should simply switch. (My personal preference would be to periodically retry the original algo e.g. every few hours.)
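The switching rule described above can be sketched as follows. This is only an illustration under stated assumptions: `pick_algorithm` and `should_retry_original` are hypothetical names, profits are in mBTC/day as in the example, `"SecondBest"` is a placeholder algorithm name, and the three-hour retry interval is an arbitrary stand-in for "every few hours".

```python
from typing import Dict


def pick_algorithm(current_algo: str, measured_profit: float,
                   benchmark_profit: Dict[str, float]) -> str:
    """Return the algorithm to run next.

    `measured_profit` is the live profitability of `current_algo`;
    `benchmark_profit` maps every algorithm to its benchmarked profitability.
    Switch only if the best alternative's benchmark beats what we actually earn.
    """
    alternatives = [algo for algo in benchmark_profit if algo != current_algo]
    if not alternatives:
        return current_algo
    best_alt = max(alternatives, key=lambda algo: benchmark_profit[algo])
    if benchmark_profit[best_alt] > measured_profit:
        return best_alt
    return current_algo


def should_retry_original(seconds_since_switch: float,
                          retry_interval_s: float = 3 * 3600) -> bool:
    """True once enough time has passed to give the original algorithm another try."""
    return seconds_since_switch >= retry_interval_s


# Numbers from the example: 0.035 benchmarked, 0.012 measured, 0.025 second best.
benchmarks = {"DaggerHashimoto": 0.035, "SecondBest": 0.025}
choice = pick_algorithm("DaggerHashimoto", 0.012, benchmarks)
```

With the numbers above, the second-best algorithm wins because 0.025 is more than the 0.012 actually being earned, even though DaggerHashimoto has the higher benchmark.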

By the way, the interaction could be benchmarked even more directly, as dual algos like DaggerHashimoto + Decred (ClaymoreDual) adjust their parameters differently under those circumstances. For example, I've had a benchmark speed of 24.309 MH/s + 0.732 GH/s, but running it together with the other algo produced ETH: GPU0 3.328 Mh/s, DCR - Total Speed: 99.827 Mh/s. This means that whenever a negative algo interaction gets detected, it may be worth re-benchmarking the other algos, because their parameters may adjust in a way that hugely increases profitability. (Plus it's weird how we can't even trust the "saved" dcri parameter, but I hope the auto-switching takes this into consideration.)

As a separate issue, this is also a very important concern from the auto-benchmark perspective, because the result is non-deterministic: some algos (each time it can be on a different GPU) measure e.g. 11 Mh/s instead of 24 Mh/s. Maybe an optional, experimental feature of "auto-benchmark every device alone" could aid some people here :)
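One way an "every device alone" benchmark could be used is to compare each device's solo result with its result when all devices run together, and flag the ones that lose too much. The sketch below assumes hashrates have already been collected into dictionaries keyed by a device name; the function name `find_interaction_victims`, the device labels, and the 20% threshold are hypothetical.

```python
from typing import Dict, List


def find_interaction_victims(solo_mhs: Dict[str, float],
                             concurrent_mhs: Dict[str, float],
                             max_loss: float = 0.20) -> List[str]:
    """Return devices whose concurrent hashrate fell more than `max_loss` below solo.

    Devices missing from the concurrent run, or with a non-positive solo
    result, are skipped because no meaningful ratio can be computed.
    """
    victims = []
    for device, solo in solo_mhs.items():
        together = concurrent_mhs.get(device)
        if together is None or solo <= 0:
            continue
        loss = 1.0 - together / solo
        if loss > max_loss:
            victims.append(device)
    return victims


# Both GPUs reach about 24 MH/s alone, but one drops to 11 MH/s when run together.
solo = {"GPU0": 24.0, "GPU1": 24.0}
together = {"GPU0": 11.0, "GPU1": 23.5}
affected = find_interaction_victims(solo, together)
```

Because the affected GPU can differ from run to run, a single comparison like this is only a hint; repeating the concurrent run a few times would give a more trustworthy picture.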

Nevertheless, NiceHash currently lacks notifications even about issues as simple as the hashrate not matching the benchmarked values (e.g. beyond a 20% tolerance), but I assume that's on your TODO list; after all, it'd be easy to watch this even via the cloud. When it comes to the new version 3.0.0.3, I'd also hope to see the USD profitability on the Devices tab. There are features from v1.9 that we've lost for no good reason. Btw. I'll also note that this algo interaction was probably also behind some BSODs, but that's out of scope for NiceHash; I'm just hoping that someone else could share their solutions if there are any.

//EDIT: Based on experimentation, it seems that a huge factor of instability could be Fan Tuning when using the latest Radeon Software. It's located under Performance/Tuning with Manual Tuning Control. Not only is this setting completely invisible under MSI Afterburner, such that you won't notice it's turned on until you go check it, but even worse, it always gets turned on after a computer restarts (due to a crash). As a result, once you get a first crash, you will get stuck in a crashing loop. Even though this fix seems to improve the stability for me, it doesn't fix it altogether. By the way, I've recently had even worse issues while I've been using MSI Afterburner's "user defined software automatic fan control", so if you have similar issues, check that too. (I hope NiceHash can implement some fan speed measures in the future, because nobody wants to damage their fans by spinning them at 100% speed, which usually happens under default fan curve.)
//EDIT 2: I've (accidentally) found out that I had another severe HW issue: my ATX 24-pin power cable had both of its +12V connectors burnt out, which made the PC unable to boot after touching or pulling on the cable even a little bit. I'm not really sure whether this was also the cause of my fan-related issues, but my desktop seems a lot more stable with a new cable (e.g. no more CRITICAL_PROCESS_DIED BSODs). As a side note, I'd like to ask whether your developers know a way to detect such severe issues (e.g. the last time I had a similar problem with a PSU, I used software like HWiNFO64 and noticed a drop in a related PSU voltage metric). If there is an option like this, I'd like to suggest integrating a similar troubleshooting feature at some point in the future. I'm sure many other users would benefit from such debugging. I've also found out that the mining instability of one of my two AMD GPUs (often running at only around 30% of the max. performance) got fixed by plugging the DisplayPort monitor cable into the Intel iGPU instead of an AMD card. As long as the monitor is connected to an AMD card, that card's performance drops rapidly.

There's also been an issue with your server: XMRig has a zero hashrate and says: [randomxmonero.eu.nicehash.com:3380] read error: "end of file". Considering there are locations [eu, usa, hk, jp, in, br] specified at nicehash.com/algorithm, I would very much like to ask why there are no fallback servers specified in the client. This should switch automatically. Not only do we lose profit whenever this happens, but switching the location manually in the client also makes us prone to forgetting to switch it back later on :)
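A rough sketch of what automatic fallback could look like is given below. Several things here are assumptions rather than known facts: that the other locations follow the same `<algo>.<location>.nicehash.com` hostname pattern as the `eu` host in the error message, that they use the same port, and that a plain TCP connection is a good enough health check. The function names are hypothetical and are not part of NiceHash or XMRig.

```python
import socket
from typing import Callable, List, Optional

# Location codes as listed in the post; the order after the preferred one is arbitrary.
LOCATIONS = ["eu", "usa", "hk", "jp", "in", "br"]


def candidate_hosts(algo: str, preferred: str,
                    locations: Optional[List[str]] = None) -> List[str]:
    """Build hostnames with the preferred location first, then the remaining ones.

    The hostname pattern is an assumption extrapolated from the eu host.
    """
    locations = LOCATIONS if locations is None else locations
    ordered = [preferred] + [loc for loc in locations if loc != preferred]
    return [f"{algo}.{loc}.nicehash.com" for loc in ordered]


def tcp_probe(host: str, port: int, timeout_s: float = 5.0) -> bool:
    """Very rough reachability check: can a TCP connection be opened at all?"""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


def first_reachable(hosts: List[str], port: int,
                    try_connect: Callable[[str, int], bool] = tcp_probe) -> Optional[str]:
    """Return the first host for which `try_connect` succeeds, or None if all fail."""
    for host in hosts:
        if try_connect(host, port):
            return host
    return None


hosts = candidate_hosts("randomxmonero", "eu")
```

Because the preferred location is always tried first, the client would return to it on its own on the next reconnect, which avoids the "forgot to switch back" problem. Note that an "end of file" read error happens after the connection is established, so a real implementation would probably need to check for a working stratum session rather than just an open socket.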

almartins commented 4 years ago

Hi, very good point. I've seen that happen too, and there is still no solution from NiceHash in the new versions.

jwesolo commented 4 years ago

@scscgit I'm responding to your server problem only (though I like the suggestions you made). The "end of file" is what usually seems to happen when there are no orders for a particular algorithm. Usually it resolves almost immediately when new orders come in. Does the problem persist for a long time?

I've had a ton of issues with DaggerHashimoto server errors recently, though. This is when running dual Ethash/Eaglesong with both GMiner and NBMiner (I believe it also happened with Phoenix mining Ethash only, but I'm not certain). It usually takes about 20 attempts before it connects. I was getting the "Malformed server message" error over and over again. This isn't the place to report server issues, though, so I didn't create an issue for it.

scscgit commented 4 years ago

@jwesolo I've had the issue for several hours up to maybe a day, but it has resolved itself already (I've switched the server setting in the meantime). I don't think there were no orders at that moment, but in that case NiceHash should also notify us (to switch the algo). I haven't had DaggerHashimoto server errors though.

By the way, today I've noticed that even my otherwise stable desktop (with an older AMD driver, which supports my old atikpatcher unlike the latest driver) dropped from 18+ to 8 Mh/s, so I've had to manually restart the miner to get both GPUs back to the max hashrate. Once again, NiceHash should track this kind of issue and restart the miner automatically.

StevenWiner commented 4 years ago

I had the same issue running 2x RX580s, and the fix was to go into the Radeon software and change both GPUs to Compute mode as opposed to Graphics mode. Radeon software had defaulted to Graphics mode on one GPU.

nicehash / NiceHashMiner

Reactive solution to DaggerHashimoto low hashrate on one of GPUs - interaction benchmark, & missing fallback server domains #1956