Rmode causing crashes every 3 hours on random gpus. 12 navi/big navi

ghost commented 2 years ago

[Screenshot: Screenshot_20220528-120435_Hive OS]

No matter what I do, this rig keeps crashing in R-mode, whereas my smaller rigs do not. With R-mode I'm saving about 50 W and gaining some MH/s. I've noticed that the GPUs now drop their wattage pretty low from time to time while mining with R-mode (to about 60% of normal usage). Before R-mode there were no issues. I've tried many different OCs on all cards. Regardless of what I do, a random GPU goes dead after about 3 hours.

ghost commented 2 years ago

This is on v0.10. I never tried R-mode on the previous version.

ghost commented 2 years ago

[Screenshot: Screenshot_20220602-071711_Hive OS]

It's absolutely inexplicable, and it's always just a random GPU. This rig has 4 6800s, 1 6800 XT, 1 6700 XT, 6 5700s or 5700 XTs, and 1 3080. Everything is tuned correctly. If I put the 6800s (XT excluded, of course) in A-mode, they get 60 MH/s while the Navi 10s run in R-mode, much like they used to run without fclk boosts or before TRM added better support for these GPUs on Linux. I'm also running R-mode with no issues on smaller rigs.

In my opinion, R-mode just doesn't work properly with larger rigs, OR possibly with a mix of different cards... but the latter makes zero sense, so I think it just sucks with larger rigs for now 😆. My settings are really good. I boosted the voltage a little above typical R-mode settings to see if it does anything, but it doesn't. Even with all cards set to stock, albeit with some undervolting, they still crash every few hours for no apparent reason.

GYKrauss commented 2 years ago

Also having a problem in a mixed system with an NVIDIA GPU trying to run lolminer. Crashes all day, does not work.

Daniel140220 commented 2 years ago

I am having the same random GPU dead errors on all my RX580 and RX6600XT rigs with v0.10.0. And I cannot see any replies from TRM team! Very frustrating. Previous versions were stable.

ghost commented 2 years ago

Yeah I mean I would at least like to work with them to fix this issue and provide more data. Or have it at least addressed, I know this isn't a one off. I never had any issues prior to v0.10 and always recommend trm to anyone using amd. I'm fairly confident in saying it's a software issue. Or a mix of something that our rigs have combined with the software triggering it. Like shit I'll let them have access to my hive so we can work it out and they can investigate further without having to replicate the problem themselves to dev up a solution.

DeafEyeJedi commented 2 years ago

Did you guys reduce the core clocks by 150-200?

ghost commented 2 years ago

Yes, and I've personally tried tons of different combinations which all gave the same result. It's been 3 hours so I'm due for a crash any minute now actually 😆

blackmennewstyle commented 2 years ago

In my experience, it just means your overclock settings are not stable. Also check the memory temperature as you get close to the 3-hour mark :)

Daniel140220 commented 2 years ago

But they were perfectly stable before the 0.10 update. I was never getting the DEAD GPU warning before! And why are we getting these errors 6 or 12 hours after rig re-start?

My conclusion is that TRM 0.10 needs a fix.

CamJava007 commented 2 years ago

I absolutely agree with you. I never had an issue with any other version... if I switch back to 0.9.4.2, everything works fine.
I see these comments everywhere... so there is clearly an issue with 0.10. ALSO... 0.10.1 is no better.

Daniel140220 commented 2 years ago

I don't know what the criteria are for closing this... this is an ongoing issue and, to my knowledge, it has not been solved by the TRM team.

blackmennewstyle commented 2 years ago

Have you guys had a look at this doc? https://github.com/todxx/teamredminer/blob/master/doc/ETHASH_SMOOTH_POWER.md It describes a new feature introduced in TRM 0.10.0, and it sounds like it could be related to your issue.

CamJava007 commented 2 years ago

It's not just R-mode... it's all modes. Not just Linux, Windows too. It's so bad I had to find another program. I really hope they fix this.

ghost00769 commented 2 years ago

I'm the OP; this issue got closed due to account deletion. I'm currently testing R-mode with smooth power removed, so in a day or two I'll know if that works and will post back here. Obviously any version before 0.10 works well, and 0.10+ works on some systems and not on others. It seems more reasonable that the problem is related to smooth power and not to R-mode. Almost all of my crashes have absolutely zero reason to happen, so it seems logical to me that smooth power just sucks the D on some cards or rig buildouts.

Edit: each version after the original 0.10 results in 6-10 hours until a crash, as opposed to 3-5, and smooth power was the only thing altered in those later versions that applies to me.
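One way to compare time-to-crash between versions or settings is to log when each crash happens and look at the gaps. The following is a minimal Python sketch, not something from TRM or Hive OS: the function name `hours_between_crashes` is hypothetical, and the timestamps are made-up placeholders rather than data from this rig.

```python
from datetime import datetime

def hours_between_crashes(crash_times):
    """Return the gaps, in hours, between consecutive crash timestamps.

    crash_times: ISO-style strings such as "2022-07-01 03:00:00",
    collected by hand or from whatever log the rig keeps (assumption).
    """
    times = sorted(datetime.fromisoformat(t) for t in crash_times)
    return [(later - earlier).total_seconds() / 3600
            for earlier, later in zip(times, times[1:])]

# Placeholder timestamps for illustration only.
crashes = ["2022-07-01 00:00:00", "2022-07-01 03:00:00", "2022-07-01 07:30:00"]
gaps = hours_between_crashes(crashes)
print(gaps)                   # [3.0, 4.5]
print(sum(gaps) / len(gaps))  # 3.75
```

Running this once per miner version or setting gives comparable numbers instead of a rough impression of how long the rig stayed up.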

GYKrauss commented 2 years ago

It should be a beta feature IMHO until this is resolved.

blackmennewstyle commented 2 years ago

Like I already stated, if your rig always crashes after a certain amount of time, it generally means your current overclock settings are not stable. A perfectly stable mining rig should easily be able to mine for days without crashing, within an acceptable temperature range. In the article, the devs clearly provided a couple of great insights about Smooth Power and why it could be an issue, especially for Navi GPUs.

ghost00769 commented 2 years ago

No one here has had issues due to OC. SMOOTH POWER IS THE ISSUE. I wish I had noticed their note about it, but I wouldn't have thought it could cause this terrible behavior. I had read the part of the article about what smooth power does, but not the bottom, so thanks for pointing it out. R-mode is awesome and I can't wait to dial it in over the next 2 or 3 days, but smooth power is not awesome yet. If you are having these issues, use --eth_smooth_power=0 to disable it on your rig. You can also choose which GPUs to enable it on by using 1 for those GPUs, e.g. --eth_smooth_power=1,0,1,0,0,1.
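For rigs with many cards, typing the per-GPU list by hand is error-prone, so the argument can be generated. This is a minimal Python sketch, not part of TRM: the helper name `smooth_power_arg` is hypothetical, and it assumes the comma-separated values follow the miner's GPU order, which this thread does not confirm.

```python
def smooth_power_arg(enabled_per_gpu=None):
    """Build the --eth_smooth_power argument quoted in this thread.

    enabled_per_gpu: list of booleans, one per GPU (assumed to be in the
    miner's device order). None means "disable on the whole rig".
    """
    if enabled_per_gpu is None:
        return "--eth_smooth_power=0"
    if not enabled_per_gpu:
        raise ValueError("need at least one GPU entry")
    values = ",".join("1" if on else "0" for on in enabled_per_gpu)
    return f"--eth_smooth_power={values}"

# Disable smooth power everywhere.
print(smooth_power_arg())
# → --eth_smooth_power=0

# Enable it only on GPUs 0, 2 and 5 of a six-card rig.
print(smooth_power_arg([True, False, True, False, False, True]))
# → --eth_smooth_power=1,0,1,0,0,1
```

The resulting string would be appended to the miner's extra arguments. Check the device order the miner reports at startup before relying on the per-GPU form.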

Here's what I'm currently looking at for stability. 1200 seems to be a good value for 6800s and up. I'm going to bump the 5700s back up a little on memory and drop voltages until I'm under 70 W.

todxx / teamredminer

Rmode causing crashes every 3 hours on random gpus. 12 navi/big navi #618