integralfx / MemTestHelper

C# WPF to automate HCI MemTest

Notes to include in your guide #31

Closed nitorita closed 3 years ago

nitorita commented 3 years ago

I highly advise warning that file corruption is possible during RAM overclocking, and recommending that users run an sfc /scannow check after each change to make sure none has occurred. From personal experience, tRFC+tRP and RTL/IOLs can potentially cause corruption if overly tightened.
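
For reference, both of the standard Windows integrity checks are run from an elevated Command Prompt; sfc verifies protected system files, and DISM can repair the component store if sfc reports corruption it cannot fix:

```
sfc /scannow
DISM /Online /Cleanup-Image /RestoreHealth
```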

Through extensive testing, I have found that tRCD+tRP raises the IMC voltage requirement. Thus, I suggest tweaking them after all other timings are done rather than at the beginning; otherwise, instability caused by insufficient IMC voltage may be wrongly attributed to other timing tweaks that are actually stable. This is a good resource for common BSOD error codes: https://www.reddit.com/r/overclocking/comments/atwtt5/psa_bsod_codes_when_ocing_and_possible_actions/

When tRFC is too tight, the PC can freeze, which is worth mentioning so people keep it in mind. Also, increasing VDIMM allows tRFC to be tightened further. Here is a convenient chart of tRFC values to test (not mine): https://i.imgur.com/6Zg1MKy.png
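
As a side note, charts like the one linked usually give tRFC in nanoseconds, which has to be converted to memory clock cycles before it can be entered in the BIOS. A minimal sketch of that conversion (the 350 ns / DDR4-3600 numbers are purely illustrative):

```python
import math

def trfc_cycles(trfc_ns: float, ddr_rate_mts: int) -> int:
    """Convert a tRFC target in nanoseconds to memory clock cycles.

    The memory clock in MHz is half the DDR transfer rate (MT/s),
    so cycles = ns * memclk_mhz / 1000, rounded up.
    """
    memclk_mhz = ddr_rate_mts / 2
    return math.ceil(trfc_ns * memclk_mhz / 1000)

# e.g. a 350 ns tRFC target at DDR4-3600 (1800 MHz memory clock):
print(trfc_cycles(350, 3600))  # -> 630 cycles
```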

Because RAM is heat sensitive, it is recommended (for gamers) to perform a lengthy gaming test at a sustained 80%+ GPU usage to check whether the GPU's ambient heat will cause the PC to BSOD. I suggest a continuous test of at least six hours; I've had BSODs occur anywhere between one and five hours in.

The explanation for tRC is somewhat confusing; rewording it would be nice. An easy way to explain it (given that tRC = tRAS + tRP) would be: if the system boots with reduced tRFC+tRP and tRAS but cannot pass TM5, gradually raise tRAS.

The easy way to tighten RTLs and IOLs is to first train them on Auto and then, for every -1 decrease in an IOL, decrease the matching RTL by 1 as well (see the sketch below). It is sometimes more efficient and stable for all IOLs to be the same. Some motherboards do not like IOLs being manually changed, in which case you would increase the IOL Offset instead.
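
To make that pairing rule concrete, here is a minimal sketch of one tightening step; the trained starting values are made-up placeholders, and real boards will only POST with certain RTL/IOL combinations:

```python
# Hypothetical RTL/IOL values as trained on Auto (one entry per channel).
rtls = [53, 54, 55, 54]
iols = [7, 8, 7, 8]

def tighten_step(rtls, iols):
    """One tightening step: equalise the IOLs (moving each linked RTL
    by the same delta), then drop every RTL/IOL pair together by 1."""
    target_iol = min(iols)
    rtls = [r - (i - target_iol) for r, i in zip(rtls, iols)]
    iols = [target_iol] * len(iols)
    return [r - 1 for r in rtls], [i - 1 for i in iols]

rtls, iols = tighten_step(rtls, iols)
print(rtls, iols)  # apply in BIOS, then retest stability before the next step
```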

At the end of the guide, you could add a note that those with Samsung B-die or Micron B/E-die can experiment with raising their voltage beyond the 1.5V JEDEC spec, at their own risk and discretion. Up to 1.6V should be safe for most people, so long as they have proper cooling for their RAM. This allows a further reduction of CAS latency and increase in frequency.

You should advise people to run a benchmark such as AIDA64 after each change to make sure there is an actual improvement in performance. Many timings regress in performance if set too low.

integralfx commented 3 years ago

Also, increasing VDIMM allows tRFC to be tightened further.

Not on all ICs.

The explanation for tRC is somewhat confusing; rewording it would be nice.

Yeah I've had a few people tell me it was a bit confusing. I'll reword it.

At the end of the guide, you could add a note that those with Samsung B-die or Micron B/E-die can experiment with raising their voltage beyond the 1.5V JEDEC spec, at their own risk and discretion. Up to 1.6V should be safe for most people, so long as they have proper cooling for their RAM. This allows a further reduction of CAS latency and increase in frequency.

There are sticks rated at 1.65V XMP on the B550 Unify-X Renoir QVL, which suggests it's safe to daily that voltage, at least on Renoir. For Ryzen 3000/5000 there is a kit rated at 1.6V, so it's potentially safe to daily 1.6V.

You should advise people to run a benchmark such as AIDA64 after each change to make sure there is an actual improvement in performance.

Already in the guide in the tightening timings section.

Thanks for your notes. I'll add them in the guide.

nitorita commented 3 years ago

I did a bit more testing regarding specific timings. I tried super-tightening them just to see how things would turn out. (Note: for the AIDA64 tests I performed, I ran each test 30+ times to rule out variance.)

tRCD+tRP is both voltage and heat sensitive, which means that there is a sweet spot effect with VDIMM. To tighten them further, it is advisable to keep other timings looser in order to keep heat-related errors to a minimum. If you can boot after tightening them to a specific value but then see errors, try to reduce temperatures first before increasing VDIMM, as the errors might not mean there isn't enough voltage but simply that the RAM is too hot.

tWR can go below 10 (the lowest I could test was 5), but AIDA64 performance was basically the same as at 10, so I suggest 10 as the lowest practical value. Moreover, tightening tWR increases heat significantly, so I suggest keeping it loose until all other timings are done being tightened.

tRRD_S/tFAW can go all the way down to 1/4 or 2/8, but Read speeds take a steep penalty at those settings, and Copy speed is slightly reduced as well. Latency also increases slightly. However, if you set tRRD_S/tFAW to 3/12, Read speeds will increase by a decent amount and latency will drop significantly compared to 4/16. I suggest 3/12 as the Extreme target.

tWTR_S can go down to 1, but it performs basically the same as 2 while having incredibly volatile latency; 2 provides the best performance overall. tWTR_L can go down to 4 for better performance (that's the lowest I could test; it may go lower for others).

tRTP below 8 incurs an insane latency penalty. 8 should be the absolute lowest value to strive for.

tRDWR_## can cause freezing if it is too tight. tWRRD_dd can be dropped to the floor, but performance was worse and there were irregularities when using the PC; try to stay around 4 for that particular tertiary. All of the _dr tertiaries can be set to 0 if the user has single-rank memory.

tCKE can be set to 0, depending on the PC, but I've heard mixed reports from others regarding performance, so I highly suggest benchmarking to make sure performance doesn't worsen the lower you go.

tRAS sometimes doesn't like even numbers and refuses to POST with them. In that case, round up/down.

tREFI can be maxed out, but I don't recommend it for daily use. There are plenty of academic articles out there implying that it is potentially dangerous at higher values. It is heat sensitive and adds heat itself, which makes overclocking a lot more annoying if it isn't done after every other timing. From what I've seen among most reasonable suggestions, the safe area is about 2-3x the JEDEC spec. That is: tREFI = (Frequency / 2) × 7.8 (the JEDEC baseline) × 2-3.
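
As a worked example of that formula (DDR4-3600 is just an illustrative transfer rate):

```python
def trefi_range(ddr_rate_mts, low=2, high=3):
    """JEDEC tREFI is 7.8 us, i.e. 7.8 * memory clock (MHz) in cycles;
    the suggested daily window is 2-3x that baseline."""
    jedec_cycles = (ddr_rate_mts / 2) * 7.8
    return round(jedec_cycles * low), round(jedec_cycles * high)

# DDR4-3600: the baseline is 1800 * 7.8 = 14040 cycles,
# so roughly 28080-42120 is the suggested daily window.
print(trefi_range(3600))  # -> (28080, 42120)
```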

In a similar light, its sister timing tRFC also increases heat when tightened.

For good practice, the RAM should be flushed whenever there is a BSOD or errors are detected in TM5. Latent errors can get left behind in the RAM, causing subsequent tests to throw errors as well (even with stable overclocks). To do so, shut down and disconnect the power for at least 30-60 seconds.

I highly suggest recommending 1usmus' config as another option for the TM5 config. He is the creator of the famous Ryzen DRAM Calculator, and it is a very thorough test that is quicker than anta777's and also incredibly accurate.

In TM5, if you can pass one or two cycles but get errors in a later cycle, that is almost always indicative of a temperature issue. Add a fan or loosen timings to reduce heat and retest; loosening tWR, tRFC, and tREFI is a good way to do so.

Here is a link to a helpful resource for diagnosing TM5 errors. Veii, who currently has one of the best Ryzen 5000 series RAM overclocks, has thoroughly tested and detailed what each error likely implies. To access his notes, look through the drop-down list labelled "TM5 ERROR". Although Veii works on AMD's Ryzen platform, his notes are not platform specific.

On a side note, I believe you should make a thorough section about RTLs and IOLs, simply because they drastically increase bandwidth and lower latency. When they are properly set, it is almost equivalent to dropping tCL by 1-2. Unlike other timings, they are linked, so you must lower both the RTL and the equivalent IOL together, or else it won't POST. Here is a guide that clearly summarizes how they work.

integralfx commented 3 years ago

tRCD+tRP is both voltage and heat sensitive, which means that there is a sweet spot effect with VDIMM.

That's dependent on the IC.

For good practice, the RAM should be flushed whenever there is a BSOD or errors are detected in TM5. Latent errors can get left behind in the RAM, causing subsequent tests to throw errors as well (even with stable overclocks). To do so, shut down and disconnect the power for at least 30-60 seconds.

AFAIK errors are when you write a value and the value read back isn't the same. The values themselves aren't errors, but just numbers. Not sure what you mean by "errors can get left behind".

I highly suggest recommending 1usmus' config as another option for the TM5 config. He is the creator of the famous Ryzen DRAM Calculator, and it is a very thorough test that is quicker than anta777's and also incredibly accurate.

I did try 1usmus' config but anta777's was faster at finding errors in my experience. It's a decent config nonetheless.

Here is a link to a helpful resource for diagnosing TM5 errors.

Some of the descriptions are a bit vague, and with the way the guide suggests tightening timings, whatever timing was changed last will be the cause anyway.

On a side note, I believe you should make a thorough section about RTLs and IOLs, simply because they drastically increase bandwidth and lower latency. When they are properly set, it is almost equivalent to dropping tCL by 1-2. Unlike other timings, they are linked, so you must lower both the RTL and the equivalent IOL together, or else it won't POST. Here is a guide that clearly summarizes how they work.

I agree, but I don't have an Intel setup so that's out of scope for now. The guide seems good, but as it's machine translated I wouldn't vouch for the accuracy of the information.

nitorita commented 3 years ago

What I meant by "errors left behind" is that, if I get BSODs or errors in TM5, then restart the PC and change the bad timing to even a known-stable value, TM5 can still throw an error even though it never did before. This is just a flaw in how RAM works; the RAM doesn't always flush clean. You have to completely disconnect the power so the RAM can clear out. This is important to mention, as people may quickly change timings to other values to test, but then can't figure out why memory tests continue to fail even though some of those values might actually be stable.

Most on Overclock.net follow more or less the same method of tightening RTLs and IOLs, so they can vouch for its accuracy. If anything, the current (older) guide you linked is frankly much more confusing to follow and shouldn't be referred to. Considering how significant a performance boost the RTLs and IOLs give, I don't see why they should be omitted from the guide. Again, I've simplified it already; you can copy it verbatim if you wish:

Set all IOLs to the same value and gradually reduce them together by 1. Since the RTLs are linked to the IOLs, they must be reduced by the same amount as well.

It's a lot simpler than it sounds, as motherboards will only accept specific RTL/IOL values and absolutely refuse to POST otherwise. RTL/IOLs cannot be reduced on their own.

Some motherboards might not like the RTL/IOLs being manually reduced, in which case the only alternative is to raise the IOL Offset instead. This gives a similar result but should be used as a last resort, as it doesn't always play nice with RAM.

As for the Ryzen Calculator resource for diagnosing TM5 errors: although the descriptions may sound vague, they are still better than blindly guessing what the issue is, since a tightened timing may not necessarily be unstable but merely not properly tweaked alongside other settings. It doesn't hurt to include a link for people interested in checking it out for reference.

integralfx commented 3 years ago

tRRD_S/tFAW can go all the way down to 1/4 or 2/8, but Read speeds take a steep penalty at those settings, and Copy speed is slightly reduced as well. Latency also increases slightly. However, if you set tRRD_S/tFAW to 3/12, Read speeds will increase by a decent amount and latency will drop significantly compared to 4/16. I suggest 3/12 as the Extreme target.

tWTR_S can go down to 1, but it performs basically the same as 2 while having incredibly volatile latency; 2 provides the best performance overall. tWTR_L can go down to 4 for better performance (that's the lowest I could test; it may go lower for others).

tRTP below 8 incurs an insane latency penalty. 8 should be the absolute lowest value to strive for.

MLC and AIDA64 results: [benchmark screenshots attached in the original issue]

Doesn't seem to make much of a difference and in some cases performance was slightly worse.

Much thanks to Tyllo for testing this.

I'll merge the changes now as they include much-needed fixes and info, so feel free to make a new issue with proof of your claims.