rdugan / iceriver-oc

Modified firmware for IceRiver ASICs, adding overclocking and other goodies

IceRiver Overclocking Firmware

Modified firmware for all IceRiver ASICs, adding clock and voltage control, sensor graphing, properly secured login and API access, and other goodies.

Customizable OC/OV, small fee benefitting the community, no unnecessary changes to your device.

Firmware files can be downloaded from the Releases section on the right hand side of this page.

If you have any issues, finding me (pbfarmer) in the Kaspa Discord will probably result in the fastest response/resolution.




Special Thanks

None of these firmwares would be possible without a number of people's efforts in testing and feedback.

However, one person in particular has sacrificed his machines from the beginning, granting me direct access for development, allowing me to risk his machines while testing brand new features, and suffering numerous mining interruptions during the frequent updates and restarts.

This person goes by the Discord handle Onslivion - it would be great if you could drop him a thanks on the Kaspa Discord, and maybe even send him a tip or some of your hashrate:

kaspa:qzh2xglq33clvzm8820xsj7nnvtudaulnewxwl2kn0ydw9epkqgs2cjw6dh3y


Fee


Known Issues


Features

Configuration additions and updates


Configurable clock and voltage offset

Performance Settings

Clock and voltage settings have been added to the 'Miner' page. Clock can be increased/decreased to any integer value (within hardware limits). Changes take effect immediately without restart, but note that clock increases are applied gradually, in increments of 25MHz per 30s. As a result, it may take some time to get to full speed, possibly even ~10 minutes, depending on how large an offset you choose.
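As a rough illustration of the ramp described above, the sketch below estimates how long a clock raise takes to fully apply. It assumes the offset is expressed in MHz and that a final partial step still takes a full 30s; the function name is hypothetical and not part of the firmware.

```python
import math

STEP_MHZ = 25      # clock raise applied per step (from the text above)
STEP_SECONDS = 30  # time between steps (from the text above)


def ramp_time_seconds(offset_mhz: int) -> int:
    """Estimate the seconds needed until a clock raise is fully applied."""
    if offset_mhz <= 0:
        # The text only describes increases as being ramped.
        return 0
    steps = math.ceil(offset_mhz / STEP_MHZ)  # assumption: partial steps count as full steps
    return steps * STEP_SECONDS


print(ramp_time_seconds(500) / 60)  # 10.0 (minutes for a 500MHz raise)
```

A 500MHz raise works out to 20 steps, which lines up with the ~10 minute figure mentioned above.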

Voltage can also be increased/decreased to any integer value (within hardware limits), with changes taking effect immediately. Settings will be rounded down to the nearest multiple of 6.25mV internally for everything but KS0 Pro. A simple model to keep in mind is that for every 25mV increase, the proper increments are 7mV-6mV-6mV-6mV, or for example, 7, 13, 19, 25 for the first 25mV.

For KS0 Pro, voltage can be adjusted in 2mV increments.
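The 6.25mV rounding rule for non-KS0 Pro models can be sketched as follows. This is only an illustration of the arithmetic: the function names are hypothetical, and negative offsets are left out because the text does not say how they are rounded.

```python
import math

GRID_MV = 6.25  # internal step for all models except KS0 Pro (from the text above)


def effective_offset_mv(setting_mv: int) -> float:
    """Round a non-negative integer setting down to the 6.25mV grid."""
    if setting_mv < 0:
        raise ValueError("negative offsets are not covered by this sketch")
    return math.floor(setting_mv / GRID_MV) * GRID_MV


def ladder(count: int) -> list[int]:
    """Smallest integer settings that reach each successive grid step."""
    return [math.ceil(k * GRID_MV) for k in range(1, count + 1)]


print(ladder(8))                # [7, 13, 19, 25, 32, 38, 44, 50]
print(effective_offset_mv(10))  # 6.25, the same result as entering 7
```

The ladder reproduces the 7-6-6-6 pattern, and shows that any setting between two ladder values has the same effect as the lower one.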

VOLTAGE CONTROL IS NOT AVAILABLE FOR KS3/M/L AT THIS TIME.


IMPORTANT: THERE ARE CURRENTLY NO GUARDRAILS, AND NO LIMITS ENFORCED BY THIS SOFTWARE ON EITHER CLOCKS OR VOLTAGE, SO USE WITH CARE.


Better fan controls

Performance Settings

A new fan mode has been added which automatically adjusts fan speed to maintain both max hash chip and board temperatures. Temps are read every 10s and fan speed is adjusted as necessary.

Please note, this setting does not guarantee the set temperature. It may be exceeded by up to ~5C during startup or other dynamic periods, but it should stabilize at or near the requested temperature.

If you find the target temps are exceeded beyond your comfort during startup or other dynamic periods, you should increase the min fan speed.

Fixed fan speeds will also now be reapplied at startup, after a ~1-2m delay, though it is a one-time application. This means that if the underlying IceRiver software decides to change the fan speed again for some reason, this mode will not re-apply your setting. Consider using the 'Target Temp' mode with an appropriate min fan speed as an alternative.



Additional telemetry and other changes to home page


Graphing of chip metrics and longer term hashrates

Home Page

Two hours of graphing has been added for all chip metrics, with filters for summaries (per board min/max/avg), board, or all chips.

Chip temps of 80C appear to result in ideal hashrate performance (though this may be difficult on KS0/Pro without cooling mods). No guidance has been provided by IceRiver as to safe chip temp limits, but their miner software appears to restrict clock raises above 95C, and will actually throttle clocks above 110C. At the least, following general guidance for GPUs/CPUs is probably prudent (e.g. >90C warning zone, >95C danger zone, >105C critical zone).
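If you are scripting your own monitoring, the zones suggested above could be encoded along these lines. The thresholds come from the general guidance in the previous paragraph, not from IceRiver, and the function name and the "ok" label are hypothetical.

```python
def chip_temp_zone(temp_c: float) -> str:
    """Classify a hash chip temperature using the zones suggested above."""
    if temp_c > 105:
        return "critical"
    if temp_c > 95:
        return "danger"
    if temp_c > 90:
        return "warning"
    return "ok"  # label chosen for this sketch


for temp in (80, 92, 97, 108):
    print(temp, chip_temp_zone(temp))
```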

Please note that real-time voltage will never match your setting - drivers under load experience voltage drop, meaning the running voltage will always be below your voltage setting, with more load causing a greater drop. Chip voltage will be replaced by power draw for KS5L/M, as there is no chip voltage reading available. A software limit of 3350W is enforced on these models, where cores will be disabled in groups of 4 should you exceed this limit.

Board temp graphs have been added for all models, which include intake and exhaust sensor temps, as well as power stage (driver) temps for KS0/Pro/Ultra, KS1, and KS2. In summary mode, the max power stage temp is shown for each board, while in board mode, the max power stage temp is shown for each group/controller (PSG). Max recommended operating temp is 125C according to the chip documentation, though it is probably wise to keep a healthy margin below this temp.

Please be aware that temperature is not the only consideration for healthy operation. Power/current draw is also a concern, for which we don't currently have visibility or specifications.

Hashrate graphing (as well as the headline stats) now includes 30m and 2hr tracking, and also includes board level filtering.

Mouseover tooltips have been synchronized across all graphs, to help with diagnosing issues / anomalies.

Instantaneous values are shown in the legend, and individual lines can be disabled/enabled by clicking on the labels. Graph scales are no longer zero based, and adjust depending on which lines are displayed, meaning they are no longer artificially flattened by poor resolution, and you can actually see the variability in each measurement.

Hopefully this helps clear up how variable 5m readings really are.


Uptime and job rate on pool status

Home Page

The uninterrupted uptime, and job issuance rate are added to the pool stats section. Job rate is simply an additional health indicator of a pool connection - currently job rates for the Kaspa network should be around 1 per second (soon to be 10/s with Rust deployment) with a variation of roughly +/- 15%. While job rates consistently higher or lower than this should not technically affect your earnings due to Kaspa's block acceptance policy (assuming the pool is not unnecessarily rejecting 'old' shares), it is a signal that the pool may not be functioning properly, and you may want to alert the pool operator, or possibly find another option.

It has been communicated by kaspa-pool operators that they intentionally reduce job rate to limit overhead, and that it doesn't affect stale share rates in their case.
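As a minimal sketch of the job rate guideline above, the check below flags a rate that falls outside the expected band. The 1 job/s target and 15% tolerance are taken from the text; the function name is hypothetical, and as noted, some pools run lower job rates on purpose.

```python
def job_rate_in_range(jobs_per_second: float,
                      expected: float = 1.0,
                      tolerance: float = 0.15) -> bool:
    """Return True if the job rate is within +/- tolerance of the expected rate."""
    return abs(jobs_per_second - expected) <= expected * tolerance


print(job_rate_in_range(1.1))   # True
print(job_rate_in_range(0.5))   # False
```

Pass a different `expected` value if the network's nominal job rate changes.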

Multiple status indicators have been added to the pool section to help diagnose different network / pool issues. A gray busy (spinning) icon indicates the asic is attempting to connect to the pool. A green busy icon indicates a network connection, but no stratum connection yet. A yellow warning icon indicates a successful stratum connection, but no jobs have been received.


API

While the previously available API on port 4111 is still available, a new rationalized API including all of the additional features from the UI is now available over https (port 443).

Full documentation is available in json format.


General UI improvements


Optional commercial features

A number of features have been added to a separate 'commercial' build intended for hosting or other large deployments. These builds include a 'c' after the version number (e.g. pbv081c_ks5mupdate.bgz), and currently have an additional 0.33% fee (1.33% vs the standard 1%).

Multiple user management

In addition to the standard primary/admin user, multiple users w/ differing access permissions can be added. This allows for setups where, for example, the machine owner can be granted direct access to the machine, with permissions to view the main monitoring page and change pool configurations, while being restricted from changing network, fan or clock/voltage parameters.


Hashrate splits

The hashpower of the ASIC can be split to multiple endpoints based on a configurable percentage, to allow setting up hosting fees. The number of splits is not limited, but please keep in mind that the firmware will maintain a pool/stratum connection for each split, which multiplies the incoming traffic.

This feature can also be used for splitting hashrate across multiple KHeavyHash coins at once.
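To get a feel for what a percentage split means in hashrate terms, here is a small sketch. The endpoint names are made up, and the requirement that percentages add up to 100 is an assumption of this example rather than documented firmware behavior.

```python
def split_hashrate(total_ths: float, splits: dict[str, float]) -> dict[str, float]:
    """Divide a total hashrate across endpoints by percentage."""
    total_pct = sum(splits.values())
    if abs(total_pct - 100.0) > 1e-9:
        # Assumption for this sketch: the configured splits cover all hashpower.
        raise ValueError(f"splits add up to {total_pct}%, expected 100%")
    return {name: total_ths * pct / 100.0 for name, pct in splits.items()}


# Hypothetical hosting setup: 10% of a 1TH/s machine goes to the host.
print(split_hashrate(1.0, {"owner": 90.0, "host": 10.0}))
```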


Branding/Logo replacement

The 'PbFarmer' logo can be replaced with a branding image of your choice. The image format should be a 112x60 PNG.



Stability and security improvements

Primary pool health monitor

A health-check loop monitors primary pool availability. If the miner has switched to one of the secondary pools for any reason, it will be switched back to your primary pool as soon as it becomes available again.


Fix for web server crashes

Replaced stock web server with updated and production environment targeted version, added cache/memory control configuration, and fixed memory leaks. This should address the issues seen by users of HiveOS and other external monitoring tools that caused the web server to crash after too many page loads (resulting in the ASIC UI being unavailable.)


New auth/auth routines

The authentication and authorization controls have been completely replaced, and all traffic redirected over https. This means forwarding the http(s) traffic through your firewall for off-site monitoring should be much safer (though I would still not necessarily recommend this - simply due to best security practices...) Login is no longer transmitted over unsecured http, and people can no longer hijack your asic simply by setting a cookie to skip login. The random 'login incorrect' messages due to file system corruption should also be a thing of the past. Please keep in mind this will mean your password will be reset to stock default after first installing. Also, the first boot after installation will take 2+ minutes, as the machine generates the TLS certificates.

Additionally, the redesigned API has been secured w/ an access token, through which granular permissions can be assigned. Tokens should be included with API requests in a header of the form 'Authorization: Bearer <token>'.

Account Page

Just as you would update the login password, PLEASE DELETE/REPLACE THIS API TOKEN if you plan on exposing your machine publicly, as it is the same across all machines by default.
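A minimal sketch of attaching the token to a request with Python's standard library is shown below. The host address, the `/api/example` path, and the token value are placeholders, not real endpoints; consult the API documentation on your machine for actual paths. The optional `tls_context` helper assumes you have saved the ASIC's CA certificate to a local file.

```python
import ssl
import urllib.request


def build_api_request(host: str, path: str, token: str) -> urllib.request.Request:
    """Build an https request carrying the API token as a Bearer header."""
    url = f"https://{host}{path}"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


def tls_context(ca_cert_path: str) -> ssl.SSLContext:
    """Trust the ASIC's self-generated CA instead of disabling verification."""
    return ssl.create_default_context(cafile=ca_cert_path)


# Placeholder host, path, and token for illustration only.
req = build_api_request("192.0.2.10", "/api/example", "my-token")
print(req.get_header("Authorization"))  # Bearer my-token

# To actually send it (requires a reachable ASIC and its CA certificate):
# urllib.request.urlopen(req, context=tls_context("ca.pem"))
```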


TLS certificate management

The TLS certificates (and certificate authority) for https are automatically generated on the ASIC, meaning they will cause 'Not secure' warnings in your browser since they are not from a well known authority. While harmless, these warnings can be annoying, so the firmware provides the ability to download the CA certificate so it can be uploaded into your browser's certificate store.

TLS Certs

To do so in Chrome, for instance, go to chrome://settings/security, click on 'Manage Certificates', select the 'Trusted Root Certification Authorities' tab (or just 'Authorities' for Linux), and click on the import button. After restarting your browser, you should no longer see the 'Not secure' warning.

If you have multiple ASICs, you will have a different CA for each by default. However, instead of adding each of these to your browser(s) or other devices, you can propagate a single CA across all ASICs by downloading both the CA certificate and CA key from one ASIC, uploading both files to all of your other ASICs, then regenerating the certificate on each of those other ASICs.

If you access your ASIC via a domain name or multiple IPs, you can also add these to the TLS certificate by listing them in the 'Regenerate certificate' field and clicking 'regenerate'.


Healthcheck loop

A healthcheck loop has been added, which will automatically restart the miner or web server should either crash for any reason.

Additionally, the 'reset' executable that has been found to randomly disappear from people's machines (even stock setups) is now packaged with the firmware, and a healthcheck loop has been added to replace/restart the file if necessary. This should address the 30m reboot loops many people are experiencing.


Installation

DO NOT install over the xyys (including tswift branded) firmware on KS0 Ultras or KS5* models. Please make sure to follow his uninstallation instructions before installing this or any other firmware!

This is a standard firmware update package, including/improving on the latest IceRiver firmware, and applied just as official firmware would be. Applying over any previous updates should work for KS0/Pro, KS1, KS2, and KS3 models. Applying over stock, or previous versions of this firmware should also work for KS0 Ultra, and KS5 models.

However, if you run into problems, try the following process:

Also, make sure to redo your pool settings, as they will have been reset to the default Kaspa Dev Fund address.


Usage Tips

Power and metering

Laptop power supplies for KS0/Pro/Ultra models should generally be 19.5V with 5.5mm x 2.5mm connectors, but the amp rating can vary depending on your OC targets. However, barrel connectors of this size tend to be rated for either 5A or 10A, and it is unlikely IceRiver used 5A options, so it would be a reasonable assumption that they used 10A (7.5A is another possibility). This means that any adapter over 200W is likely exceeding the rating of the socket, such that the plug could melt or even catch fire, if not actively cooled (even then the risk remains). Please be extremely careful should you choose to use one of the higher power laptop charger options.
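The ~200W figure above follows from simple volts times amps arithmetic, sketched here for the connector ratings mentioned. The actual rating of the socket IceRiver used is not known, so the candidate values are the assumptions from the text.

```python
ADAPTER_VOLTS = 19.5  # typical laptop supply voltage for KS0/Pro/Ultra (from the text above)


def connector_limit_watts(rated_amps: float, volts: float = ADAPTER_VOLTS) -> float:
    """Power a barrel connector can carry at its rated current."""
    return volts * rated_amps


for amps in (5, 7.5, 10):
    print(f"{amps}A connector: {connector_limit_watts(amps)}W")
# 5A connector: 97.5W
# 7.5A connector: 146.25W
# 10A connector: 195.0W
```

Even under the most generous 10A assumption, the socket tops out just under 200W.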

It is highly recommended you have a power meter attached to your machines, to ensure you are within your PSU limits. This is especially true for KS3 and KS5 models, which have very little PSU headroom even at stock settings, as well as KS0* models due to the wide range of power supplies.


Cooling

KS0 Pro and Ultra models need special attention to cooling. The power stages on these already run very hot, so hardware modifications for improved cooling are highly recommended - including heatsinks, and better airflow.

Hash chips on all models tend to perform best in the 75-80C range, but this is especially true for the KS0 Ultra, where I've experienced a drop of > 3% in the 2hr hashrate just from reducing temps from 80C to 75C.


Tuning

CLOCK OFFSET PERCENTAGE AND HASHRATE INCREASE PERCENTAGE SHOULD BE EQUAL ON A HEALTHY MACHINE.

E.g. if your clock offset is 30% on a KS1, then your hashrate should be 1.3TH/s, or 30% more than the default 1TH/s. If this is not the case (over an appropriate measurement window), then it means your chips are starved for voltage.
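The 1:1 relationship above can be written out as a quick calculation. This is just the arithmetic from the example, with hypothetical function names; the measured value should come from a long enough window to be meaningful.

```python
def expected_hashrate(stock_ths: float, clock_offset_pct: float) -> float:
    """Hashrate a healthy machine should reach for a given clock offset."""
    return stock_ths * (1 + clock_offset_pct / 100)


def shortfall_pct(stock_ths: float, clock_offset_pct: float, measured_ths: float) -> float:
    """How far (in percent) the measured hashrate falls below the expected one."""
    expected = expected_hashrate(stock_ths, clock_offset_pct)
    return (expected - measured_ths) / expected * 100


# KS1 at a 30% clock offset, stock 1TH/s.
print(expected_hashrate(1.0, 30))              # 1.3
print(round(shortfall_pct(1.0, 30, 1.2), 1))   # 7.7
```

A shortfall well beyond the measurement noise suggests the chips need more voltage.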

Proper tuning is a process that takes time. Using other people's settings is generally not a great idea, as every machine is different. Best practice is to start at a conservative clock offset that results in a matching hashrate increase with no voltage changes. As you further raise your clocks in small increments (e.g. 25MHz or less), once you no longer see hashrate respond 1:1 (or maybe even start dropping), it is an indication that more voltage is needed.

At that point, increase voltage by a single step (2mV for KS0 Pro, 7 or 6mV depending on current level for all other models), then see if hashrate responds. If it does, and once again equals clock offset on a percentage basis, go back to raising clock. Continue this back and forth between clock and voltage offsets until you reach your desired hashrate, while being mindful of temperature and power limits.
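The back and forth described above boils down to one decision per measurement, sketched below. The 1% tolerance is an assumption chosen for this example (roughly the noise of a 2hr reading), the function name is hypothetical, and temperature and power limits still need to be checked separately.

```python
def next_tuning_action(clock_offset_pct: float,
                       hashrate_gain_pct: float,
                       tolerance_pct: float = 1.0) -> str:
    """Suggest the next manual tuning step from the latest measurement."""
    if hashrate_gain_pct >= clock_offset_pct - tolerance_pct:
        # Hashrate is tracking the clock 1:1, so voltage is sufficient for now.
        return "raise clock"
    # Hashrate is lagging the clock offset, which points to voltage starvation.
    return "raise voltage one step"


print(next_tuning_action(20, 19.6))  # raise clock
print(next_tuning_action(30, 25.0))  # raise voltage one step
```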

While 5m and 30m hashrates in the GUI are useful tools for directional guidance after the machine has had time to ramp up, final hashrate measurements should be done over an extended time period. 5 minute hashrate readings are quite variable, and even 30 minute hashrate readings aren't great, as you can still have a couple percent variability. The 2hr reading in the UI should have less than 1% variability from my experience (may be slightly above 1% on KS5L/M and KS0 Ultra), though it doesn't take hardware errors / pool rejections into account.


Reproducing results from other firmware

And finally, if you are trying to replicate the OC results of another firmware...


Let's Talk About Hashrates

All OC firmware, including this one, only control clocks and voltages. It is my experience that given the necessary voltage, the hashrate responds linearly on a 1:1 basis to the clock change, percentage-wise. But in the end, all we can do is change the clock and hope the ASIC responds with the expected hashrate change.

Hashrate readings in the ASIC UI are not like those from CPU/GPU mining. IceRiver ASICs are not counting actual hashes - they are simply estimating hashrate based on the number of shares produced * difficulty. This is exactly how a pool measures your hashrate, but the problem is most pools decided to use way too high of a difficulty for IceRiver ASICs, which prevents reliable short term hashrate measurements - with a high diff, the share rate is low, which means wild swings in hashrate. As a result, IceRiver released a firmware update which started using a completely different, lower internal difficulty for hashrate measurements on their own dashboard.
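To make the shares * difficulty idea concrete, the sketch below expresses difficulty as the expected number of hashes per share, which sidesteps any pool-specific scaling constant. The numbers are illustrative only and the function names are hypothetical.

```python
def estimated_hashrate(shares: int, hashes_per_share: float, seconds: float) -> float:
    """Estimate hashrate (H/s) from the shares found in a time window."""
    return shares * hashes_per_share / seconds


def expected_shares(hashrate: float, hashes_per_share: float, seconds: float) -> float:
    """Average number of shares a given hashrate produces in a time window."""
    return hashrate * seconds / hashes_per_share


# A 1TH/s machine over a 5 minute window at two illustrative difficulties.
print(expected_shares(1e12, 1e13, 300))  # 30.0 shares at the higher difficulty
print(expected_shares(1e12, 1e12, 300))  # 300.0 shares at the lower difficulty
print(estimated_hashrate(30, 1e13, 300))  # 1000000000000.0
```

With only ~30 shares in the window, a handful of lucky or unlucky shares moves the estimate by a large percentage, which is why a lower difficulty gives a steadier reading.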

Therefore, even for the same exact timeframe, you cannot reliably compare a pool hashrate measurement to the ASIC UI hashrate - they are not using the same data. To further exacerbate this, since IceRiver machines were generating a large number of invalid shares early on, a number of pools decided to stop reporting rejected shares back to the ASIC so users would stop complaining (or switching pools), and instead report them as accepted, while still silently rejecting them on their end. Depending on the true reject rate, this can mean a significant divergence between the ASIC hashrate and the pool hashrate, even if they were measured using the same timeframe and difficulty.

Regardless of the diff selected, hashrate measurements based on shares * difficulty are subject to swings based on luck. The lower the share count (higher the diff), the more luck affects the hashrate, and the wilder the swings. Thus, to have a statistically meaningful hashrate measurement, you need enough shares to reduce the effect of luck as much as possible. The 5m readings on the ASIC are not suitable for this, especially when trying to verify the result of single digit OC changes, and short term pool readings are even worse.

You need 1200 shares just to get to an expected variance of +/- 10% with 99% confidence. E.g. for an expected hashrate of 1TH/s, in 99/100 measurements after 1200 shares, you will have a reading between 0.9TH/s and 1.1TH/s. You need 4800 shares to reduce that variance to +/- 5%. Many pools are using difficulties that produce share rates in the ~5 shares/min range. Therefore, just to get a hashrate reading with an expected variance of <= +/- 10%, you would need a 1200 / 5 = 240 minute, or 4 hour reading. If you want a reading with an expected variance +/- 5%, you would need over 16 hours of data. You will never be able to confirm the results of an OC level below the expected variance of a given timeframe. For example, you cannot possibly determine whether a 5% OC is working properly in a 4 hr / 1200 share window having 10% expected variance. Even at 16hrs / 4800 shares, the expected variance can completely cancel out a 5% OC.
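The arithmetic above can be reproduced with a short sketch. It takes the 1200 shares for +/- 10% figure from the text as its anchor and scales it with the inverse square root of the share count, which is the usual behavior for count-based estimates; the confidence level attached to the anchor is taken from the text, not derived here. Function names are hypothetical.

```python
import math

ANCHOR_SHARES = 1200      # from the text: 1200 shares ...
ANCHOR_HALF_WIDTH = 0.10  # ... gives roughly +/- 10% expected variance


def expected_half_width(shares: int) -> float:
    """Expected +/- variance (as a fraction) after a given number of shares."""
    return ANCHOR_HALF_WIDTH * math.sqrt(ANCHOR_SHARES / shares)


def shares_needed(half_width: float) -> int:
    """Shares required to reach a target +/- variance."""
    return round(ANCHOR_SHARES * (ANCHOR_HALF_WIDTH / half_width) ** 2)


def hours_needed(half_width: float, shares_per_minute: float) -> float:
    """Measurement time required at a given share rate."""
    return shares_needed(half_width) / shares_per_minute / 60


print(shares_needed(0.05))    # 4800
print(hours_needed(0.10, 5))  # 4.0
print(hours_needed(0.05, 5))  # 16.0
```

Halving the variance always costs four times the shares, which is why small OC changes take so long to confirm at low share rates.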

And this leads to the crux of the issue - most pools do not provide anything higher than a 24hr measurement, which at ~5 shares/minute means roughly 7200 shares, which is still a 4% expected variance. You need 10K shares just for 3.3% variance, and about 100K shares for a 1% variance. The 30m reading in the ASIC UI should have around a 2% variance, and the new 2hr reading should have less than 1% variance, but neither reflects pool rejects. The only solution, then, is to find a pool that lets you set your own difficulty, so that you can generate a statistically relevant number of shares for their available timeframes. Herominers is one such pool that allows this.

The best option for setting your own diff and seeing long enough measurement timeframes is solo mining to your own node and the kaspa-stratum-bridge. The default vardiff settings will produce a minimum 20 shares/min, which is enough to have <= +/- 5% variance in 4hrs, and the dashboard (grafana) allows measurements in any timeframe/resolution you want, including significantly longer timeframes than 24hrs.

As a concrete example of the difference between valid and invalid measurements (as well as how kaspa-stratum-bridge can help), here are the hashrate readings of 3 machines using diffs producing >= 30 shares/min: a KS0 at 51% OC, a KS1 at 37% OC, and a KS3M at 1% OC. The measurements are, from top to bottom, 24hrs (>= 43K shares), 1hr (>= 1800 shares), and 30m (>= 900 shares). You can see how divergent the measurements can be from expected for the shorter timeframes:

[Figure: KS0, KS1, KS3M OC hashrates over 24hr, 1hr, and 30m windows]

In short, if you are trying to confirm the effects of a small OC on the ASIC UI, you will need to use the 2hr reading, but you won't know whether you're generating shares that would be rejected. To get the full picture, you will need a long term measurement from a pool that allows high share rates - and there are no options I can point to that can do this at this time, other than mining to your own node + kaspa-stratum-bridge.