skot / ESP-Miner

A bitcoin ASIC miner for the ESP32
GNU General Public License v3.0

Firmware 2.1.9 stops working after several hours with "Serial RX Invalid 11" error #268

Closed HypeLaser closed 1 month ago

HypeLaser commented 1 month ago

After updating to 2.1.9, devices seemed to settle after an hour, running at or above their target hash.

However, at 8hrs in I noticed three of my four devices had dramatically dropped hash rates (80 GH/s or so) and were logging the following error messages:

₿ (37720852) bm1366Module: Serial RX invalid 11
₿ (37720852) bm1366Module: 35 3e 3d 8f aa 55 8a 01 f3 6f 02
₿ (37722862) bm1366Module: Serial RX invalid 11
₿ (37722862) bm1366Module: 43 69 13 82 aa 55 52 02 ba 55 01

All three of these devices were connected to the Public Pool address. The fourth, which is still running, is connected to Dutch.nl and seems fine.

Rebooting the devices has stopped the error, and the devices operate as normal. I await the 8hr mark to see if the errors appear again, or if they appear on my fourth and last device.

skot commented 1 month ago

which hardware version is this? what power supply are you using?

HypeLaser commented 1 month ago

1x Max, Board 204, Model BM1366
2x Supra, Board 401, Model BM1368

Using the power supplies that shipped with the devices.

HypeLaser commented 1 month ago

It took 38 minutes for the error to show again on one of the Supras, and now I can also see the hash rate on the Max is wildly over...

Screenshot 2024-08-05 at 23 41 58
skot commented 1 month ago

there are many different power supplies depending on when and where you got your Bitaxe... can you give me some more details? I think this may be a power issue.

skot commented 1 month ago

maybe you could also post a screenshot of the dashboard from all your devices when this problem is happening.

HypeLaser commented 1 month ago

Power supply failures on three devices on the same day seem unlikely, but worth investigating for sure. All devices were bought from Bitcoin Merch, the Ultra (not Max!) in Feb 2024 and the Supras in March 2024.

skot commented 1 month ago

(fwiw 204 is an ultra). How did you update the firmware on all these Bitaxe?

HypeLaser commented 1 month ago

Thank you for letting me know it is an Ultra. It is confusing keeping track of what the devices are.

I manually clicked the GitHub links on the Settings page of each device's internal web page. I updated the firmware and the website from the files that downloaded from GitHub.

Screenshot 2024-08-05 at 23 54 47
skot commented 1 month ago

ok, you said only the public-pool.io pointed Bitaxe were giving you trouble. maybe this is related to a bug in handling pool outages (of which there were a couple today). I'll set up a 204 and 401 pointed to PP and see if I can reproduce this.

HypeLaser commented 1 month ago

Just an overnight update.

The Dutch.nl Bitaxe is still running with no issues since upgrading to 2.1.9, and the three Public Pool devices have also run overnight since the reset with no issues.

I'm now suspecting the PP outage is what caused this, as you mentioned.

Out of curiosity, when PP came back online, should the devices have reconnected by themselves?

Georges760 commented 1 month ago

Seeing the aa 55 in the middle of the frame, it looks like there was a desync of the framing.

WantClue commented 1 month ago

I'll check the latest commits; maybe one of them changed something unintentionally.

HypeLaser commented 1 month ago

Two of the Public Pool devices have stopped hashing, same as before.

Attached are the dashboards, as requested. Same error "Serial RX invalid 11". These are both Supras, board 401.

Screenshot 2024-08-06 at 19 17 40

Also. Clicking the RESTART button doesn't restart. The device goes offline and I cannot access it again without turning the power off and on again.

Poncelas commented 1 month ago

Same issue here with the device on Public Pool. After some hours it shows 5234GH/s and is not working. I restarted it and it worked again. This has happened twice. I bought it 1 day ago and updated to 2.1.9.

jddebug commented 1 month ago

The same is happening to me on my 6 devices running 2.1.9.

skot commented 1 month ago

this just happened to me! BitaxeUltra 202 running v2.1.9 firmware. I have a suspicion that it happened with a public-pool outage overnight, but it's hard to know for sure. 2 other Bitaxes still running fine.

skot commented 1 month ago
image

this is the only indication from the dashboard, all other stats look good. Log shows tons of Serial RX invalid 11

₿ (149651229) bm1366Module: 9e aa 55 a8 00 51 94 02 55 05 dd
₿ (149651719) bm1366Module: Serial RX invalid 11
₿ (149651729) bm1366Module: 91 aa 55 08 00 68 4c 00 53 29 23
₿ (149653189) bm1366Module: Serial RX invalid 11
₿ (149653189) bm1366Module: 90 aa 55 70 02 0d 82 01 59 03 21
₿ (149653249) stratum_task: rx: {"id":null,"method":"mining.notify","params":["da1e63","4d9862332c27aaf3293856481f2ac7f71d6762c50001848e0000000000000000","02000000010000000000000000000000000000000000000000000000000000000000000000ffffffff1703ea0e0d5075626c69632d506f6f6c","ffffffff021d7df51200000000160014c64b1b9283ba1ea86bb9e7b696b0c8f68dad04000000000000000000266a24aa21a9ed4ca029d391eb165ea7dc0a0d4280ac260fd2ce2f86d678ff70640eeadeae0fed00000000",["f1cb32c85599d2c5a793b6ad6b11497f12c242e9055e3a312ebf1b62142d4e3a","e7d63762b78730203129046f64e1be4e4c077c67ad635fd7320306cdbfef9c23","1f7615138e4cc031c3d3139445a8f212b7e5c1c0c6d20d322b03fca02a404b63","cad7df613660d2b5c56650e1403e91d0fd96986fd7582b86bb7ed081b95ae7d9","3113d0a63c3ea791dcecbf6b46740341149f420aa1fc4fd4be49f8a21c8f026d","91167e0fb95a1638aa931b3de009a796417f835bc20899d792af0dca40348f99","109e75786424c730d25a3c58b1e431d0b7690dac7a55d2d5fd253a82f0d04be3","87a92f67ac5d54d85ffe7736e79812c7f761d7bbbde94907696bb2041fe7ef19","46fbd352426e8ff685452d9e7b1cb73e3ca9237d26b00727ff233305d714b27e","7b316f1cd2190e56cc885565ac56b76f3a25a283c8e45f6d378a9af8a2c3c857","b1caa3283a158982b4dd91e566ade8c6f36202a1e68ffdc28c25880d19898d14","afb0a547b132fdddc420539161ef884d941533032e7ca2f57bb5b4c20e97a99f"],"20000000","17031abe","66b39a61",false]}
₿ (149655149) create_jobs_task: New Work Dequeued da1e63
₿ (149656409) bm1366Module: Serial RX invalid 11
₿ (149656409) bm1366Module: 85 aa 55 c8 00 ef ce 00 66 5a 16
₿ (149660479) bm1366Module: Serial RX invalid 11
₿ (149660489) bm1366Module: 94 aa 55 74 00 9a ea 01 77 5f 77
₿ (149662859) bm1366Module: Serial RX invalid 11
₿ (149662859) bm1366Module: 96 aa 55 40 02 9f 4e 01 7d 7a 75
₿ (149663089) bm1366Module: Serial RX invalid 11
₿ (149663089) bm1366Module: 93 aa 55 32 01 d2 ae 02 7d 8a 75
₿ (149666159) bm1366Module: Serial RX invalid 11
₿ (149666159) bm1366Module: 81 aa 55 bc 03 2d 2d 02 0d 48 2d
₿ (149668049) bm1366Module: Serial RX invalid 11
₿ (149668049) bm1366Module: 97 aa 55 ca 02 6d 22 01 10 40 28
skot commented 1 month ago

Got it! I blocked all network traffic to my bitaxe 401.. it kept hashing on generated work in the queue for a while, until the ASIC just stopped sending nonces. I re-enabled network traffic to the Bitaxe, and it started mining for a while, but then got borked. Here is right where it stopped working (raw rx bytes shown);

rx: [AA 55 52 00 30 25 00 91 00 B1 93]

I (62437861) bm1368Module: Job ID: 48, Core: 41/1, Ver: 00162000
I (62437861) asic_result: Ver: 20162000 Nonce 25300052 diff 0.0 of 1000.
rx: [AA 55 7A 01 76 B4 00 8B 0B DB 9A]

I (62437871) bm1368Module: Job ID: 40, Core: 61/11, Ver: 017B6000
I (62437881) asic_result: Ver: 217B6000 Nonce B476017A diff 0.0 of 1000.
rx: [53 04 5E 1B 2E 8F AA 55 6C 00 75]

I (62437891) bm1368Module: Serial RX invalid 11
I (62437901) bm1368Module: 53 04 5e 1b 2e 8f aa 55 6c 00 75 
rx: [CA 00 46 0C D6 86 AA 55 92 01 D2]

I (62437911) bm1368Module: Serial RX invalid 11
I (62437911) bm1368Module: ca 00 46 0c d6 86 aa 55 92 01 d2 
rx: [66 02 0C 28 20 9D AA 55 7A 02 EE]

I (62437921) bm1368Module: Serial RX invalid 11
I (62437931) bm1368Module: 66 02 0c 28 20 9d aa 55 7a 02 ee 

now the hashrate has gone wild;

image
WantClue commented 1 month ago

So this might be caused by the DNS lookup and missing handling 🤔

Georges760 commented 1 month ago

In skot's bm1366Module log above (for example 9e aa 55 a8 00 51 94 02 55 05 dd), the aa 55 that should be at the beginning of the frame is shifted by 1 byte.

Georges760 commented 1 month ago

And in skot's bm1368Module capture above (for example 53 04 5e 1b 2e 8f aa 55 6c 00 75), the aa 55 is shifted by 6 bytes.

Georges760 commented 1 month ago

From a random Saleae capture of a BM1368 (thanks to yesterday's donor!), I can see this kind of nonce frame sent by the chip:

image

For whatever reason, the chip randomly sent a frame with 1 extra byte (all the other 10k+ nonce frames have the correct length).

So if this happens, the current ESP-Miner, which frames the RX stream with a fixed frame size, will never resync to the aa 55.
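
To illustrate the point, here is a minimal sketch comparing fixed-size framing with a parser that skips ahead to the next aa 55 when a chunk is misaligned. It is not ESP-Miner code: `find_preamble`, `count_valid_fixed` and `count_valid_resync` are hypothetical helpers that operate on an in-memory byte stream, and the 11-byte frame length is taken from the logs in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FRAME_LEN  11    /* nonce frame length seen in the logs */
#define PREAMBLE_0 0xAA
#define PREAMBLE_1 0x55

/* Hypothetical helper: index of the first AA 55 pair in buf, or -1 if none. */
static int find_preamble(const uint8_t *buf, size_t len)
{
    assert(buf != NULL);
    for (size_t i = 0; i + 1 < len; i++) {
        if (buf[i] == PREAMBLE_0 && buf[i + 1] == PREAMBLE_1) {
            return (int)i;
        }
    }
    return -1;
}

/* Fixed-size framing: cut the stream every FRAME_LEN bytes and count the
 * chunks that start with the preamble. One extra byte shifts every later chunk. */
static int count_valid_fixed(const uint8_t *stream, size_t len)
{
    int valid = 0;
    for (size_t off = 0; off + FRAME_LEN <= len; off += FRAME_LEN) {
        if (stream[off] == PREAMBLE_0 && stream[off + 1] == PREAMBLE_1) {
            valid++;
        }
    }
    return valid;
}

/* Resyncing framing: when a chunk does not start with the preamble,
 * skip forward to the next AA 55 instead of consuming FRAME_LEN bytes. */
static int count_valid_resync(const uint8_t *stream, size_t len)
{
    int valid = 0;
    size_t off = 0;
    while (off + FRAME_LEN <= len) {
        if (stream[off] == PREAMBLE_0 && stream[off + 1] == PREAMBLE_1) {
            valid++;
            off += FRAME_LEN;
        } else {
            int skip = find_preamble(stream + off, len - off);
            if (skip < 0) {
                break;              /* no preamble left in the stream */
            }
            off += (size_t)skip;
        }
    }
    return valid;
}
```

Applied to the invalid frames logged above, `find_preamble` returns 1 for the bm1366 frames and 6 for the bm1368 frames. One caveat of this sketch: the bytes aa 55 can also occur inside a payload, so a real parser would likely want an additional check (for example a CRC, if the frame carries one) before trusting the new alignment.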

skot commented 1 month ago

I made a change to the serial parser in BM1366.c and BM1368.c so that it flushes the buffer after any invalid serial RX (ie doesn't start with AA 55). From my testing so far it seems to be working. https://github.com/skot/ESP-Miner/tree/serialrx11_fix

I also have been keeping an eye on the size of the serial buffer. it seems like at some point esp-miner stops emptying the ESP32 serial RX buffer.. need to figure out why that happens.
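
The flush-on-invalid idea described in this comment can be sketched roughly as follows. This is not the code from the serialrx11_fix branch: `rx_buf_t`, `rx_push`, `rx_read`, `rx_flush` and `receive_frame` are hypothetical stand-ins that simulate the UART RX buffer in memory, so the behaviour can be tried off-device. On the ESP32 the flush step would presumably map to an ESP-IDF call such as `uart_flush()`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FRAME_LEN 11
#define RX_CAP    64

/* Hypothetical in-memory stand-in for the UART RX buffer. */
typedef struct {
    uint8_t data[RX_CAP];
    size_t  len;
} rx_buf_t;

/* Simulates bytes arriving from the ASIC; returns how many were stored. */
static size_t rx_push(rx_buf_t *rx, const uint8_t *src, size_t n)
{
    assert(rx != NULL && src != NULL);
    size_t room = RX_CAP - rx->len;
    if (n > room) {
        n = room;
    }
    memcpy(rx->data + rx->len, src, n);
    rx->len += n;
    return n;
}

/* Removes up to n bytes from the front of the buffer. */
static size_t rx_read(rx_buf_t *rx, uint8_t *dst, size_t n)
{
    if (n > rx->len) {
        n = rx->len;
    }
    memcpy(dst, rx->data, n);
    memmove(rx->data, rx->data + n, rx->len - n);
    rx->len -= n;
    return n;
}

/* Discards everything currently buffered. */
static void rx_flush(rx_buf_t *rx)
{
    rx->len = 0;
}

/* Reads one frame; on a bad preamble, drops all buffered bytes so the
 * next read starts from whatever arrives after the flush. */
static bool receive_frame(rx_buf_t *rx, uint8_t out[FRAME_LEN])
{
    if (rx->len < FRAME_LEN) {
        return false;               /* wait for a full frame */
    }
    rx_read(rx, out, FRAME_LEN);
    if (out[0] != 0xAA || out[1] != 0x55) {
        rx_flush(rx);
        return false;
    }
    return true;
}
```

Note that a flush only restores alignment if the next byte to arrive is the start of a frame. If the ASIC is mid-frame at the moment of the flush, the following read would be invalid again and trigger another flush, so in this sketch recovery may take more than one attempt.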

skot commented 1 month ago

I added a fix for this and some other memory leaks in https://github.com/skot/ESP-Miner/tree/219-leak_hunting

esp-miner.bin.zip

give it a try and see how it holds up!

HypeLaser commented 1 month ago

Thank you for your efforts. I have updated the firmware to your version above, and will keep an eye and see what happens.

HypeLaser commented 1 month ago

Update: So far they have run for 24hrs with no reboots and no Serial RX errors.

HypeLaser commented 1 month ago

Further update: The three devices connected to Public Pool are still running; however, the one connected to Dutch.nl got stuck and I had to reset it. The logs only show "http_server: Handshake done, the new connection was opened".

Also, as a side note, the three Public Pool devices have been running solidly for over two days. But I've noticed they're not hitting difficulties any higher than 44 million. Nonce issue?